1
|
Price MN, Arkin AP. Interactive tools for functional annotation of bacterial genomes. Database (Oxford) 2024; 2024:baae089. [PMID: 39241109 PMCID: PMC11378808 DOI: 10.1093/database/baae089] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/08/2024] [Revised: 07/29/2024] [Accepted: 08/09/2024] [Indexed: 09/08/2024]
Abstract
Automated annotations of protein functions are error-prone because of our lack of knowledge of protein functions. For example, it is often impossible to predict the correct substrate for an enzyme or a transporter. Furthermore, much of the knowledge that we do have about the functions of proteins is missing from the underlying databases. We discuss how to use interactive tools to quickly find different kinds of information relevant to a protein's function. Many of these tools are available via PaperBLAST (http://papers.genomics.lbl.gov). Combining these tools often allows us to infer a protein's function. Ideally, accurate annotations would allow us to predict a bacterium's capabilities from its genome sequence, but in practice, this remains challenging. We describe interactive tools that infer potential capabilities from a genome sequence or that search a genome to find proteins that might perform a specific function of interest. Database URL: http://papers.genomics.lbl.gov.
Affiliation(s)
- Morgan N Price
- Environmental Genomics & Systems Biology, Lawrence Berkeley National Laboratory, 1 Cyclotron Rd, Berkeley, CA 94720, United States
| | - Adam P Arkin
- Environmental Genomics & Systems Biology, Lawrence Berkeley National Laboratory, 1 Cyclotron Rd, Berkeley, CA 94720, United States
| |
|
2
|
Joachimiak MP, Caufield JH, Harris NL, Kim H, Mungall CJ. Gene Set Summarization Using Large Language Models. ARXIV 2024:arXiv:2305.13338v3. [PMID: 37292480 PMCID: PMC10246080] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/10/2023]
Abstract
Molecular biologists frequently interpret gene lists derived from high-throughput experiments and computational analysis. This is typically done as a statistical enrichment analysis that measures the over- or under-representation of biological function terms associated with genes or their properties, based on curated assertions from a knowledge base (KB) such as the Gene Ontology (GO). Interpreting gene lists can also be framed as a textual summarization task, enabling Large Language Models (LLMs) to use scientific texts directly and avoid reliance on a KB. TALISMAN (Terminological ArtificiaL Intelligence SuMmarization of Annotation and Narratives) uses generative AI to perform gene set function summarization as a complement to standard enrichment analysis. This method can use different sources of gene functional information: (1) structured text derived from curated ontological KB annotations, (2) ontology-free narrative gene summaries, or (3) direct retrieval from the model. We demonstrate that these methods are able to generate plausible and biologically valid summary GO term lists for an input gene set. However, LLM-based approaches are unable to deliver reliable scores or p-values and often return terms that are not statistically significant. Crucially, in our experiments these methods were rarely able to recapitulate the most precise and informative term from standard enrichment analysis. We also observe minor differences depending on prompt input information, with GO term descriptions leading to higher recall but lower precision. However, newer LLM models perform statistically significantly better than the oldest model across all performance metrics, suggesting that future models may lead to further improvements. Overall, the results are nondeterministic, with minor variations in prompt resulting in radically different term lists, true to the stochastic nature of LLMs. 
Our results show that at this point, LLM-based methods are unsuitable as a replacement for standard term enrichment analysis; however, they may provide summarization benefits for implicit knowledge integration across extant but unstandardized knowledge, for large sets of features, and where the amount of information is difficult for humans to process.
Affiliation(s)
- Marcin P Joachimiak
- Biosystems Data Science Department, Environmental Genomics and Systems Biology Division, Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, CA 94720, USA
| | - J Harry Caufield
- Biosystems Data Science Department, Environmental Genomics and Systems Biology Division, Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, CA 94720, USA
| | - Nomi L Harris
- Biosystems Data Science Department, Environmental Genomics and Systems Biology Division, Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, CA 94720, USA
| | | | - Christopher J Mungall
- Biosystems Data Science Department, Environmental Genomics and Systems Biology Division, Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, CA 94720, USA
| |
|
3
|
Dessimoz C, Thomas PD. AI and the democratization of knowledge. Sci Data 2024; 11:268. [PMID: 38443367 PMCID: PMC10915151 DOI: 10.1038/s41597-024-03099-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/11/2023] [Accepted: 02/28/2024] [Indexed: 03/07/2024] Open
Affiliation(s)
- Christophe Dessimoz
- Swiss Institute of Bioinformatics, Lausanne, Switzerland.
- Department of Computational Biology, University of Lausanne, Lausanne, Switzerland.
| | - Paul D Thomas
- Department of Population and Public Health Sciences, University of Southern California, Los Angeles, USA.
| |
|
4
|
David R, Rybina A, Burel J, Heriche J, Audergon P, Boiten J, Coppens F, Crockett S, Exter K, Fahrner S, Fratelli M, Goble C, Gormanns P, Grantner T, Grüning B, Gurwitz KT, Hancock JM, Harmse H, Holub P, Juty N, Karnbach G, Karoune E, Keppler A, Klemeier J, Lancelotti C, Legras J, Lister AL, Longo DL, Ludwig R, Madon B, Massimi M, Matser V, Matteoni R, Mayrhofer MT, Ohmann C, Panagiotopoulou M, Parkinson H, Perseil I, Pfander C, Pieruschka R, Raess M, Rauber A, Richard AS, Romano P, Rosato A, Sánchez‐Pla A, Sansone S, Sarkans U, Serrano‐Solano B, Tang J, Tanoli Z, Tedds J, Wagener H, Weise M, Westerhoff HV, Wittner R, Ewbank J, Blomberg N, Gribbon P. "Be sustainable": EOSC-Life recommendations for implementation of FAIR principles in life science data handling. EMBO J 2023; 42:e115008. [PMID: 37964598 PMCID: PMC10690449 DOI: 10.15252/embj.2023115008] [Citation(s) in RCA: 11] [Impact Index Per Article: 5.5] [Received: 07/14/2023] [Revised: 09/12/2023] [Accepted: 09/18/2023] [Indexed: 11/16/2023] Open
Abstract
The main goals and challenges for the life science communities in the Open Science framework are to increase reuse and sustainability of data resources, software tools, and workflows, especially in large-scale data-driven research and computational analyses. Here, we present key findings, procedures, effective measures and recommendations for generating and establishing sustainable life science resources based on the collaborative, cross-disciplinary work done within the EOSC-Life (European Open Science Cloud for Life Sciences) consortium. Bringing together 13 European life science research infrastructures, it has laid the foundation for an open, digital space to support biological and medical research. Using lessons learned from 27 selected projects, we describe the organisational, technical, financial and legal/ethical challenges that represent the main barriers to sustainability in the life sciences. We show how EOSC-Life provides a model for sustainable data management according to FAIR (findability, accessibility, interoperability, and reusability) principles, including solutions for sensitive- and industry-related resources, by means of cross-disciplinary training and best practices sharing. Finally, we illustrate how data harmonisation and collaborative work facilitate interoperability of tools, data, solutions and lead to a better understanding of concepts, semantics and functionalities in the life sciences.
|
5
|
de Crécy-Lagard V, Amorin de Hegedus R, Arighi C, Babor J, Bateman A, Blaby I, Blaby-Haas C, Bridge AJ, Burley SK, Cleveland S, Colwell LJ, Conesa A, Dallago C, Danchin A, de Waard A, Deutschbauer A, Dias R, Ding Y, Fang G, Friedberg I, Gerlt J, Goldford J, Gorelik M, Gyori BM, Henry C, Hutinet G, Jaroch M, Karp PD, Kondratova L, Lu Z, Marchler-Bauer A, Martin MJ, McWhite C, Moghe GD, Monaghan P, Morgat A, Mungall CJ, Natale DA, Nelson WC, O’Donoghue S, Orengo C, O’Toole KH, Radivojac P, Reed C, Roberts RJ, Rodionov D, Rodionova IA, Rudolf JD, Saleh L, Sheynkman G, Thibaud-Nissen F, Thomas PD, Uetz P, Vallenet D, Carter EW, Weigele PR, Wood V, Wood-Charlson EM, Xu J. A roadmap for the functional annotation of protein families: a community perspective. Database (Oxford) 2022; 2022:baac062. [PMID: 35961013 PMCID: PMC9374478 DOI: 10.1093/database/baac062] [Citation(s) in RCA: 15] [Impact Index Per Article: 5.0] [Received: 06/07/2022] [Revised: 06/28/2022] [Accepted: 08/03/2022] [Indexed: 12/23/2022]
Abstract
Over the last 25 years, biology has entered the genomic era and is becoming a science of 'big data'. Most interpretations of genomic analyses rely on accurate functional annotations of the proteins encoded by more than 500 000 genomes sequenced to date. By different estimates, only half the predicted sequenced proteins carry an accurate functional annotation, and this percentage varies drastically between different organismal lineages. Such a large gap in knowledge hampers all aspects of biological enterprise and, thereby, is standing in the way of genomic biology reaching its full potential. A brainstorming meeting to address this issue funded by the National Science Foundation was held during 3-4 February 2022. Bringing together data scientists, biocurators, computational biologists and experimentalists within the same venue allowed for a comprehensive assessment of the current state of functional annotations of protein families. Further, major issues that were obstructing the field were identified and discussed, which ultimately allowed for the proposal of solutions on how to move forward.
Affiliation(s)
- Valérie de Crécy-Lagard
- Department of Microbiology and Cell Sciences, University of Florida, Gainesville, FL 32611, USA
| | | | - Cecilia Arighi
- Department of Computer and Information Sciences, University of Delaware, Newark, DE 19713, USA
| | - Jill Babor
- Department of Microbiology and Cell Sciences, University of Florida, Gainesville, FL 32611, USA
| | - Alex Bateman
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK
| | - Ian Blaby
- US Department of Energy Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
| | - Crysten Blaby-Haas
- Biology Department, Brookhaven National Laboratory, Upton, NY 11973, USA
| | - Alan J Bridge
- Swiss-Prot group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, Geneva 4 CH-1211, Switzerland
| | - Stephen K Burley
- RCSB Protein Data Bank, Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA
| | - Stacey Cleveland
- Department of Microbiology and Cell Sciences, University of Florida, Gainesville, FL 32611, USA
| | - Lucy J Colwell
- Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge CB2 1EW, UK
| | - Ana Conesa
- Spanish National Research Council, Institute for Integrative Systems Biology, Paterna, Valencia 46980, Spain
| | - Christian Dallago
- TUM (Technical University of Munich) Department of Informatics, Bioinformatics & Computational Biology, i12, Boltzmannstr. 3, Garching/Munich 85748, Germany
| | - Antoine Danchin
- School of Biomedical Sciences, Li KaShing Faculty of Medicine, The University of Hong Kong, 21 Sassoon Road, Pokfulam, SAR Hong Kong 999077, China
| | - Anita de Waard
- Research Collaboration Unit, Elsevier, Jericho, VT 05465, USA
| | - Adam Deutschbauer
- Environmental Genomics and Systems Biology Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
| | - Raquel Dias
- Department of Microbiology and Cell Sciences, University of Florida, Gainesville, FL 32611, USA
| | - Yousong Ding
- Department of Medicinal Chemistry, Center for Natural Products, Drug Discovery and Development, University of Florida, Gainesville, FL 32610, USA
| | - Gang Fang
- NYU-Shanghai, Shanghai 200120, China
| | - Iddo Friedberg
- Department of Veterinary Microbiology and Preventive Medicine, Iowa State University, Ames, IA 50011, USA
| | - John Gerlt
- Institute for Genomic Biology and Departments of Biochemistry and Chemistry, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
| | - Joshua Goldford
- Physics of Living Systems, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
| | - Mark Gorelik
- Department of Microbiology and Cell Sciences, University of Florida, Gainesville, FL 32611, USA
| | - Benjamin M Gyori
- Laboratory of Systems Pharmacology, Harvard Medical School, Boston, MA 02115, USA
| | - Christopher Henry
- Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, IL 60439, USA
| | - Geoffrey Hutinet
- Department of Microbiology and Cell Sciences, University of Florida, Gainesville, FL 32611, USA
| | - Marshall Jaroch
- Department of Microbiology and Cell Sciences, University of Florida, Gainesville, FL 32611, USA
| | - Peter D Karp
- Bioinformatics Research Group, SRI International, Menlo Park, CA 94025, USA
| | | | - Zhiyong Lu
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), 8600 Rockville Pike, Bethesda, MD 20817, USA
| | - Aron Marchler-Bauer
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), 8600 Rockville Pike, Bethesda, MD 20817, USA
| | - Maria-Jesus Martin
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK
| | - Claire McWhite
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08540, USA
| | - Gaurav D Moghe
- Plant Biology Section, School of Integrative Plant Science, Cornell University, Ithaca, NY 14853, USA
| | - Paul Monaghan
- Department of Agricultural Education and Communication, University of Florida, Gainesville, FL 32611, USA
| | - Anne Morgat
- Swiss-Prot group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, Geneva 4 CH-1211, Switzerland
| | - Christopher J Mungall
- Environmental Genomics and Systems Biology Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
| | - Darren A Natale
- Georgetown University Medical Center, Washington, DC 20007, USA
| | - William C Nelson
- Biological Sciences Division, Pacific Northwest National Laboratories, Richland, WA 99354, USA
| | - Seán O’Donoghue
- School of Biotechnology and Biomolecular Sciences, University of NSW, Sydney, NSW 2052, Australia
| | - Christine Orengo
- Department of Structural and Molecular Biology, University College London, London WC1E 6BT, UK
| | | | - Predrag Radivojac
- Khoury College of Computer Sciences, Northeastern University, Boston, MA 02115, USA
| | - Colbie Reed
- Department of Microbiology and Cell Sciences, University of Florida, Gainesville, FL 32611, USA
| | | | - Dmitri Rodionov
- Sanford Burnham Prebys Medical Discovery Institute, La Jolla, CA 92037, USA
| | - Irina A Rodionova
- Department of Bioengineering, Division of Engineering, University of California at San Diego, La Jolla, CA 92093-0412, USA
| | - Jeffrey D Rudolf
- Department of Chemistry, University of Florida, Gainesville, FL 32611, USA
| | - Lana Saleh
- New England Biolabs, Ipswich, MA 01938, USA
| | - Gloria Sheynkman
- Department of Molecular Physiology and Biological Physics, University of Virginia, Charlottesville, VA, USA
| | - Francoise Thibaud-Nissen
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), 8600 Rockville Pike, Bethesda, MD 20817, USA
| | - Paul D Thomas
- Department of Population and Public Health Sciences, University of Southern California, Los Angeles, CA 90033, USA
| | - Peter Uetz
- Center for Biological Data Science, Virginia Commonwealth University, Richmond, VA 23284, USA
| | - David Vallenet
- LABGeM, Génomique Métabolique, CEA, Genoscope, Institut François Jacob, Université d’Évry, Université Paris-Saclay, CNRS, Evry 91057, France
| | - Erica Watson Carter
- Department of Plant Pathology, University of Florida Citrus Research and Education Center, 700 Experiment Station Rd., Lake Alfred, FL 33850, USA
| | | | - Valerie Wood
- Department of Biochemistry, University of Cambridge, Cambridge CB2 1GA, UK
| | - Elisha M Wood-Charlson
- Environmental Genomics and Systems Biology Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
| | - Jin Xu
- Department of Plant Pathology, University of Florida Citrus Research and Education Center, 700 Experiment Station Rd., Lake Alfred, FL 33850, USA
| |
|
6
|
Gasparetto T, Orlova M, Vernikovskiy A. Same, same but different: analyzing uncertainty of outcome in Formula One races. MANAGING SPORT AND LEISURE 2022. [DOI: 10.1080/23750472.2022.2085619] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/18/2022]
Affiliation(s)
- Thadeu Gasparetto
- Department of Management, National Research University Higher School of Economics (HSE), Saint Petersburg, Russian Federation
| | - Marina Orlova
- Department of Management, National Research University Higher School of Economics (HSE), Saint Petersburg, Russian Federation
| | - Anton Vernikovskiy
- Department of Management, National Research University Higher School of Economics (HSE), Saint Petersburg, Russian Federation
| |
|
7
|
Rodriguez-Esteban R. New reasons for biologists to write with a formal language. Database (Oxford) 2022; 2022:6600538. [PMID: 35657112 PMCID: PMC9216469 DOI: 10.1093/database/baac039] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/25/2022] [Revised: 03/18/2022] [Accepted: 05/17/2022] [Indexed: 12/03/2022]
Abstract
Current biological writing is afflicted by the use of ambiguous names, convoluted sentences, vague statements and narrative-fitted storylines. This represents a challenge for biological research in general and in particular for fields such as biological database curation and text mining, which have been tasked to cope with exponentially growing content. Improving the quality of biological writing by encouraging unambiguity and precision would foster expository discipline and machine reasoning. More specifically, the routine inclusion of formal languages in biological writing would improve our ability to describe, compile and model biology.
Affiliation(s)
- Raul Rodriguez-Esteban
- Roche Pharmaceutical Research and Early Development, Roche Innovation Center Basel, Grenzacherstrasse 124, Basel 4070, Switzerland
| |
|
8
|
Van Meenen J, Leysen H, Chen H, Baccarne R, Walter D, Martin B, Maudsley S. Making Biomedical Sciences publications more accessible for machines. MEDICINE, HEALTH CARE, AND PHILOSOPHY 2022; 25:179-190. [PMID: 35039972 DOI: 10.1007/s11019-022-10069-0] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Accepted: 01/08/2022] [Indexed: 06/14/2023]
Abstract
With the rapidly expanding catalogue of scientific publications, especially within the Biomedical Sciences field, it is becoming increasingly difficult for researchers to search for, read or even interpret emerging scientific findings. PubMed, just one of the current biomedical data repositories, comprises over 33 million citations for biomedical research, and over 2500 publications are added each day. To further strengthen the impact of biomedical research, we suggest that there should be more synergy between publications and machines. By bringing machines into the realm of research and publication, we can greatly augment the assessment, investigation and cataloging of the biomedical literary corpus. The effective application of machine-based manuscript assessment and interpretation is now crucial, and potentially stands as the most effective way for researchers to comprehend and process the tsunami of biomedical data and literature. Many biomedical manuscripts are currently published online in poorly searchable document types, with figures and data presented in formats that are partially inaccessible to machine-based approaches. The structure and format of biomedical manuscripts should be adapted to facilitate machine-assisted interrogation of this important literary corpus. In this context, it is important to embrace the concept that biomedical scientists should also write manuscripts that can be read by machines. It is likely that an enhanced human-machine synergy in reading biomedical publications will greatly enhance biomedical data retrieval and reveal novel insights into complex datasets.
Affiliation(s)
- Joris Van Meenen
- Receptor Biology Lab, Department of Biomedical Sciences, University of Antwerp, Wilrijk, 2610, Antwerp, Belgium
- Antwerp Research Group for Ocular Science, Department of Translational Neurosciences, University of Antwerp, Wilrijk, 2610, Antwerp, Belgium
| | - Hanne Leysen
- Receptor Biology Lab, Department of Biomedical Sciences, University of Antwerp, Wilrijk, 2610, Antwerp, Belgium
| | - Hongyu Chen
- Weill Cornell Medical College, New York, NY, USA
| | - Rudi Baccarne
- Anet Library Automation, University of Antwerp, Wilrijk, 2610, Antwerp, Belgium
| | - Deborah Walter
- Receptor Biology Lab, Department of Biomedical Sciences, University of Antwerp, Wilrijk, 2610, Antwerp, Belgium
| | - Bronwen Martin
- Faculty of Pharmaceutical, Veterinary and Biomedical Sciences, University of Antwerp, Wilrijk, 2610, Antwerp, Belgium
| | - Stuart Maudsley
- Receptor Biology Lab, Department of Biomedical Sciences, University of Antwerp, Wilrijk, 2610, Antwerp, Belgium.
| |
|
9
|
Perry A, Netscher S. Measuring the time spent on data curation. JOURNAL OF DOCUMENTATION 2022. [DOI: 10.1108/jd-08-2021-0167] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/17/2022]
Abstract
Purpose: Budgeting data curation tasks in research projects is difficult. In this paper, we investigate the time spent on data curation, more specifically on cleaning and documenting quantitative data for data sharing. We develop recommendations on cost factors in research data management. Design/methodology/approach: We make use of a pilot study conducted at the GESIS Data Archive for the Social Sciences in Germany between December 2016 and September 2017. During this period, data curators at GESIS - Leibniz Institute for the Social Sciences documented their working hours while cleaning and documenting data from ten quantitative survey studies. We analyse recorded times and discuss with the data curators involved in this work to identify and examine important cost factors in data curation, that is aspects that increase hours spent and factors that lead to a reduction of their work. Findings: We identify two major drivers of time spent on data curation: The size of the data and personal information contained in the data. Learning effects can occur when data are similar, that is when they contain same variables. Important interdependencies exist between individual tasks in data curation and in connection with certain data characteristics. Originality/value: The different tasks of data curation, time spent on them and interdependencies between individual steps in curation have so far not been analysed.
|
10
|
Ramsey J, McIntosh B, Renfro D, Aleksander SA, LaBonte S, Ross C, Zweifel AE, Liles N, Farrar S, Gill JJ, Erill I, Ades S, Berardini TZ, Bennett JA, Brady S, Britton R, Carbon S, Caruso SM, Clements D, Dalia R, Defelice M, Doyle EL, Friedberg I, Gurney SMR, Hughes L, Johnson A, Kowalski JM, Li D, Lovering RC, Mans TL, McCarthy F, Moore SD, Murphy R, Paustian TD, Perdue S, Peterson CN, Prüß BM, Saha MS, Sheehy RR, Tansey JT, Temple L, Thorman AW, Trevino S, Vollmer AC, Walbot V, Willey J, Siegele DA, Hu JC. Crowdsourcing biocuration: The Community Assessment of Community Annotation with Ontologies (CACAO). PLoS Comput Biol 2021; 17:e1009463. [PMID: 34710081 PMCID: PMC8553046 DOI: 10.1371/journal.pcbi.1009463] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Indexed: 11/19/2022] Open
Abstract
Experimental data about gene functions curated from the primary literature have enormous value for research scientists in understanding biology. Using the Gene Ontology (GO), manual curation by experts has provided an important resource for studying gene function, especially within model organisms. Unprecedented expansion of the scientific literature and validation of the predicted proteins have increased both data value and the challenges of keeping pace. Capturing literature-based functional annotations is limited by the ability of biocurators to handle the massive and rapidly growing scientific literature. Within the community-oriented wiki framework for GO annotation called the Gene Ontology Normal Usage Tracking System (GONUTS), we describe an approach to expand biocuration through crowdsourcing with undergraduates. This multiplies the number of high-quality annotations in international databases, enriches our coverage of the literature on normal gene function, and pushes the field in new directions. From an intercollegiate competition judged by experienced biocurators, Community Assessment of Community Annotation with Ontologies (CACAO), we have contributed nearly 5,000 literature-based annotations. Many of those annotations are to organisms not currently well-represented within GO. Over a 10-year history, our community contributors have spurred changes to the ontology not traditionally covered by professional biocurators. The CACAO principle of relying on community members to participate in and shape the future of biocuration in GO is a powerful and scalable model used to promote the scientific enterprise. It also provides undergraduate students with a unique and enriching introduction to critical reading of primary literature and acquisition of marketable skills.
Affiliation(s)
- Jolene Ramsey
- Department of Biochemistry & Biophysics, Texas A&M University, College Station, Texas, United States of America
- Center for Phage Technology, Texas A&M University, College Station, Texas, United States of America
| | - Brenley McIntosh
- Department of Biochemistry & Biophysics, Texas A&M University, College Station, Texas, United States of America
| | - Daniel Renfro
- Department of Biochemistry & Biophysics, Texas A&M University, College Station, Texas, United States of America
| | - Suzanne A. Aleksander
- Department of Biochemistry & Biophysics, Texas A&M University, College Station, Texas, United States of America
| | - Sandra LaBonte
- Department of Biochemistry & Biophysics, Texas A&M University, College Station, Texas, United States of America
| | - Curtis Ross
- Department of Biochemistry & Biophysics, Texas A&M University, College Station, Texas, United States of America
- Center for Phage Technology, Texas A&M University, College Station, Texas, United States of America
| | - Adrienne E. Zweifel
- Department of Biochemistry & Biophysics, Texas A&M University, College Station, Texas, United States of America
| | - Nathan Liles
- Department of Biochemistry & Biophysics, Texas A&M University, College Station, Texas, United States of America
| | - Shabnam Farrar
- Department of Biochemistry & Biophysics, Texas A&M University, College Station, Texas, United States of America
| | - Jason J. Gill
- Center for Phage Technology, Texas A&M University, College Station, Texas, United States of America
- Department of Animal Science, Texas A&M University, College Station, Texas, United States of America
| | - Ivan Erill
- Department of Biological Sciences, University of Maryland Baltimore County, Baltimore, Maryland, United States of America
- Department of Computer Science and Electrical Engineering, University of Maryland Baltimore County, Baltimore, Maryland, United States of America
| | - Sarah Ades
- Department of Biochemistry & Molecular Biology, The Pennsylvania State University, University Park, Pennsylvania, United States of America
| | - Tanya Z. Berardini
- The Arabidopsis Information Resource, Phoenix Bioinformatics, Newark, California, United States of America
| | - Jennifer A. Bennett
- Department of Biology and Earth Science, Otterbein University, Westerville, Ohio, United States of America
| | - Siobhan Brady
- Department of Plant Biology and Genome Center, University of California Davis, Davis, California, United States of America
| | - Robert Britton
- Department of Microbiology and Molecular Genetics, Michigan State University, East Lansing, Michigan, United States of America
| | - Seth Carbon
- Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, California, United States of America
| | - Steven M. Caruso
- Department of Biological Sciences, University of Maryland Baltimore County, Baltimore, Maryland, United States of America
| | - Dave Clements
- Department of Biology, Johns Hopkins University, Baltimore, Maryland, United States of America
| | - Ritu Dalia
- Department of Biology, Drexel University, Philadelphia, Pennsylvania, United States of America
| | - Meredith Defelice
- Department of Biochemistry & Molecular Biology, The Pennsylvania State University, University Park, Pennsylvania, United States of America
| | - Erin L. Doyle
- Biology Department, Doane University, Crete, Nebraska, United States of America
| | - Iddo Friedberg
- Department of Microbiology, Miami University, Oxford, Ohio, United States of America
| | - Susan M. R. Gurney
- Department of Biology, Drexel University, Philadelphia, Pennsylvania, United States of America
| | - Lee Hughes
- Department of Biological Sciences, University of North Texas, Denton, Texas, United States of America
| | - Allison Johnson
- Center for the Study of Biological Complexity, Virginia Commonwealth University, Richmond, Virginia, United States of America
| | - Jason M. Kowalski
- Biological Sciences Department, University of Wisconsin-Parkside, Kenosha, Wisconsin, United States of America
| | - Donghui Li
- The Arabidopsis Information Resource, Phoenix Bioinformatics, Newark, California, United States of America
| | - Ruth C. Lovering
- Institute of Cardiovascular Science, University College London, London, United Kingdom
| | - Tamara L. Mans
- Department of Biochemistry and Biotechnology, Minnesota State University Moorhead, Brooklyn Park, Minnesota, United States of America
| | - Fiona McCarthy
- Department of Basic Science, College of Veterinary Medicine, Mississippi State University, Starkville, Mississippi, United States of America
| | - Sean D. Moore
- Burnett School of Biomedical Sciences, University of Central Florida, Orlando, Florida, United States of America
| | - Rebecca Murphy
- Department of Biology, Centenary College of Louisiana, Shreveport, Louisiana, United States of America
| | - Timothy D. Paustian
- Department of Bacteriology, University of Wisconsin, Madison, Wisconsin, United States of America
| | - Sarah Perdue
- Biological Sciences Department, University of Wisconsin-Parkside, Kenosha, Wisconsin, United States of America
| | - Celeste N. Peterson
- Biology Department, Suffolk University, Boston, Massachusetts, United States of America
| | - Birgit M. Prüß
- Microbiological Sciences Department, North Dakota State University, Fargo, North Dakota, United States of America
| | - Margaret S. Saha
- Department of Biology, College of William & Mary, Williamsburg, Virginia, United States of America
| | - Robert R. Sheehy
- Biology Department, Radford University, Radford, Virginia, United States of America
| | - John T. Tansey
- Department of Biochemistry and Molecular Biology, Otterbein University, Westerville, Ohio, United States of America
| | - Louise Temple
- School of Integrated Sciences, James Madison University, Harrisonburg, Virginia, United States of America
| | - Alexander William Thorman
- Department of Environmental and Public Health Sciences, University of Cincinnati, Cincinnati, Ohio, United States of America
| | - Saul Trevino
- Department of Chemistry, Math, and Physics, Houston Baptist University, Houston, Texas, United States of America
| | - Amy Cheng Vollmer
- Department of Biology, Swarthmore College, Swarthmore, Pennsylvania, United States of America
| | - Virginia Walbot
- Department of Biology, Stanford University, Stanford, California, United States of America
| | - Joanne Willey
- Department of Science Education, Donald and Barbara Zucker School of Medicine at Hofstra/Northwell, Hempstead, New York, United States of America
| | - Deborah A. Siegele
- Department of Biology, Texas A&M University, College Station, Texas, United States of America
| | - James C. Hu
- Department of Biochemistry & Biophysics, Texas A&M University, College Station, Texas, United States of America
- Center for Phage Technology, Texas A&M University, College Station, Texas, United States of America
| |
|
11
|
Paley S, Keseler IM, Krummenacker M, Karp PD. Leveraging Curation Among Escherichia coli Pathway/Genome Databases Using Ortholog-Based Annotation Propagation. Front Microbiol 2021; 12:614355. [PMID: 33763039 PMCID: PMC7982652 DOI: 10.3389/fmicb.2021.614355] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 10/05/2020] [Accepted: 03/02/2021] [Indexed: 12/19/2022] Open
Abstract
Updating genome databases to reflect newly published molecular findings for an organism was hard enough when only a single strain of a given organism had been sequenced. With multiple sequenced strains now available for many organisms, the challenge has grown significantly because of the still-limited resources available for the manual curation that corrects errors and captures new knowledge. We have developed a method to automatically propagate multiple types of curated knowledge from genes and proteins in one genome database to their orthologs in uncurated databases for related strains, imposing several quality-control filters to reduce the chances of introducing errors. We have applied this method to propagate information from the highly curated EcoCyc database for Escherichia coli K-12 to databases for 480 other Escherichia coli strains in the BioCyc database collection. The increase in value and utility of the target databases after propagation is considerable. Target databases received updates for an average of 2,535 proteins each. In addition to widespread addition and regularization of gene and protein names, 97% of the target databases were improved by the addition of at least 200 new protein complexes, at least 800 new or updated reaction assignments, and at least 2,400 sets of GO annotations.
Affiliation(s)
- Suzanne Paley
- Bioinformatics Research Group, SRI International, Menlo Park, CA, United States
| | - Ingrid M Keseler
- Bioinformatics Research Group, SRI International, Menlo Park, CA, United States
| | - Markus Krummenacker
- Bioinformatics Research Group, SRI International, Menlo Park, CA, United States
| | - Peter D Karp
- Bioinformatics Research Group, SRI International, Menlo Park, CA, United States
| |
|
12
|
Salimi N, Edwards L, Foos G, Greenbaum JA, Martini S, Reardon B, Shackelford D, Vita R, Zalman L, Peters B, Sette A. A behind-the-scenes tour of the IEDB curation process: an optimized process empirically integrating automation and human curation efforts. Immunology 2020; 161:139-147. [PMID: 32615639 PMCID: PMC7496777 DOI: 10.1111/imm.13234] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 04/13/2020] [Revised: 06/11/2020] [Accepted: 06/22/2020] [Indexed: 12/13/2022] Open
Abstract
The Immune Epitope Database and Analysis Resource (IEDB) provides the scientific community with open access to epitope data, as well as epitope prediction and analysis tools. The IEDB houses the most extensive collection of experimentally validated B‐cell and T‐cell epitope data, sourced primarily from published literature by expert curation. The data procurement requires systematic identification, categorization, curation and quality‐checking processes. Here, we provide insights into these processes, with particular focus on the dividends they have paid in terms of attaining project milestones, as well as how objective analyses of our processes have identified opportunities for process optimization. These experiences are shared as a case study of the benefits of process implementation and review in biomedical big data, as well as to encourage idea‐sharing among players in this ever‐growing space.
Affiliation(s)
- Nima Salimi
- Division of Vaccine Discovery, La Jolla Institute for Immunology, La Jolla, CA, USA
| | - Lindy Edwards
- Division of Vaccine Discovery, La Jolla Institute for Immunology, La Jolla, CA, USA
| | - Gabriele Foos
- Division of Vaccine Discovery, La Jolla Institute for Immunology, La Jolla, CA, USA
| | - Jason A Greenbaum
- Division of Vaccine Discovery, La Jolla Institute for Immunology, La Jolla, CA, USA
| | - Sheridan Martini
- Division of Vaccine Discovery, La Jolla Institute for Immunology, La Jolla, CA, USA
| | - Brian Reardon
- Division of Vaccine Discovery, La Jolla Institute for Immunology, La Jolla, CA, USA
| | - Deborah Shackelford
- Division of Vaccine Discovery, La Jolla Institute for Immunology, La Jolla, CA, USA
| | - Randi Vita
- Division of Vaccine Discovery, La Jolla Institute for Immunology, La Jolla, CA, USA
| | - Leora Zalman
- Division of Vaccine Discovery, La Jolla Institute for Immunology, La Jolla, CA, USA
| | - Bjoern Peters
- Division of Vaccine Discovery, La Jolla Institute for Immunology, La Jolla, CA, USA.,Department of Medicine, University of California, San Diego, San Diego, CA, USA
| | - Alessandro Sette
- Division of Vaccine Discovery, La Jolla Institute for Immunology, La Jolla, CA, USA.,Department of Medicine, University of California, San Diego, San Diego, CA, USA
| |
|
13
|
Quality Matters: Biocuration Experts on the Impact of Duplication and Other Data Quality Issues in Biological Databases. GENOMICS PROTEOMICS & BIOINFORMATICS 2020; 18:91-103. [PMID: 32652120 PMCID: PMC7646089 DOI: 10.1016/j.gpb.2018.11.006] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.6] [Received: 12/08/2017] [Revised: 10/24/2018] [Accepted: 12/14/2018] [Indexed: 11/27/2022]
|
14
|
Cosenz F, Qorbani D, Yamaguchi Y. An exploration of digital ride-hailing multisided platforms' market dynamics: empirical evidence from the Uber case study. INTERNATIONAL JOURNAL OF PRODUCTIVITY AND PERFORMANCE MANAGEMENT 2020. [DOI: 10.1108/ijppm-10-2019-0475] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Indexed: 11/17/2022]
Abstract
Purpose: The purpose of this paper is to experiment with a dynamic performance management (DPM) approach to explore and assess the business dynamics of digital ride-hailing platforms with a focus on both supply and demand sides, and related interplays. Design/methodology/approach: The research adopts the DPM framework supported by simulation-based experimentations for developing a systemic case interpretation of Uber Inc. and its specific business complexity. Findings: The emerging scenario analysis reveals that changes in the commission percentage for drivers and cutting prices for customers (car hailers) by competitors have significant impacts on the car-hailing industry. Originality/value: DPM and associated simulation-based analysis of the ride-hailing business may provide significant managerial decision insights and a ground base research in a relatively less-explored field within the strategic management domain.
|
15
|
From Small Association to Global Pundit: 2009–2020. A HISTORY OF ORGANIZATIONAL CHANGE 2020. [PMCID: PMC7355338 DOI: 10.1007/978-3-030-48270-1_5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/15/2022]
Abstract
When Jean Todt became Fédération Internationale de l’Automobile (FIA) president in 2009, he made good use of the Mosley legacy. At the same time, he engineered a series of new reforms that would mark the greatest difference between the old and the new FIA. This chapter explores a selection of organisational innovations during Todt’s three presidential periods (2009–2021). This selection includes two presidential elections in 2009 and 2013 which had a hitherto unrivalled emphasis on good governance, a political debacle (the F1 Bahrain Grand Prix), a structural revamp (the establishment of the FIA Statutes Review), the introduction of an external audit (aided by consulting firm, Deloitte), and a proactive media strategy to comprehend a media logic (by publishing AUTO magazine from 2012 onwards). Together, this combination of substantive and symbolic actions represents the FIA’s desire to renew its legitimacy among its members, motorists, and other stakeholders.
|
16
|
Arnaboldi V, Raciti D, Van Auken K, Chan JN, Müller HM, Sternberg PW. Text mining meets community curation: a newly designed curation platform to improve author experience and participation at WormBase. Database (Oxford) 2020; 2020:baaa006. [PMID: 32185395 PMCID: PMC7078066 DOI: 10.1093/database/baaa006] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.6] [Received: 10/29/2019] [Revised: 01/08/2020] [Accepted: 01/14/2020] [Indexed: 01/17/2023]
Abstract
Biological knowledgebases rely on expert biocuration of the research literature to maintain up-to-date collections of data organized in machine-readable form. To enter information into knowledgebases, curators need to follow three steps: (i) identify papers containing relevant data, a process called triaging; (ii) recognize named entities; and (iii) extract and curate data in accordance with the underlying data models. WormBase (WB), the authoritative repository for research data on Caenorhabditis elegans and other nematodes, uses text mining (TM) to semi-automate its curation pipeline. In addition, WB engages its community, via an Author First Pass (AFP) system, to help recognize entities and classify data types in their recently published papers. In this paper, we present a new WB AFP system that combines TM and AFP into a single application to enhance community curation. The system employs string-searching algorithms and statistical methods (e.g. support vector machines (SVMs)) to extract biological entities and classify data types, and it presents the results to authors in a web form where they validate the extracted information, rather than enter it de novo as the previous form required. With this new system, we lessen the burden for authors, while at the same time receive valuable feedback on the performance of our TM tools. The new user interface also links out to specific structured data submission forms, e.g. for phenotype or expression pattern data, giving the authors the opportunity to contribute a more detailed curation that can be incorporated into WB with minimal curator review. Our approach is generalizable and could be applied to additional knowledgebases that would like to engage their user community in assisting with the curation. In the five months succeeding the launch of the new system, the response rate has been comparable with that of the previous AFP version, but the quality and quantity of the data received has greatly improved.
Affiliation(s)
- Valerio Arnaboldi
- Division of Biology and Biological Engineering 156–29, California Institute of Technology, 1200 E California Blvd, Pasadena, CA 91125, USA
| | - Daniela Raciti
- Division of Biology and Biological Engineering 156–29, California Institute of Technology, 1200 E California Blvd, Pasadena, CA 91125, USA
| | - Kimberly Van Auken
- Division of Biology and Biological Engineering 156–29, California Institute of Technology, 1200 E California Blvd, Pasadena, CA 91125, USA
| | - Juancarlos N Chan
- Division of Biology and Biological Engineering 156–29, California Institute of Technology, 1200 E California Blvd, Pasadena, CA 91125, USA
| | - Hans-Michael Müller
- Division of Biology and Biological Engineering 156–29, California Institute of Technology, 1200 E California Blvd, Pasadena, CA 91125, USA
| | - Paul W Sternberg
- Division of Biology and Biological Engineering 156–29, California Institute of Technology, 1200 E California Blvd, Pasadena, CA 91125, USA
| |
|
17
|
Karp PD, Billington R, Caspi R, Fulcher CA, Latendresse M, Kothari A, Keseler IM, Krummenacker M, Midford PE, Ong Q, Ong WK, Paley SM, Subhraveti P. The BioCyc collection of microbial genomes and metabolic pathways. Brief Bioinform 2019; 20:1085-1093. [PMID: 29447345 PMCID: PMC6781571 DOI: 10.1093/bib/bbx085] [Citation(s) in RCA: 535] [Impact Index Per Article: 89.2] [Received: 05/18/2017] [Revised: 06/22/2017] [Indexed: 01/31/2023] Open
Abstract
BioCyc.org is a microbial genome Web portal that combines thousands of genomes with additional information inferred by computer programs, imported from other databases and curated from the biomedical literature by biologist curators. BioCyc also provides an extensive range of query tools, visualization services and analysis software. Recent advances in BioCyc include an expansion in the content of BioCyc in terms of both the number of genomes and the types of information available for each genome; an expansion in the amount of curated content within BioCyc; and new developments in the BioCyc software tools including redesigned gene/protein pages and metabolite pages; new search tools; a new sequence-alignment tool; a new tool for visualizing groups of related metabolic pathways; and a facility called SmartTables, which enables biologists to perform analyses that previously would have required a programmer's assistance.
|
18
|
Liew MS, Zhang J, See J, Ong YL. Usability Challenges for Health and Wellness Mobile Apps: Mixed-Methods Study Among mHealth Experts and Consumers. JMIR Mhealth Uhealth 2019; 7:e12160. [PMID: 30698528 PMCID: PMC6372932 DOI: 10.2196/12160] [Citation(s) in RCA: 56] [Impact Index Per Article: 9.3] [Received: 09/10/2018] [Revised: 12/11/2018] [Accepted: 12/12/2018] [Indexed: 12/20/2022] Open
Abstract
BACKGROUND By 2019, there will be an estimated 4.68 billion mobile phone users globally. This increase comes with an unprecedented proliferation in mobile apps, a plug-and-play product positioned to improve lives in innumerable ways. Within this landscape, medical apps will see a 41% compounded annual growth rate between 2015 and 2020, but paradoxically, prevailing evidence indicates declining downloads of such apps and decreasing "stickiness" with the intended end users. OBJECTIVE As usability is a prerequisite for success of health and wellness mobile apps, this paper aims to provide insights and suggestions for improving usability experience of the mobile health (mHealth) app by exploring the degree of alignment between mHealth insiders and consumers. METHODS Usability-related major themes were selected from over 20 mHealth app development studies. The list of themes, grouped into 5 categories using the Nielsen usability model, was then used as a framework to identify and classify the responses from mHealth expert (insider) interviews. Responses from the qualitative phase were integrated into some questions for a quantitative consumer survey. Subsequently, categorical data from qualitative mHealth insider interviews and numerical data from a quantitative consumer survey were compared in order to identify common usability themes and areas of divergence. RESULTS Of the 5 usability attributes described in Nielsen model, Satisfaction ranked as the top attribute for both mHealth insiders and consumers. Satisfaction refers to user likability, comfort, and pleasure. The consumer survey yielded 451 responses. Out of 9 mHealth insiders' top concerns, 5 were similar to those of the consumers. On the other hand, consumers did not grade themes such as Intuitiveness as important, which was deemed vital by mHealth insiders. Other concerns of the consumers include in-app charges and advertisements. 
CONCLUSIONS This study supports and contributes to the existing pool of mixed-research studies. Strengthening the connectivity between suppliers and users (through the designed research tool) will help increase uptake of mHealth apps. In a holistic manner, this will have a positive overall outcome for the mHealth app ecosystem.
Affiliation(s)
- Mei Shan Liew
- Alliance Manchester Business School, Manchester, United Kingdom
| | - Jian Zhang
- Alliance Manchester Business School, Manchester, United Kingdom
| | - Jovis See
- Alliance Manchester Business School, Manchester, United Kingdom
| | - Yen Leng Ong
- Alliance Manchester Business School, Manchester, United Kingdom
| |
|
19
|
Naithani S, Gupta P, Preece J, Garg P, Fraser V, Padgitt-Cobb LK, Martin M, Vining K, Jaiswal P. Involving community in genes and pathway curation. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2019; 2019:5289625. [PMID: 30649295 PMCID: PMC6334007 DOI: 10.1093/database/bay146] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.3] [Received: 09/08/2018] [Accepted: 12/11/2018] [Indexed: 12/25/2022]
Abstract
Biocuration plays a crucial role in building databases and complex systems-level platforms required for processing, annotating and analyzing ‘Big Data’ in biology. However, biocuration efforts cannot keep pace with a dramatic increase in the production of omics data; this presents one of the bottlenecks in genomics. In two pathway curation jamborees, Plant Reactome curators tested strategies for introducing researchers to pathway curation tools, harnessing biologists’ expertise in curating plant pathways and developing a network of community biocurators. We summarize the strategy, workflow and outcomes of these exercises, and discuss the role of community biocuration in advancing databases and genomic resources.
Affiliation(s)
- Sushma Naithani
- Department of Botany and Plant Pathology, Oregon State University, Corvallis, OR, USA
| | - Parul Gupta
- Department of Botany and Plant Pathology, Oregon State University, Corvallis, OR, USA
| | - Justin Preece
- Department of Botany and Plant Pathology, Oregon State University, Corvallis, OR, USA
| | - Priyanka Garg
- Department of Botany and Plant Pathology, Oregon State University, Corvallis, OR, USA
| | - Valerie Fraser
- Department of Botany and Plant Pathology, Oregon State University, Corvallis, OR, USA.,Molecular and Cellular Biology Graduate Program, Oregon State University, Corvallis, OR, USA
| | | | - Matthew Martin
- Department of Botany and Plant Pathology, Oregon State University, Corvallis, OR, USA
| | - Kelly Vining
- Department of Horticulture, Oregon State University, Corvallis, OR, USA
| | - Pankaj Jaiswal
- Department of Botany and Plant Pathology, Oregon State University, Corvallis, OR, USA
| |
|
20
|
Poux S, Arighi CN, Magrane M, Bateman A, Wei CH, Lu Z, Boutet E, Bye-A-Jee H, Famiglietti ML, Roechert B, UniProt Consortium T. On expert curation and scalability: UniProtKB/Swiss-Prot as a case study. Bioinformatics 2018; 33:3454-3460. [PMID: 29036270 PMCID: PMC5860168 DOI: 10.1093/bioinformatics/btx439] [Citation(s) in RCA: 75] [Impact Index Per Article: 10.7] [Received: 12/15/2016] [Accepted: 07/10/2017] [Indexed: 11/14/2022] Open
Abstract
Motivation: Biological knowledgebases, such as UniProtKB/Swiss-Prot, constitute an essential component of daily scientific research by offering distilled, summarized and computable knowledge extracted from the literature by expert curators. While knowledgebases play an increasingly important role in the scientific community, their ability to keep up with the growth of biomedical literature is under scrutiny. Using UniProtKB/Swiss-Prot as a case study, we address this concern via multiple literature triage approaches. Results: With the assistance of the PubTator text-mining tool, we tagged more than 10 000 articles to assess the ratio of papers relevant for curation. We first show that curators read and evaluate many more papers than they curate, and that measuring the number of curated publications is insufficient to provide a complete picture as demonstrated by the fact that 8000–10 000 papers are curated in UniProt each year while curators evaluate 50 000–70 000 papers per year. We show that 90% of the papers in PubMed are out of the scope of UniProt, that a maximum of 2–3% of the papers indexed in PubMed each year are relevant for UniProt curation, and that, despite appearances, expert curation in UniProt is scalable. Availability and implementation: UniProt is freely available at http://www.uniprot.org/. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Sylvain Poux
- Swiss-Prot Group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, 1211 Geneva 4, Switzerland
| | - Cecilia N Arighi
- Protein Information Resource, University of Delaware, Newark, DE 19711, USA
| | - Michele Magrane
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Alex Bateman
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Chih-Hsuan Wei
- National Center for Biotechnology Information (NCBI), US National Library of Medicine, Bethesda, MD 20894, USA
| | - Zhiyong Lu
- National Center for Biotechnology Information (NCBI), US National Library of Medicine, Bethesda, MD 20894, USA
| | - Emmanuel Boutet
- Swiss-Prot Group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, 1211 Geneva 4, Switzerland
| | - Hema Bye-A-Jee
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Maria Livia Famiglietti
- Swiss-Prot Group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, 1211 Geneva 4, Switzerland
| | - Bernd Roechert
- Swiss-Prot Group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, 1211 Geneva 4, Switzerland
| | - The UniProt Consortium
- Swiss-Prot Group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, 1211 Geneva 4, Switzerland.,Protein Information Resource, University of Delaware, Newark, DE 19711, USA.,European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK.,Protein Information Resource, Georgetown University Medical Center, Washington, DC 20007, USA
| |
|
21
|
Harper L, Campbell J, Cannon EKS, Jung S, Poelchau M, Walls R, Andorf C, Arnaud E, Berardini TZ, Birkett C, Cannon S, Carson J, Condon B, Cooper L, Dunn N, Elsik CG, Farmer A, Ficklin SP, Grant D, Grau E, Herndon N, Hu ZL, Humann J, Jaiswal P, Jonquet C, Laporte MA, Larmande P, Lazo G, McCarthy F, Menda N, Mungall CJ, Munoz-Torres MC, Naithani S, Nelson R, Nesdill D, Park C, Reecy J, Reiser L, Sanderson LA, Sen TZ, Staton M, Subramaniam S, Tello-Ruiz MK, Unda V, Unni D, Wang L, Ware D, Wegrzyn J, Williams J, Woodhouse M, Yu J, Main D. AgBioData consortium recommendations for sustainable genomics and genetics databases for agriculture. Database (Oxford) 2018; 2018:5096675. [PMID: 30239679 PMCID: PMC6146126 DOI: 10.1093/database/bay088] [Citation(s) in RCA: 44] [Impact Index Per Article: 6.3] [Received: 03/29/2018] [Revised: 07/19/2018] [Accepted: 07/30/2018] [Indexed: 01/07/2023]
Abstract
The future of agricultural research depends on data. The sheer volume of agricultural biological data being produced today makes excellent data management essential. Governmental agencies, publishers and science funders require data management plans for publicly funded research. Furthermore, the value of data increases exponentially when they are properly stored, described, integrated and shared, so that they can be easily utilized in future analyses. AgBioData (https://www.agbiodata.org) is a consortium of people working at agricultural biological databases, data archives and knowledgebases who strive to identify common issues in database development, curation and management, with the goal of creating database products that are more Findable, Accessible, Interoperable and Reusable. We strive to promote authentic, detailed, accurate and explicit communication between all parties involved in scientific data. As a step toward this goal, we present the current state of biocuration, ontologies, metadata and persistence, database platforms, programmatic (machine) access to data, communication and sustainability with regard to data curation. Each section describes challenges and opportunities for these topics, along with recommendations and best practices.
Affiliation(s)
- Lisa Harper
- Corn Insects and Crop Genetics Research Unit, USDA-ARS, Ames, IA, USA
| | | | - Ethalinda K S Cannon
- Corn Insects and Crop Genetics Research Unit, USDA-ARS, Ames, IA, USA
- Computer Science, Iowa State University, Ames, IA, USA
| | - Sook Jung
- Horticulture, Washington State University, Pullman, WA, USA
| | - Monica Poelchau
- National Agricultural Library, USDA Agricultural Research Service, Beltsville, MD, USA
| | | | - Carson Andorf
- Corn Insects and Crop Genetics Research Unit, USDA-ARS, Ames, IA, USA
- Computer Science, Iowa State University, Ames, IA, USA
| | - Elizabeth Arnaud
- Bioversity International, Informatics Unit, Conservation and Availability Programme, Parc Scientifique Agropolis II, Montpellier, France
| | - Tanya Z Berardini
- The Arabidopsis Information Resource, Phoenix Bioinformatics, Fremont, CA, USA
| | | | - Steve Cannon
- Corn Insects and Crop Genetics Research Unit, USDA-ARS, Ames, IA, USA
| | - James Carson
- Texas Advanced Computing Center, The University of Texas at Austin, Austin, TX, USA
| | - Bradford Condon
- Entomology and Plant Pathology, University of Tennessee Knoxville, Knoxville, TN, USA
| | - Laurel Cooper
- Department of Botany and Plant Pathology, Oregon State University, Corvallis, OR, USA
| | - Nathan Dunn
- Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
| | - Christine G Elsik
- Division of Animal Sciences and Division of Plant Sciences, University of Missouri, Columbia, MO, USA
| | - Andrew Farmer
- National Center for Genome Resources, Santa Fe, NM, USA
| | | | - David Grant
- Corn Insects and Crop Genetics Research Unit, USDA-ARS, Ames, IA, USA
| | - Emily Grau
- National Center for Genome Resources, Santa Fe, NM, USA
| | - Nic Herndon
- Ecology and Evolutionary Biology, University of Connecticut, Storrs, CT, USA
| | - Zhi-Liang Hu
- Animal Science, Iowa State University, Ames, USA
- Jodi Humann
- Horticulture, Washington State University, Pullman, WA, USA
- Pankaj Jaiswal
- Department of Botany and Plant Pathology, Oregon State University, Corvallis, OR, USA
- Clement Jonquet
- Laboratory of Informatics, Robotics, Microelectronics of Montpellier, University of Montpellier & CNRS, Montpellier, France
- Marie-Angélique Laporte
- Bioversity International, Informatics Unit, Conservation and Availability Programme, Parc Scientifique Agropolis II, Montpellier, France
- Gerard Lazo
- Crop Improvement and Genetics Research Unit, USDA-ARS, Albany, CA, USA
- Fiona McCarthy
- School of Animal and Comparative Biomedical Sciences, University of Arizona, Tucson, AZ, USA
- Sushma Naithani
- Department of Botany and Plant Pathology, Oregon State University, Corvallis, OR, USA
- Rex Nelson
- Corn Insects and Crop Genetics Research Unit, USDA-ARS, Ames, IA, USA
- Daureen Nesdill
- Marriott Library, University of Utah, Salt Lake City, UT, USA
- Carissa Park
- Animal Science, Iowa State University, Ames, USA
- James Reecy
- Animal Science, Iowa State University, Ames, USA
- Leonore Reiser
- The Arabidopsis Information Resource, Phoenix Bioinformatics, Fremont, CA, USA
- Taner Z Sen
- Crop Improvement and Genetics Research Unit, USDA-ARS, Albany, CA, USA
- Margaret Staton
- Entomology and Plant Pathology, University of Tennessee Knoxville, Knoxville, TN, USA
- Victor Unda
- Horticulture, Washington State University, Pullman, WA, USA
- Deepak Unni
- Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
- Liya Wang
- Plant Genomics, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA
- Doreen Ware
- USDA, Plant, Soil and Nutrition Research, Ithaca, NY, USA
- Plant Genomics, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA
- Jill Wegrzyn
- Ecology and Evolutionary Biology, University of Connecticut, Storrs, CT, USA
- Jason Williams
- Cold Spring Harbor Laboratory, DNA Learning Center, Cold Spring Harbor, NY, USA
- Margaret Woodhouse
- Department of Ecology, Evolution, and Organismal Biology, Iowa State University, Ames, IA, USA
- Jing Yu
- Horticulture, Washington State University, Pullman, WA, USA
- Doreen Main
- Horticulture, Washington State University, Pullman, WA, USA
|
22
|
Gabella C, Durinx C, Appel R. Funding knowledgebases: Towards a sustainable funding model for the UniProt use case. F1000Res 2017; 6. [PMID: 29333230 PMCID: PMC5747334 DOI: 10.12688/f1000research.12989.2] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.9] [Accepted: 03/19/2018] [Indexed: 11/30/2022]
Abstract
Millions of life scientists across the world rely on bioinformatics data resources for their research projects. Data resources can be very expensive, especially those with high added value such as expert-curated knowledgebases. Despite the increasing need for such highly accurate and reliable sources of scientific information, most of them do not have secured funding over the near future and often depend on short-term grants that are much shorter than their planning horizon. Additionally, they are often evaluated as research projects rather than as research infrastructure components. In this work, twelve funding models for data resources are described and applied to the case study of the Universal Protein Resource (UniProt), a key resource for protein sequences and functional information. We show that most of the models are inconsistent with open access or equity policies, and that while some models cannot cover the total costs, they could potentially be used as a complementary income source. We propose the Infrastructure Model as a sustainable and equitable model for all core data resources in the life sciences. With this model, funding agencies would set aside a fixed percentage of their research grant volumes, which would subsequently be redistributed to core data resources according to well-defined selection criteria. This model, compatible with the principles of open science, is in agreement with several international initiatives such as the Human Frontiers Science Program Organisation (HFSPO) and the OECD Global Science Forum (GSF) project. Here, we have estimated that less than 1% of the total amount dedicated to research grants in the life sciences would be sufficient to cover the costs of the core data resources worldwide, including both knowledgebases and deposition databases.
Affiliation(s)
- Chiara Gabella
- ELIXIR-Switzerland, SIB Swiss Institute of Bioinformatics, Lausanne, 1015, Switzerland
- Christine Durinx
- ELIXIR-Switzerland, SIB Swiss Institute of Bioinformatics, Lausanne, 1015, Switzerland
- Ron Appel
- ELIXIR-Switzerland, SIB Swiss Institute of Bioinformatics, Lausanne, 1015, Switzerland
|
23
|
Data management and data enrichment for systems biology projects. J Biotechnol 2017; 261:229-237. [PMID: 28606610 DOI: 10.1016/j.jbiotec.2017.06.007] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.9] [Received: 02/21/2017] [Revised: 06/06/2017] [Accepted: 06/09/2017] [Indexed: 12/24/2022]
Abstract
Collecting, curating, interlinking, and sharing high-quality data are central to de.NBI-SysBio, the systems biology data management service center within the de.NBI network (German Network for Bioinformatics Infrastructure). The work of the center is guided by the FAIR principles for scientific data management and stewardship. FAIR stands for the four foundational principles of Findability, Accessibility, Interoperability, and Reusability, which were established to enhance the ability of machines to automatically find, access, exchange, and use data. In this overview paper we describe three tools (SABIO-RK, Excemplify, SEEK) that exemplify the contribution of de.NBI-SysBio services to the FAIR storage and exchange of data, models, and experimental methods. We explain the interconnectivity of the tools and the data workflow within systems biology projects. For many years we have been the German partner in the FAIRDOM initiative (http://fair-dom.org) to establish a European data and model management service facility for systems biology.
|
24
|
Abstract
Can we use programs for automated or semi-automated information extraction from scientific texts as practical alternatives to professional curation? I show that error rates of current information extraction programs are too high to replace professional curation today. Furthermore, current information extraction programs extract single narrow slivers of information, such as individual protein interactions; they cannot extract the large breadth of information extracted by professional curators for databases such as EcoCyc. They also cannot arbitrate among conflicting statements in the literature as curators can. Therefore, funding agencies should not hobble the curation efforts of existing databases on the assumption that a problem that has stymied Artificial Intelligence researchers for more than 60 years will be solved tomorrow. Semi-automated extraction techniques appear to have significantly more potential based on a review of recent tools that enhance curator productivity. But a full cost-benefit analysis for these tools is lacking. Without such analysis it is possible to expend significant effort developing information-extraction tools that automate small parts of the overall curation workflow without achieving a significant decrease in curation costs.
Affiliation(s)
- Peter D Karp
- Bioinformatics Research Group, SRI International, 333 Ravenswood Ave, Menlo Park, CA 94025, USA. Tel: 650-859-4358; Fax: 650-859-3735
|