1
|
Smith N, Yuan X, Melissinos C, Moghe G. FuncFetch: an LLM-assisted workflow enables mining thousands of enzyme-substrate interactions from published manuscripts. Bioinformatics 2024; 41:btae756. [PMID: 39718779 PMCID: PMC11734755 DOI: 10.1093/bioinformatics/btae756] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/22/2024] [Revised: 11/16/2024] [Accepted: 12/20/2024] [Indexed: 12/25/2024] Open
Abstract
MOTIVATION Thousands of genomes are publicly available, however, most genes in those genomes have poorly defined functions. This is partly due to a gap between previously published, experimentally characterized protein activities and activities deposited in databases. This activity deposition is bottlenecked by the time-consuming biocuration process. The emergence of large language models presents an opportunity to speed up the text-mining of protein activities for biocuration. RESULTS We developed FuncFetch-a workflow that integrates NCBI E-Utilities, OpenAI's GPT-4, and Zotero-to screen thousands of manuscripts and extract enzyme activities. Extensive validation revealed high precision and recall of GPT-4 in determining whether the abstract of a given paper indicates the presence of a characterized enzyme activity in that paper. Provided the manuscript, FuncFetch extracted data such as species information, enzyme names, sequence identifiers, substrates, and products, which were subjected to extensive quality analyses. Comparison of this workflow against a manually curated dataset of BAHD acyltransferase activities demonstrated a precision/recall of 0.86/0.64 in extracting substrates. We further deployed FuncFetch on nine large plant enzyme families. Screening 26 543 papers, FuncFetch retrieved 32 605 entries from 5459 selected papers. We also identified multiple extraction errors including incorrect associations, nontarget enzymes, and hallucinations, which highlight the need for further manual curation. The BAHD activities were verified, resulting in a comprehensive functional fingerprint of this family and revealing that ∼70% of the experimentally characterized enzymes are uncurated in the public domain. FuncFetch represents an advance in biocuration and lays the groundwork for predicting the functions of uncharacterized enzymes. AVAILABILITY AND IMPLEMENTATION Code and minimally curated activities are available at: https://github.com/moghelab/funcfetch and https://tools.moghelab.org/funczymedb.
Collapse
Affiliation(s)
- Nathaniel Smith
- Plant Biology Section, School of Integrative Plant Science, Cornell University, Ithaca, NY 14853, United States
| | - Xinyu Yuan
- Plant Biology Section, School of Integrative Plant Science, Cornell University, Ithaca, NY 14853, United States
| | - Chesney Melissinos
- Plant Biology Section, School of Integrative Plant Science, Cornell University, Ithaca, NY 14853, United States
| | - Gaurav Moghe
- Plant Biology Section, School of Integrative Plant Science, Cornell University, Ithaca, NY 14853, United States
| |
Collapse
|
3
|
Reynolds KA, Rosa-Molinar E, Ward RE, Zhang H, Urbanowicz BR, Settles AM. Accelerating biological insight for understudied genes. Integr Comp Biol 2021; 61:2233-2243. [PMID: 33970251 DOI: 10.1093/icb/icab029] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open
Abstract
The rapid expansion of genome sequence data is increasing the discovery of protein-coding genes across all domains of life. Annotating these genes with reliable functional information is necessary to understand evolution, to define the full biochemical space accessed by nature, and to identify target genes for biotechnology improvements. The vast majority of proteins are annotated based on sequence conservation with no specific biological, biochemical, genetic, or cellular function identified. Recent technical advances throughout the biological sciences enable experimental research on these understudied protein-coding genes in a broader collection of species. However, scientists have incentives and biases to continue focusing on well documented genes within their preferred model organism. This perspective suggests a research model that seeks to break historic silos of research bias by enabling interdisciplinary teams to accelerate biological functional annotation. We propose an initiative to develop coordinated projects of collaborating evolutionary biologists, cell biologists, geneticists, and biochemists that will focus on subsets of target genes in multiple model organisms. Concurrent analysis in multiple organisms takes advantage of evolutionary divergence and selection, which causes individual species to be better suited as experimental models for specific genes. Most importantly, multisystem approaches would encourage transdisciplinary critical thinking and hypothesis testing that is inherently slow in current biological research.
Collapse
Affiliation(s)
- Kimberly A Reynolds
- The Green Center for Systems Biology and the Department of Biophysics, The University of Texas Southwestern Medical Center, Dallas, TX 75390, USA
| | - Eduardo Rosa-Molinar
- Department of Pharmacology & Toxicology, The University of Kansas, Lawrence, KS 66047, USA
| | - Robert E Ward
- Department of Biology, Case Western Reserve University, Cleveland, OH 44106, USA
| | - Hongbin Zhang
- Department of Soil and Crop Sciences, Texas A&M University, College Station, TX 77843, USA
| | - Breeanna R Urbanowicz
- Department of Biochemistry and Molecular Biology, University of Georgia, Athens, Georgia 30602, USA
| | - A Mark Settles
- Bioengineering Branch, NASA Ames Research Center, Moffett Field, CA USA
| |
Collapse
|
5
|
The UniProt Consortium, Bateman A, Martin MJ, Orchard S, Magrane M, Agivetova R, Ahmad S, Alpi E, Bowler-Barnett EH, Britto R, Bursteinas B, Bye-A-Jee H, Coetzee R, Cukura A, Da Silva A, Denny P, Dogan T, Ebenezer T, Fan J, Castro LG, Garmiri P, Georghiou G, Gonzales L, Hatton-Ellis E, Hussein A, Ignatchenko A, Insana G, Ishtiaq R, Jokinen P, Joshi V, Jyothi D, Lock A, Lopez R, Luciani A, Luo J, Lussi Y, MacDougall A, Madeira F, Mahmoudy M, Menchi M, Mishra A, Moulang K, Nightingale A, Oliveira CS, Pundir S, Qi G, Raj S, Rice D, Lopez MR, Saidi R, Sampson J, Sawford T, Speretta E, Turner E, Tyagi N, Vasudev P, Volynkin V, Warner K, Watkins X, Zaru R, Zellner H, Bridge A, Poux S, Redaschi N, Aimo L, Argoud-Puy G, Auchincloss A, Axelsen K, Bansal P, Baratin D, Blatter MC, Bolleman J, Boutet E, Breuza L, Casals-Casas C, de Castro E, Echioukh KC, Coudert E, Cuche B, Doche M, Dornevil D, Estreicher A, Famiglietti ML, Feuermann M, Gasteiger E, Gehant S, Gerritsen V, Gos A, Gruaz-Gumowski N, Hinz U, Hulo C, Hyka-Nouspikel N, Jungo F, Keller G, Kerhornou A, Lara V, Le Mercier P, Lieberherr D, Lombardot T, Martin X, et alThe UniProt Consortium, Bateman A, Martin MJ, Orchard S, Magrane M, Agivetova R, Ahmad S, Alpi E, Bowler-Barnett EH, Britto R, Bursteinas B, Bye-A-Jee H, Coetzee R, Cukura A, Da Silva A, Denny P, Dogan T, Ebenezer T, Fan J, Castro LG, Garmiri P, Georghiou G, Gonzales L, Hatton-Ellis E, Hussein A, Ignatchenko A, Insana G, Ishtiaq R, Jokinen P, Joshi V, Jyothi D, Lock A, Lopez R, Luciani A, Luo J, Lussi Y, MacDougall A, Madeira F, Mahmoudy M, Menchi M, Mishra A, Moulang K, Nightingale A, Oliveira CS, Pundir S, Qi G, Raj S, Rice D, Lopez MR, Saidi R, Sampson J, Sawford T, Speretta E, Turner E, Tyagi N, Vasudev P, Volynkin V, Warner K, Watkins X, Zaru R, Zellner H, Bridge A, Poux S, Redaschi N, Aimo L, Argoud-Puy G, Auchincloss A, Axelsen K, Bansal P, Baratin D, Blatter MC, Bolleman J, Boutet E, Breuza L, Casals-Casas C, de Castro E, Echioukh KC, Coudert E, Cuche B, Doche M, Dornevil D, Estreicher A, Famiglietti ML, Feuermann M, Gasteiger E, Gehant S, Gerritsen V, Gos A, Gruaz-Gumowski N, Hinz U, Hulo C, Hyka-Nouspikel N, Jungo F, Keller G, Kerhornou A, Lara V, Le Mercier P, Lieberherr D, Lombardot T, Martin X, Masson P, Morgat A, Neto TB, Paesano S, Pedruzzi I, Pilbout S, Pourcel L, Pozzato M, Pruess M, Rivoire C, Sigrist C, Sonesson K, Stutz A, Sundaram S, Tognolli M, Verbregue L, Wu CH, Arighi CN, Arminski L, Chen C, Chen Y, Garavelli JS, Huang H, Laiho K, McGarvey P, Natale DA, Ross K, Vinayaka CR, Wang Q, Wang Y, Yeh LS, Zhang J, Ruch P, Teodoro D. UniProt: the universal protein knowledgebase in 2021. Nucleic Acids Res 2021; 49:D480-D489. [PMID: 33237286 PMCID: PMC7778908 DOI: 10.1093/nar/gkaa1100] [Show More Authors] [Citation(s) in RCA: 4192] [Impact Index Per Article: 1048.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/15/2020] [Revised: 10/21/2020] [Accepted: 11/02/2020] [Indexed: 02/07/2023] Open
Abstract
The aim of the UniProt Knowledgebase is to provide users with a comprehensive, high-quality and freely accessible set of protein sequences annotated with functional information. In this article, we describe significant updates that we have made over the last two years to the resource. The number of sequences in UniProtKB has risen to approximately 190 million, despite continued work to reduce sequence redundancy at the proteome level. We have adopted new methods of assessing proteome completeness and quality. We continue to extract detailed annotations from the literature to add to reviewed entries and supplement these in unreviewed entries with annotations provided by automated systems such as the newly implemented Association-Rule-Based Annotator (ARBA). We have developed a credit-based publication submission interface to allow the community to contribute publications and annotations to UniProt entries. We describe how UniProtKB responded to the COVID-19 pandemic through expert curation of relevant entries that were rapidly made available to the research community through a dedicated portal. UniProt resources are available under a CC-BY (4.0) license via the web at https://www.uniprot.org/.
Collapse
|
6
|
Lock A, Harris MA, Rutherford K, Hayles J, Wood V. Community curation in PomBase: enabling fission yeast experts to provide detailed, standardized, sharable annotation from research publications. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2020; 2020:5827230. [PMID: 32353878 PMCID: PMC7192550 DOI: 10.1093/database/baaa028] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 01/10/2020] [Revised: 02/28/2020] [Accepted: 03/22/2020] [Indexed: 11/22/2022]
Abstract
Maximizing the impact and value of scientific research requires efficient knowledge distribution, which increasingly depends on the integration of standardized published data into online databases. To make data integration more comprehensive and efficient for fission yeast research, PomBase has pioneered a community curation effort that engages publication authors directly in FAIR-sharing of data representing detailed biological knowledge from hypothesis-driven experiments. Canto, an intuitive online curation tool that enables biologists to describe their detailed functional data using shared ontologies, forms the core of PomBase’s system. With 8 years’ experience, and as the author response rate reaches 50%, we review community curation progress and the insights we have gained from the project. We highlight incentives and nudges we deploy to maximize participation, and summarize project outcomes, which include increased knowledge integration and dissemination as well as the unanticipated added value arising from co-curation by publication authors and professional curators.
Collapse
Affiliation(s)
- Antonia Lock
- Department of Genetics, Evolution and Environment, University College London, Gower street, London WC1E 6BT, UK
| | - Midori A Harris
- Department of Biochemistry, University of Cambridge, Tennis Court Road, Cambridge CB2 1GA, UK
| | - Kim Rutherford
- Department of Biochemistry, University of Cambridge, Tennis Court Road, Cambridge CB2 1GA, UK
| | - Jacqueline Hayles
- Department of Biochemistry, University of Cambridge, Tennis Court Road, Cambridge CB2 1GA, UK
| | - Valerie Wood
- Cell Cycle Laboratory, The Francis Crick Institute, Midland Rd, London NW1 1AT, UK
| |
Collapse
|
7
|
Kishore R, Arnaboldi V, Van Slyke CE, Chan J, Nash RS, Urbano JM, Dolan ME, Engel SR, Shimoyama M, Sternberg PW, Genome Resources TAO. Automated generation of gene summaries at the Alliance of Genome Resources. Database (Oxford) 2020; 2020:baaa037. [PMID: 32559296 PMCID: PMC7304461 DOI: 10.1093/database/baaa037] [Citation(s) in RCA: 32] [Impact Index Per Article: 6.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/14/2020] [Revised: 04/06/2020] [Accepted: 04/29/2020] [Indexed: 12/28/2022]
Abstract
Short paragraphs that describe gene function, referred to as gene summaries, are valued by users of biological knowledgebases for the ease with which they convey key aspects of gene function. Manual curation of gene summaries, while desirable, is difficult for knowledgebases to sustain. We developed an algorithm that uses curated, structured gene data at the Alliance of Genome Resources (Alliance; www.alliancegenome.org) to automatically generate gene summaries that simulate natural language. The gene data used for this purpose include curated associations (annotations) to ontology terms from the Gene Ontology, Disease Ontology, model organism knowledgebase (MOK)-specific anatomy ontologies and Alliance orthology data. The method uses sentence templates for each data category included in the gene summary in order to build a natural language sentence from the list of terms associated with each gene. To improve readability of the summaries when numerous gene annotations are present, we developed a new algorithm that traverses ontology graphs in order to group terms by their common ancestors. The algorithm optimizes the coverage of the initial set of terms and limits the length of the final summary, using measures of information content of each ontology term as a criterion for inclusion in the summary. The automated gene summaries are generated with each Alliance release, ensuring that they reflect current data at the Alliance. Our method effectively leverages category-specific curation efforts of the Alliance member databases to create modular, structured and standardized gene summaries for seven member species of the Alliance. These automatically generated gene summaries make cross-species gene function comparisons tenable and increase discoverability of potential models of human disease. In addition to being displayed on Alliance gene pages, these summaries are also included on several MOK gene pages.
Collapse
Affiliation(s)
- Ranjana Kishore
- WormBase, Division of Biology and Biological Engineering, California Institute of Technology, 1200 East California Boulevard, Pasadena, CA 91125, USA
| | - Valerio Arnaboldi
- WormBase, Division of Biology and Biological Engineering, California Institute of Technology, 1200 East California Boulevard, Pasadena, CA 91125, USA
| | - Ceri E Van Slyke
- ZFIN, The Institute of Neuroscience, 222 Huestis Hall, University of Oregon, Eugene, OR 97403-1254, USA
| | - Juancarlos Chan
- WormBase, Division of Biology and Biological Engineering, California Institute of Technology, 1200 East California Boulevard, Pasadena, CA 91125, USA
| | - Robert S Nash
- Saccharomyces Genome Database, Department of Genetics, Stanford University, 3165 Porter Drive, Palo Alto, CA 94304, USA
| | - Jose M Urbano
- FlyBase, Department of Physiology, Development and Neuroscience, 7 Downing Pl, University of Cambridge, Cambridge CB2 3DY, UK
| | - Mary E Dolan
- MGI, The Jackson Laboratory, Bar Harbor, ME 04609, USA
| | - Stacia R Engel
- Saccharomyces Genome Database, Department of Genetics, Stanford University, 3165 Porter Drive, Palo Alto, CA 94304, USA
| | - Mary Shimoyama
- Rat Genome Database, Department of Biomedical Engineering, Medical College of Wisconsin and Marquette University, 8701 Watertown Plank Road, Milwaukee, WI 53226, USA
| | - Paul W Sternberg
- WormBase, Division of Biology and Biological Engineering, California Institute of Technology, 1200 East California Boulevard, Pasadena, CA 91125, USA
| | | |
Collapse
|