1
Pop M, Attwood TK, Blake JA, Bourne PE, Conesa A, Gaasterland T, Hunter L, Kingsford C, Kohlbacher O, Lengauer T, Markel S, Moreau Y, Noble WS, Orengo C, Ouellette BFF, Parida L, Przulj N, Przytycka TM, Ranganathan S, Schwartz R, Valencia A, Warnow T. Biological databases in the age of generative artificial intelligence. Bioinformatics Advances 2025; 5:vbaf044. [PMID: 40177265 PMCID: PMC11964588 DOI: 10.1093/bioadv/vbaf044] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/02/2024] [Revised: 01/16/2025] [Accepted: 03/05/2025] [Indexed: 04/05/2025]
Abstract
Summary: Modern biological research critically depends on public databases. The introduction and propagation of errors within and across databases can lead to wasted resources as scientists are led astray by bad data or have to conduct expensive validation experiments. The emergence of generative artificial intelligence systems threatens to compound this problem owing to the ease with which massive volumes of synthetic data can be generated. We provide an overview of several key issues that occur within the biological data ecosystem and make several recommendations aimed at reducing data errors and their propagation. We specifically highlight the critical importance of improved educational programs aimed at biologists and life scientists that emphasize best practices in data engineering. We also argue for increased theoretical and empirical research on data provenance, error propagation, and on understanding the impact of errors on analytic pipelines. Furthermore, we recommend enhanced funding for the stewardship and maintenance of public biological databases. Availability and implementation: Not applicable.
Affiliation(s)
- Mihai Pop
  - Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD 20742, United States
- Teresa K Attwood
  - Department of Computer Science, The University of Manchester, Manchester M13 9PL, United Kingdom
- Judith A Blake
  - The Jackson Laboratory, Bar Harbor, ME 04609, United States
- Philip E Bourne
  - School of Data Science, The University of Virginia, Charlottesville, VA 22904, United States
- Ana Conesa
  - Institute for Integrative Systems Biology, Spanish National Research Council, Paterna 46980, Spain
- Terry Gaasterland
  - Bioinformatics & Systems Biology Graduate Program, La Jolla, CA 92093, United States
- Lawrence Hunter
  - Department of Pediatrics, University of Chicago, Chicago, IL 60637, United States
- Carl Kingsford
  - Ray and Stephanie Lane Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, United States
- Oliver Kohlbacher
  - Institute for Bioinformatics and Medical Informatics, University of Tübingen, Tübingen 72076, Germany
- Thomas Lengauer
  - Max Planck Institute for Informatics and Saarland Informatics Campus, Saarbrücken 66123, Germany
- Scott Markel
  - Dassault Systèmes BIOVIA, San Diego, CA 92121, United States
- Yves Moreau
  - Elektrotechniek ESAT-STADIUS, University of Leuven, Leuven 3000, Belgium
- William S Noble
  - Department of Genome Sciences, University of Washington, Seattle, WA 98195, United States
- Christine Orengo
  - Department of Structural and Molecular Biology, University College London, London WC1E 6BT, United Kingdom
- Laxmi Parida
  - IBM T J Watson Research, Yorktown Heights, NY 10598, United States
- Natasa Przulj
  - Computational Biology Department, Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi SE45 05, United Arab Emirates
  - Barcelona Supercomputing Center, Barcelona 08034, Spain
  - Institución Catalana de Investigación y Estudios Avanzados (ICREA), Barcelona 08010, Spain
  - Department of Computer Science, University College London, London WC1E 6EA, United Kingdom
- Teresa M Przytycka
  - Computational Biology Branch, Division of Intramural Research, National Library of Medicine, Bethesda, MD 20894, United States
- Shoba Ranganathan
  - Department of Chemistry and Biomolecular Sciences, Macquarie University, Sydney, NSW 2109, Australia
- Russell Schwartz
  - Ray and Stephanie Lane Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, United States
  - Department of Biological Sciences, Carnegie Mellon University, Pittsburgh, PA 15213, United States
- Alfonso Valencia
  - Barcelona Supercomputing Center, Barcelona 08034, Spain
  - Institución Catalana de Investigación y Estudios Avanzados (ICREA), Barcelona 08010, Spain
- Tandy Warnow
  - School of Computing and Data Science, University of Illinois Urbana-Champaign, Urbana, IL 61801, United States
2
de Crécy-Lagard V, Dias R, Friedberg I, Yuan Y, Swairjo MA. Limitations of Current Machine-Learning Models in Predicting Enzymatic Functions for Uncharacterized Proteins. bioRxiv 2024:2024.07.01.601547. [PMID: 39005379 PMCID: PMC11244979 DOI: 10.1101/2024.07.01.601547] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/16/2024]
Abstract
Thirty to seventy percent of proteins in any given genome have no assigned function and have been labeled as the protein "unknome". This large knowledge gap prevents the biological community from fully leveraging the plethora of genomic data that is now available. Machine-learning approaches are showing some promise in propagating functional knowledge from experimentally characterized proteins to the correct set of isofunctional orthologs. However, they largely fail to predict enzymatic functions unseen in the training set, as shown by dissecting the predictions made for over 450 enzymes of unknown function from the model bacteria Escherichia coli using the DeepECTransformer platform. Lessons from these failures can help the community develop machine-learning methods that assist domain experts in making testable functional predictions for more members of the uncharacterized proteome.
Article Summary: Many proteins in any genome, ranging from 30 to 70%, lack an assigned function. This knowledge gap limits the full use of the vast available genomic data. Machine learning has shown promise in transferring functional knowledge from proteins of known functions to similar ones, but largely fails to predict novel functions not seen in its training data. Understanding these failures can guide the development of better machine-learning methods to help experts make accurate functional predictions for uncharacterized proteins.
3
Ullah S, Rahman W, Ullah F, Ullah A, Ahmad G, Ijaz M, Ullah H, Sharafmal DM. The HABD: Home of All Biological Databases Empowering Biological Research With Cutting-Edge Database Systems. Curr Protoc 2024; 4:e1063. [PMID: 38808697 DOI: 10.1002/cpz1.1063] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/30/2024]
Abstract
The emergence of computer technologies and computing power has led to the development of several database systems that provide standardized access to vast quantities of data, making it possible to collect, search, index, evaluate, and extract useful knowledge across various fields. The Home of All Biological Databases (HABD) has been established as a continually expanding platform that aims to store, organize, and distribute biological data in a searchable manner, removing all dead and non-accessible data. The platform meticulously categorizes data into various categories, such as COVID-19 Pandemic Database (CO-19PDB), Database relevant to Human Research (DBHR), Cancer Research Database (CRDB), Latest Database of Protein Research (LDBPR), Fungi Databases Collection (FDBC), and many other databases that are categorized based on biological phenomena. It currently provides a total of 22 databases, including 6 published, 5 submitted, and the remaining in various stages of development. These databases encompass a range of areas, including phytochemical-specific and plastic biodegradation databases. HABD is equipped with search engine optimization (SEO) analyzer and Neil Patel tools, which ensure excellent SEO and high-speed value. With timely updates, HABD aims to facilitate the processing and visualization of data for scientists, providing a one-stop-shop for all biological databases. Computer platforms, such as PHP, HTML, CSS, JavaScript and Biopython, are used to build all the databases. © 2024 Wiley Periodicals LLC.
Affiliation(s)
- Shahid Ullah
  - S-Khan Lab, Mardan, Khyber Pakhtunkhwa, Pakistan
- Farhan Ullah
  - S-Khan Lab, Mardan, Khyber Pakhtunkhwa, Pakistan
- Anees Ullah
  - S-Khan Lab, Mardan, Khyber Pakhtunkhwa, Pakistan
- Gulzar Ahmad
  - S-Khan Lab, Mardan, Khyber Pakhtunkhwa, Pakistan
- Hameed Ullah
  - S-Khan Lab, Mardan, Khyber Pakhtunkhwa, Pakistan
4
Dembech E, Malatesta M, De Rito C, Mori G, Cavazzini D, Secchi A, Morandin F, Percudani R. Identification of hidden associations among eukaryotic genes through statistical analysis of coevolutionary transitions. Proc Natl Acad Sci U S A 2023; 120:e2218329120. [PMID: 37043529 PMCID: PMC10120013 DOI: 10.1073/pnas.2218329120] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Received: 10/28/2022] [Accepted: 03/10/2023] [Indexed: 04/13/2023] Open
Abstract
Coevolution at the gene level, as reflected by correlated events of gene loss or gain, can be revealed by phylogenetic profile analysis. The optimal method and metric for comparing phylogenetic profiles, especially in eukaryotic genomes, are not yet established. Here, we describe a procedure suitable for large-scale analysis, which can reveal coevolution based on the assessment of the statistical significance of correlated presence/absence transitions between gene pairs. This metric can identify coevolution in profiles with low overall similarities and is not affected by similarities lacking coevolutionary information. We applied the procedure to a large collection of 60,912 orthologous gene groups (orthogroups) in 1,264 eukaryotic genomes extracted from OrthoDB. We found significant cotransition scores for 7,825 orthogroups associated in 2,401 coevolving modules linking known and unknown genes in protein complexes and biological pathways. To demonstrate the ability of the method to predict hidden gene associations, we validated through experiments the involvement of vertebrate malate synthase-like genes in the conversion of (S)-ureidoglycolate into glyoxylate and urea, the last step of purine catabolism. This identification explains the presence of glyoxylate cycle genes in metazoa and suggests an anaplerotic role of purine degradation in early eukaryotes.
Affiliation(s)
- Elena Dembech
  - Department of Chemistry, Life Sciences and Environmental Sustainability, University of Parma, Parma 43124, Italy
- Marco Malatesta
  - Department of Chemistry, Life Sciences and Environmental Sustainability, University of Parma, Parma 43124, Italy
- Carlo De Rito
  - Department of Chemistry, Life Sciences and Environmental Sustainability, University of Parma, Parma 43124, Italy
- Giulia Mori
  - Department of Chemistry, Life Sciences and Environmental Sustainability, University of Parma, Parma 43124, Italy
- Davide Cavazzini
  - Department of Chemistry, Life Sciences and Environmental Sustainability, University of Parma, Parma 43124, Italy
- Andrea Secchi
  - Department of Chemistry, Life Sciences and Environmental Sustainability, University of Parma, Parma 43124, Italy
- Francesco Morandin
  - Department of Mathematical, Physical and Computer Sciences, University of Parma, Parma 43124, Italy
- Riccardo Percudani
  - Department of Chemistry, Life Sciences and Environmental Sustainability, University of Parma, Parma 43124, Italy
5
Huynh TN, Stewart V. Purine catabolism by enterobacteria. Adv Microb Physiol 2023; 82:205-266. [PMID: 36948655 DOI: 10.1016/bs.ampbs.2023.01.001] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/13/2023]
Abstract
Purines are abundant among organic nitrogen sources and have high nitrogen content. Accordingly, microorganisms have evolved different pathways to catabolize purines and their metabolic products such as allantoin. Enterobacteria from the genera Escherichia, Klebsiella and Salmonella have three such pathways. First, the HPX pathway, found in the genus Klebsiella and very close relatives, catabolizes purines during aerobic growth, extracting all four nitrogen atoms in the process. This pathway includes several known or predicted enzymes not previously observed in other purine catabolic pathways. Second, the ALL pathway, found in strains from all three species, catabolizes allantoin during anaerobic growth in a branched pathway that also includes glyoxylate assimilation. This allantoin fermentation pathway originally was characterized in a gram-positive bacterium, and therefore is widespread. Third, the XDH pathway, found in strains from Escherichia and Klebsiella spp., at present is ill-defined but likely includes enzymes to catabolize purines during anaerobic growth. Critically, this pathway may include an enzyme system for anaerobic urate catabolism, a phenomenon not previously described. Documenting such a pathway would overturn the long-held assumption that urate catabolism requires oxygen. Overall, this broad capability for purine catabolism during either aerobic or anaerobic growth suggests that purines and their metabolites contribute to enterobacterial fitness in a variety of environments.
Affiliation(s)
- TuAnh Ngoc Huynh
  - Department of Food Science, University of Wisconsin, Madison, WI, United States
- Valley Stewart
  - Department of Microbiology & Molecular Genetics, University of California, Davis, CA, United States
6
de Crécy-Lagard V, Amorin de Hegedus R, Arighi C, Babor J, Bateman A, Blaby I, Blaby-Haas C, Bridge AJ, Burley SK, Cleveland S, Colwell LJ, Conesa A, Dallago C, Danchin A, de Waard A, Deutschbauer A, Dias R, Ding Y, Fang G, Friedberg I, Gerlt J, Goldford J, Gorelik M, Gyori BM, Henry C, Hutinet G, Jaroch M, Karp PD, Kondratova L, Lu Z, Marchler-Bauer A, Martin MJ, McWhite C, Moghe GD, Monaghan P, Morgat A, Mungall CJ, Natale DA, Nelson WC, O’Donoghue S, Orengo C, O’Toole KH, Radivojac P, Reed C, Roberts RJ, Rodionov D, Rodionova IA, Rudolf JD, Saleh L, Sheynkman G, Thibaud-Nissen F, Thomas PD, Uetz P, Vallenet D, Carter EW, Weigele PR, Wood V, Wood-Charlson EM, Xu J. A roadmap for the functional annotation of protein families: a community perspective. Database (Oxford) 2022; 2022:baac062. [PMID: 35961013 PMCID: PMC9374478 DOI: 10.1093/database/baac062] [Citation(s) in RCA: 15] [Impact Index Per Article: 5.0] [Received: 06/07/2022] [Revised: 06/28/2022] [Accepted: 08/03/2022] [Indexed: 12/23/2022]
Abstract
Over the last 25 years, biology has entered the genomic era and is becoming a science of 'big data'. Most interpretations of genomic analyses rely on accurate functional annotations of the proteins encoded by more than 500 000 genomes sequenced to date. By different estimates, only half the predicted sequenced proteins carry an accurate functional annotation, and this percentage varies drastically between different organismal lineages. Such a large gap in knowledge hampers all aspects of biological enterprise and, thereby, is standing in the way of genomic biology reaching its full potential. A brainstorming meeting to address this issue funded by the National Science Foundation was held during 3-4 February 2022. Bringing together data scientists, biocurators, computational biologists and experimentalists within the same venue allowed for a comprehensive assessment of the current state of functional annotations of protein families. Further, major issues that were obstructing the field were identified and discussed, which ultimately allowed for the proposal of solutions on how to move forward.
Affiliation(s)
- Valérie de Crécy-Lagard
  - Department of Microbiology and Cell Sciences, University of Florida, Gainesville, FL 32611, USA
- Cecilia Arighi
  - Department of Computer and Information Sciences, University of Delaware, Newark, DE 19713, USA
- Jill Babor
  - Department of Microbiology and Cell Sciences, University of Florida, Gainesville, FL 32611, USA
- Alex Bateman
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK
- Ian Blaby
  - US Department of Energy Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
- Crysten Blaby-Haas
  - Biology Department, Brookhaven National Laboratory, Upton, NY 11973, USA
- Alan J Bridge
  - Swiss-Prot group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, Geneva 4 CH-1211, Switzerland
- Stephen K Burley
  - RCSB Protein Data Bank, Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA
- Stacey Cleveland
  - Department of Microbiology and Cell Sciences, University of Florida, Gainesville, FL 32611, USA
- Lucy J Colwell
  - Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge CB2 1EW, UK
- Ana Conesa
  - Spanish National Research Council, Institute for Integrative Systems Biology, Paterna, Valencia 46980, Spain
- Christian Dallago
  - TUM (Technical University of Munich) Department of Informatics, Bioinformatics & Computational Biology, i12, Boltzmannstr. 3, Garching/Munich 85748, Germany
- Antoine Danchin
  - School of Biomedical Sciences, Li KaShing Faculty of Medicine, The University of Hong Kong, 21 Sassoon Road, Pokfulam, SAR Hong Kong 999077, China
- Anita de Waard
  - Research Collaboration Unit, Elsevier, Jericho, VT 05465, USA
- Adam Deutschbauer
  - Environmental Genomics and Systems Biology Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
- Raquel Dias
  - Department of Microbiology and Cell Sciences, University of Florida, Gainesville, FL 32611, USA
- Yousong Ding
  - Department of Medicinal Chemistry, Center for Natural Products, Drug Discovery and Development, University of Florida, Gainesville, FL 32610, USA
- Gang Fang
  - NYU-Shanghai, Shanghai 200120, China
- Iddo Friedberg
  - Department of Veterinary Microbiology and Preventive Medicine, Iowa State University, Ames, IA 50011, USA
- John Gerlt
  - Institute for Genomic Biology and Departments of Biochemistry and Chemistry, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
- Joshua Goldford
  - Physics of Living Systems, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
- Mark Gorelik
  - Department of Microbiology and Cell Sciences, University of Florida, Gainesville, FL 32611, USA
- Benjamin M Gyori
  - Laboratory of Systems Pharmacology, Harvard Medical School, Boston, MA 02115, USA
- Christopher Henry
  - Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, IL 60439, USA
- Geoffrey Hutinet
  - Department of Microbiology and Cell Sciences, University of Florida, Gainesville, FL 32611, USA
- Marshall Jaroch
  - Department of Microbiology and Cell Sciences, University of Florida, Gainesville, FL 32611, USA
- Peter D Karp
  - Bioinformatics Research Group, SRI International, Menlo Park, CA 94025, USA
- Zhiyong Lu
  - National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), 8600 Rockville Pike, Bethesda, MD 20817, USA
- Aron Marchler-Bauer
  - National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), 8600 Rockville Pike, Bethesda, MD 20817, USA
- Maria-Jesus Martin
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK
- Claire McWhite
  - Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08540, USA
- Gaurav D Moghe
  - Plant Biology Section, School of Integrative Plant Science, Cornell University, Ithaca, NY 14853, USA
- Paul Monaghan
  - Department of Agricultural Education and Communication, University of Florida, Gainesville, FL 32611, USA
- Anne Morgat
  - Swiss-Prot group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, Geneva 4 CH-1211, Switzerland
- Christopher J Mungall
  - Environmental Genomics and Systems Biology Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
- Darren A Natale
  - Georgetown University Medical Center, Washington, DC 20007, USA
- William C Nelson
  - Biological Sciences Division, Pacific Northwest National Laboratories, Richland, WA 99354, USA
- Seán O’Donoghue
  - School of Biotechnology and Biomolecular Sciences, University of NSW, Sydney, NSW 2052, Australia
- Christine Orengo
  - Department of Structural and Molecular Biology, University College London, London WC1E 6BT, UK
- Predrag Radivojac
  - Khoury College of Computer Sciences, Northeastern University, Boston, MA 02115, USA
- Colbie Reed
  - Department of Microbiology and Cell Sciences, University of Florida, Gainesville, FL 32611, USA
- Dmitri Rodionov
  - Sanford Burnham Prebys Medical Discovery Institute, La Jolla, CA 92037, USA
- Irina A Rodionova
  - Department of Bioengineering, Division of Engineering, University of California at San Diego, La Jolla, CA 92093-0412, USA
- Jeffrey D Rudolf
  - Department of Chemistry, University of Florida, Gainesville, FL 32611, USA
- Lana Saleh
  - New England Biolabs, Ipswich, MA 01938, USA
- Gloria Sheynkman
  - Department of Molecular Physiology and Biological Physics, University of Virginia, Charlottesville, VA, USA
- Francoise Thibaud-Nissen
  - National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), 8600 Rockville Pike, Bethesda, MD 20817, USA
- Paul D Thomas
  - Department of Population and Public Health Sciences, University of Southern California, Los Angeles, CA 90033, USA
- Peter Uetz
  - Center for Biological Data Science, Virginia Commonwealth University, Richmond, VA 23284, USA
- David Vallenet
  - LABGeM, Génomique Métabolique, CEA, Genoscope, Institut François Jacob, Université d’Évry, Université Paris-Saclay, CNRS, Evry 91057, France
- Erica Watson Carter
  - Department of Plant Pathology, University of Florida Citrus Research and Education Center, 700 Experiment Station Rd., Lake Alfred, FL 33850, USA
- Valerie Wood
  - Department of Biochemistry, University of Cambridge, Cambridge CB2 1GA, UK
- Elisha M Wood-Charlson
  - Environmental Genomics and Systems Biology Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
- Jin Xu
  - Department of Plant Pathology, University of Florida Citrus Research and Education Center, 700 Experiment Station Rd., Lake Alfred, FL 33850, USA
7
Béchade B, Hu Y, Sanders JG, Cabuslay CS, Łukasik P, Williams BR, Fiers VJ, Lu R, Wertz JT, Russell JA. Turtle ants harbor metabolically versatile microbiomes with conserved functions across development and phylogeny. FEMS Microbiol Ecol 2022; 98:6602351. [PMID: 35660864 DOI: 10.1093/femsec/fiac068] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/07/2022] [Revised: 05/16/2022] [Accepted: 06/01/2022] [Indexed: 11/14/2022] Open
Abstract
Gut bacterial symbionts can support animal nutrition by facilitating digestion and providing valuable metabolites. However, changes in symbiotic roles between immature and adult stages are not well documented, especially in ants. Here, we explored the metabolic capabilities of microbiomes sampled from herbivorous turtle ant (Cephalotes sp.) larvae and adult workers through (meta)genomic screening and in vitro metabolic assays. We reveal that larval guts harbor bacterial symbionts with impressive metabolic capabilities, including catabolism of plant and fungal recalcitrant dietary fibers and energy-generating fermentation. Additionally, several members of the specialized adult gut microbiome, sampled downstream of an anatomical barrier that dams large food particles, show a conserved potential to depolymerize many dietary fibers. Symbionts from both life stages have the genomic capacity to recycle nitrogen and synthesize amino acids and B-vitamins. With help of their gut symbionts, including several bacteria likely acquired from the environment, turtle ant larvae may aid colony digestion and contribute to colony-wide nitrogen, B-vitamin and energy budgets. In addition, the conserved nature of the digestive capacities among adult-associated symbionts suggests that nutritional ecology of turtle ant colonies has long been shaped by specialized, behaviorally-transferred gut bacteria with over 45 million years of residency.
Affiliation(s)
- Benoît Béchade
  - Department of Biology, Drexel University, Philadelphia, Pennsylvania, United States of America
- Yi Hu
  - Department of Biology, Drexel University, Philadelphia, Pennsylvania, United States of America
  - State Key Laboratory of Earth Surface Processes and Resource Ecology and Ministry of Education Key Laboratory for Biodiversity Science and Ecological Engineering, College of Life Sciences, Beijing Normal University, Beijing, China
- Jon G Sanders
  - Department of Ecology and Evolutionary Biology, Cornell University, Ithaca, New York, United States of America
- Christian S Cabuslay
  - Department of Biology, Drexel University, Philadelphia, Pennsylvania, United States of America
- Piotr Łukasik
  - Institute of Environmental Sciences, Jagiellonian University, Kraków, Poland
- Bethany R Williams
  - Department of Biology, Calvin College, Grand Rapids, Michigan, United States of America
- Valerie J Fiers
  - Department of Biology, Drexel University, Philadelphia, Pennsylvania, United States of America
- Richard Lu
  - Department of Biology, Drexel University, Philadelphia, Pennsylvania, United States of America
- John T Wertz
  - Department of Biology, Calvin College, Grand Rapids, Michigan, United States of America
- Jacob A Russell
  - Department of Biology, Drexel University, Philadelphia, Pennsylvania, United States of America
8
Ullah S, Ullah F, Rahman W, Karras DA, Ullah A, Ahmad G, Ijaz M, Gao T. CRDB: A Centralized Cancer Research DataBase and an example use case mining correlation statistics of cancer and covid-19 (Preprint). JMIR Cancer 2021; 8:e35020. [PMID: 35430561 PMCID: PMC9191331 DOI: 10.2196/35020] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/17/2021] [Revised: 02/07/2022] [Accepted: 04/10/2022] [Indexed: 11/13/2022] Open
Affiliation(s)
- Dimitrios A Karras
  - Department General, Faculty of Science, National and Kapodistrian University of Athens, Athens, Greece
- Anees Ullah
  - Kyrgyz State Medical University, Bishkek, Kyrgyzstan
- Tianshun Gao
  - Research Center, The Seventh Affiliated Hospital of Sun Yat-sen University, Shenzhen, China
9
Chiu JKH, Ong RTH. ARGDIT: a validation and integration toolkit for Antimicrobial Resistance Gene Databases. Bioinformatics 2020; 35:2466-2474. [PMID: 30520940 DOI: 10.1093/bioinformatics/bty987] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 06/04/2018] [Revised: 11/10/2018] [Accepted: 12/04/2018] [Indexed: 11/12/2022] Open
Abstract
Motivation: Antimicrobial resistance is currently one of the main challenges in public health due to the excessive use of antimicrobials in medical treatments and agriculture. The advancements in high-throughput next-generation sequencing and development of bioinformatics tools allow simultaneous detection and identification of antimicrobial resistance genes (ARGs) from clinical, food and environment samples, to monitor the prevalence and track the dissemination of these ARGs. Such analyses are however reliant on a comprehensive database of ARGs with accurate sequence content and annotation. Most of the current ARG databases are therefore manually curated, but this is a time-consuming process and the resulting curation errors could be hard to detect. Several secondary ARG databases consolidate contents from different source ARG databases, and hence modifications in the primary databases might not be propagated and updated promptly in the secondary ARG databases. Results: To address these problems, a validation and integration toolkit called ARGDIT was developed to validate ARG database fidelity, and merge multiple primary ARG databases into a single consolidated secondary ARG database with optional automated sequence re-annotation. Experimental results demonstrated the effectiveness of this toolkit in identifying errors such as sequence annotation typos in current ARG databases and generating an integrated non-redundant ARG database with structured annotation. A toolkit-oriented workflow is also proposed to minimize the efforts in validating, curating and merging multiple ARG protein or coding sequence databases. Database developers therefore benefit from faster update cycles and lower costs for database maintenance, while ARG pipeline users can easily evaluate the reference ARG database quality. Availability and implementation: ARGDIT is available at https://github.com/phglab/ARGDIT. Supplementary information: Supplementary data are available at Bioinformatics online.
10
Paight C, Slamovits CH, Saffo MB, Lane CE. Nephromyces Encodes a Urate Metabolism Pathway and Predicted Peroxisomes, Demonstrating That These Are Not Ancient Losses of Apicomplexans. Genome Biol Evol 2019; 11:41-53. [PMID: 30500900 PMCID: PMC6320678 DOI: 10.1093/gbe/evy251] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.8] [Accepted: 11/28/2018] [Indexed: 12/21/2022] Open
Abstract
The phylum Apicomplexa is a quintessentially parasitic lineage, whose members infect a broad range of animals. One exception to this may be the apicomplexan genus Nephromyces, which has been described as having a mutualistic relationship with its host. Here we analyze transcriptome data from Nephromyces and its parasitic sister taxon, Cardiosporidium, revealing an ancestral purine degradation pathway thought to have been lost early in apicomplexan evolution. The predicted localization of many of the purine degradation enzymes to peroxisomes, and the in silico identification of a full set of peroxisome proteins, indicates that loss of both features in other apicomplexans occurred multiple times. The degradation of purines is thought to play a key role in the unusual relationship between Nephromyces and its host. Transcriptome data confirm previous biochemical results of a functional pathway for the utilization of uric acid as a primary nitrogen source for this unusual apicomplexan.
Affiliation(s)
- Claudio H Slamovits
- Department of Biochemistry and Molecular Biology, Dalhousie University, Halifax, Nova Scotia, Canada
- Mary Beth Saffo
- Department of Biological Sciences, University of Rhode Island
- Smithsonian National Museum of Natural History, Washington, District of Columbia
|
11
|
Bouadjenek MR, Verspoor K, Zobel J. Automated detection of records in biological sequence databases that are inconsistent with the literature. J Biomed Inform 2017. [PMID: 28624643 DOI: 10.1016/j.jbi.2017.06.015] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.9] [Indexed: 01/25/2023]
Abstract
We investigate and analyse the data quality of nucleotide sequence databases with the objective of automatic detection of data anomalies and suspicious records. Specifically, we demonstrate that the published literature associated with each data record can be used to automatically evaluate its quality, by cross-checking the consistency of the key content of the database record with the referenced publications. Focusing on GenBank, we describe a set of quality indicators based on the relevance paradigm of information retrieval (IR). Then, we use these quality indicators to train an anomaly detection algorithm to classify records as "confident" or "suspicious". Our experiments on the PubMed Central collection show assessing the coherence between the literature and database records, through our algorithms, is an effective mechanism for assisting curators to perform data cleansing. Although fewer than 0.25% of the records in our data set are known to be faulty, we would expect that there are many more in GenBank that have not yet been identified. By automated comparison with literature they can be identified with a precision of up to 10% and a recall of up to 30%, while strongly outperforming several baselines. While these results leave substantial room for improvement, they reflect both the very imbalanced nature of the data, and the limited explicitly labelled data that is available. Overall, the obtained results show promise for the development of a new kind of approach to detecting low-quality and suspicious sequence records based on literature analysis and consistency. From a practical point of view, this will greatly help curators in identifying inconsistent records in large-scale sequence databases by highlighting records that are likely to be inconsistent with the literature.
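The reported precision is easier to read against the class imbalance. As a rough illustration (not taken from the paper), the sketch below compares the best reported precision with the precision expected from flagging records at random, which equals the fraction of faulty records. The 0.25% base rate is the abstract's upper bound, so the resulting ratio is a lower bound. The function name `lift_over_random` is hypothetical.

```python
def lift_over_random(precision: float, base_rate: float) -> float:
    """Ratio of a detector's precision to the expected precision of random flagging.

    Random flagging has an expected precision equal to the base rate of
    faulty records, so this ratio says how much better than chance the
    detector's flagged set is.
    """
    if not 0 < base_rate <= 1:
        raise ValueError("base_rate must be in (0, 1]")
    return precision / base_rate


# Figures quoted in the abstract; both are bounds, not exact values.
BASE_RATE_UPPER_BOUND = 0.0025  # "fewer than 0.25%" of records known to be faulty
BEST_PRECISION = 0.10           # "precision of up to 10%"

lift = lift_over_random(BEST_PRECISION, BASE_RATE_UPPER_BOUND)
print(f"Flagged records are at least {lift:.0f}x richer in known faults than a random sample")
```

Because the true base rate is below 0.25%, the actual ratio at the best-precision operating point would be somewhat higher than the value computed here.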
Affiliation(s)
- Mohamed Reda Bouadjenek
- Department of Computing and Information Systems, The University of Melbourne, Parkville 3053, Australia.
- Karin Verspoor
- Department of Computing and Information Systems, The University of Melbourne, Parkville 3053, Australia.
- Justin Zobel
- Department of Computing and Information Systems, The University of Melbourne, Parkville 3053, Australia.
|
12
|
Abstract
Genomic studies focus on key metabolites and pathways that, despite their obvious anthropocentric design, keep being 'predicted', while this is only finding again what is already known. As increasingly more genomes are sequenced, this lightpost effect may account at least in part for our failure to understand the function of a continuously growing number of genes. Core metabolism often goes astray, accidentally producing a variety of unexpected compounds. Catabolism of these forgotten metabolites makes an essential part of the functions coded in metagenomes. Here, I explore the fate of a limited number of those: compounds resulting from radical reactions and molecules derived from some reactive intermediates produced during normal metabolism. I try both to update investigators with the most recent literature and to uncover old articles that may open up new research avenues in the genome exploration of metabolism. This should allow us to foresee further developments in experimental genomics and genome annotation.
Affiliation(s)
- Antoine Danchin
- Institute of Cardiometabolism and Nutrition, Hôpital de la Pitié‐Salpêtrière, 47 Boulevard de l'Hôpital, Paris 75013, France
|
13
|
Zallot R, Harrison KJ, Kolaczkowski B, de Crécy-Lagard V. Functional Annotations of Paralogs: A Blessing and a Curse. Life (Basel) 2016; 6:life6030039. [PMID: 27618105 PMCID: PMC5041015 DOI: 10.3390/life6030039] [Citation(s) in RCA: 39] [Impact Index Per Article: 4.3] [Received: 07/01/2016] [Revised: 08/29/2016] [Accepted: 09/02/2016] [Indexed: 12/15/2022] Open
Abstract
Gene duplication followed by mutation is a classic mechanism of neofunctionalization, producing gene families with functional diversity. In some cases, a single point mutation is sufficient to change the substrate specificity and/or the chemistry performed by an enzyme, making it difficult to accurately separate enzymes with identical functions from homologs with different functions. Because sequence similarity is often used as a basis for assigning functional annotations to genes, non-isofunctional gene families pose a great challenge for genome annotation pipelines. Here we describe how integrating evolutionary and functional information such as genome context, phylogeny, metabolic reconstruction and signature motifs may be required to correctly annotate multifunctional families. These integrative analyses can also lead to the discovery of novel gene functions, as hints from specific subgroups can guide the functional characterization of other members of the family. We demonstrate how careful manual curation processes using comparative genomics can disambiguate subgroups within large multifunctional families and discover their functions. We present the COG0720 protein family as a case study. We also discuss strategies to automate this process to improve the accuracy of genome functional annotation pipelines.
Affiliation(s)
- Rémi Zallot
- Department of Microbiology and Cell Science, Institute of Food and Agricultural Sciences, University of Florida, Gainesville, FL 32611, USA.
- Katherine J Harrison
- Department of Microbiology and Cell Science, Institute of Food and Agricultural Sciences, University of Florida, Gainesville, FL 32611, USA.
- Bryan Kolaczkowski
- Department of Microbiology and Cell Science, Institute of Food and Agricultural Sciences, University of Florida, Gainesville, FL 32611, USA.
- Valérie de Crécy-Lagard
- Department of Microbiology and Cell Science, Institute of Food and Agricultural Sciences, University of Florida, Gainesville, FL 32611, USA.
|
14
|
Chen Q, Zobel J, Zhang X, Verspoor K. Supervised Learning for Detection of Duplicates in Genomic Sequence Databases. PLoS One 2016; 11:e0159644. [PMID: 27489953 PMCID: PMC4973881 DOI: 10.1371/journal.pone.0159644] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.0] [Received: 05/05/2016] [Accepted: 07/06/2016] [Indexed: 01/30/2023] Open
Abstract
MOTIVATION First identified as an issue in 1996, duplication in biological databases introduces redundancy and even leads to inconsistency when contradictory information appears. The amount of data makes purely manual de-duplication impractical, and existing automatic systems cannot detect duplicates as precisely as can experts. Supervised learning has the potential to address such problems by building automatic systems that learn from expert curation to detect duplicates precisely and efficiently. While machine learning is a mature approach in other duplicate detection contexts, it has seen only preliminary application in genomic sequence databases. RESULTS We developed and evaluated a supervised duplicate detection method based on an expert curated dataset of duplicates, containing over one million pairs across five organisms derived from genomic sequence databases. We selected 22 features to represent distinct attributes of the database records, and developed a binary model and a multi-class model. Both models achieve promising performance; under cross-validation, the binary model had over 90% accuracy in each of the five organisms, while the multi-class model maintains high accuracy and is more robust in generalisation. We performed an ablation study to quantify the impact of different sequence record features, finding that features derived from meta-data, sequence identity, and alignment quality impact performance most strongly. The study demonstrates machine learning can be an effective additional tool for de-duplication of genomic sequence databases. All Data are available as described in the supplementary material.
Affiliation(s)
- Qingyu Chen
- Department of Computing and Information Systems, The University of Melbourne, Melbourne, Australia
- Justin Zobel
- Department of Computing and Information Systems, The University of Melbourne, Melbourne, Australia
- Xiuzhen Zhang
- School of Science, RMIT University, Melbourne, Australia
- Karin Verspoor
- Department of Computing and Information Systems, The University of Melbourne, Melbourne, Australia
|
15
|
Promponas VJ, Iliopoulos I, Ouzounis CA. Annotation inconsistencies beyond sequence similarity-based function prediction - phylogeny and genome structure. Stand Genomic Sci 2015; 10:108. [PMID: 26594309 PMCID: PMC4653902 DOI: 10.1186/s40793-015-0101-2] [Citation(s) in RCA: 32] [Impact Index Per Article: 3.2] [Received: 04/05/2015] [Accepted: 11/11/2015] [Indexed: 12/15/2022] Open
Abstract
The function annotation process in computational biology has increasingly shifted from the traditional characterization of individual biochemical roles of protein molecules to the system-wide detection of entire metabolic pathways and genomic structures. The so-called genome-aware methods broaden misannotation inconsistencies in genome sequences beyond protein function assignments, encompassing phylogenetic anomalies and artifactual genomic regions. We outline three categories of error propagation in databases by providing striking examples – at various levels of appreciation by the community from traditional to emerging, thus raising awareness for future solutions.
Affiliation(s)
- Vasilis J Promponas
- Bioinformatics Research Laboratory, Department of Biological Sciences, University of Cyprus, PO Box 20537, CY-1678 Nicosia, Cyprus
- Ioannis Iliopoulos
- Division of Medical Sciences, University of Crete Medical School, GR-71110 Heraklion, Greece
- Christos A Ouzounis
- Biological Computation & Process Laboratory (BCPL), Chemical Process & Energy Resources Institute (CPERI), Centre for Research & Technology Hellas (CERTH), PO Box 361, GR-57001 Thessalonica, Greece
16
|
Purine utilization proteins in the Eurotiales: Cellular compartmentalization, phylogenetic conservation and divergence. Fungal Genet Biol 2014; 69:96-108. [DOI: 10.1016/j.fgb.2014.06.005] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.3] [Received: 04/29/2014] [Revised: 05/29/2014] [Accepted: 06/10/2014] [Indexed: 12/28/2022]
|
17
|
Keseler IM, Skrzypek M, Weerasinghe D, Chen AY, Fulcher C, Li GW, Lemmer KC, Mladinich KM, Chow ED, Sherlock G, Karp PD. Curation accuracy of model organism databases. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2014; 2014:bau058. [PMID: 24923819 PMCID: PMC4207230 DOI: 10.1093/database/bau058] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.0] [Indexed: 11/14/2022]
Abstract
Manual extraction of information from the biomedical literature, or biocuration, is the central methodology used to construct many biological databases. For example, the UniProt protein database, the EcoCyc Escherichia coli database and the Candida Genome Database (CGD) are all based on biocuration. Biological databases are used extensively by life science researchers, as online encyclopedias, as aids in the interpretation of new experimental data and as golden standards for the development of new bioinformatics algorithms. Although manual curation has been assumed to be highly accurate, we are aware of only one previous study of biocuration accuracy. We assessed the accuracy of EcoCyc and CGD by manually selecting curated assertions within randomly chosen EcoCyc and CGD gene pages and by then validating that the data found in the referenced publications supported those assertions. A database assertion is considered to be in error if that assertion could not be found in the publication cited for that assertion. We identified 10 errors in the 633 facts that we validated across the two databases, for an overall error rate of 1.58%, and individual error rates of 1.82% for CGD and 1.40% for EcoCyc. These data suggest that manual curation of the experimental literature by Ph.D-level scientists is highly accurate. Database URL: http://ecocyc.org/, http://www.candidagenome.org/
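The error-rate arithmetic in this abstract can be checked directly. The sketch below recomputes the overall rate from the stated totals (10 errors in 633 facts), then searches for per-database counts that would reproduce the two rounded rates. The abstract does not give per-database counts, so the search result is only an inference about which split is consistent with the published percentages. The function names are hypothetical.

```python
def error_rate_percent(errors: int, facts: int) -> float:
    """Error rate as a percentage of validated facts."""
    return 100.0 * errors / facts


TOTAL_ERRORS = 10   # stated in the abstract
TOTAL_FACTS = 633   # stated in the abstract

overall = error_rate_percent(TOTAL_ERRORS, TOTAL_FACTS)
print(f"overall error rate: {overall:.2f}%")


def consistent_splits(cgd_rate: str, ecocyc_rate: str) -> list[tuple[int, int, int, int]]:
    """Enumerate (cgd_errors, cgd_facts, ecocyc_errors, ecocyc_facts) tuples
    whose totals match the abstract and whose rates, formatted to two
    decimal places, match the reported per-database percentages."""
    matches = []
    for cgd_errors in range(TOTAL_ERRORS + 1):
        ecocyc_errors = TOTAL_ERRORS - cgd_errors
        # Each database must contribute at least one validated fact.
        for cgd_facts in range(1, TOTAL_FACTS):
            ecocyc_facts = TOTAL_FACTS - cgd_facts
            cgd = f"{error_rate_percent(cgd_errors, cgd_facts):.2f}"
            eco = f"{error_rate_percent(ecocyc_errors, ecocyc_facts):.2f}"
            if cgd == cgd_rate and eco == ecocyc_rate:
                matches.append((cgd_errors, cgd_facts, ecocyc_errors, ecocyc_facts))
    return matches


# Reported rates: 1.82% for CGD and 1.40% for EcoCyc.
splits = consistent_splits("1.82", "1.40")
print(splits)
```

Comparing formatted strings rather than floats avoids spurious mismatches from floating-point rounding when matching against the two-decimal figures.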
Affiliation(s)
- Ingrid M Keseler, Marek Skrzypek, Deepika Weerasinghe, Albert Y Chen, Carol Fulcher, Gene-Wei Li, Kimberly C Lemmer, Katherine M Mladinich, Edmond D Chow, Gavin Sherlock, Peter D Karp
- Bioinformatics Research Group, Artificial Intelligence Center, SRI International, CA, USA, Department of Genetics, Stanford University, CA 94305, USA, Department of Bacteriology, University of Wisconsin, WI 53706-1521, USA, Department of Cellular and Molecular Pharmacology, University of California at San Francisco, CA 94158-2140, USA, DOE Great Lakes Bioenergy Research Center, Wisconsin Energy Institute, WI 53726, USA and Department of Medical Microbiology and Immunology, University of Wisconsin, WI 53706-1521, USA
|
18
|
Poux S, Magrane M, Arighi CN, Bridge A, O'Donovan C, Laiho K. Expert curation in UniProtKB: a case study on dealing with conflicting and erroneous data. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2014; 2014:bau016. [PMID: 24622611 PMCID: PMC3950660 DOI: 10.1093/database/bau016] [Citation(s) in RCA: 69] [Impact Index Per Article: 6.3] [Indexed: 11/13/2022]
Abstract
UniProtKB/Swiss-Prot provides expert curation with information extracted from literature and curator-evaluated computational analysis. As knowledgebases continue to play an increasingly important role in scientific research, a number of studies have evaluated their accuracy and revealed various errors. While some are curation errors, others are the result of incorrect information published in the scientific literature. By taking the example of sirtuin-5, a complex annotation case, we will describe the curation procedure of UniProtKB/Swiss-Prot and detail how we report conflicting information in the database. We will demonstrate the importance of collaboration between resources to ensure curation consistency and the value of contributions from the user community in helping maintain error-free resources. Database URL: www.uniprot.org
Affiliation(s)
- Sylvain Poux
- SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, 1 rue Michel Servet, 1211 Geneva 4, Switzerland, European Molecular Biology Laboratory (EMBL), European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK, Protein Information Resource, University of Delaware, 15 Innovation Way, Suite 205, Newark, DE 19711, USA and Protein Information Resource, Georgetown University Medical Center, 3300 Whitehaven Street North West, Suite 1200, Washington, DC 20007, USA
|