1
Phan A, Joshi P, Kadelka C, Friedberg I. A longitudinal analysis of function annotations of the human proteome reveals consistently high biases. Database (Oxford) 2025; 2025:baaf036. [PMID: 40338520 PMCID: PMC12060720 DOI: 10.1093/database/baaf036] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/18/2024] [Revised: 02/28/2025] [Accepted: 04/08/2025] [Indexed: 05/09/2025]
Abstract
The resources required to study gene function are limited, especially when considering the number of genes in the human genome and the complexity of their function. Therefore, genes are prioritized for experimental studies based on many different considerations, including, but not limited to, perceived biomedical importance, such as disease-associated genes, or the understanding of biological processes, such as cell signalling pathways. At the same time, most genes are not studied or are under-characterized, which hampers our understanding of their function and potential effects on human health and wellness. Understanding function annotation disparity is a necessary first step toward understanding how much functional knowledge is gained from the human genome, and toward guidelines for better targeting future studies of the genes in the human genome. Here, we present a comprehensive longitudinal analysis of the human proteome utilizing data analysis tools from economics and information theory. Specifically, we view the human proteome as a population of proteins within a knowledge economy: we treat the quantified knowledge of a protein's function as the analogue of wealth and examine the distribution of information across the proteins in the proteome in the same manner as the distribution of wealth is studied in societies. Our results show a highly skewed distribution of information about human proteins over the last decade, in which the inequality in the annotations given to the proteins remains high. Additionally, we examine the correlation between the knowledge about protein function as captured in databases and the interest in proteins as reflected by mentions in the scientific literature. We show a large gap between knowledge and interest and dissect the factors leading to this gap.
In conclusion, our study shows that research efforts should be redirected to less studied proteins to mitigate the disparity among human proteins both in databases and literature.
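The wealth analogy in this abstract can be illustrated with a small sketch. The abstract does not name the specific inequality measures used, so the Gini coefficient below is an assumption, chosen only because it is a common economics measure of how unevenly a quantity is distributed. The annotation counts are hypothetical and are not data from the paper.

```python
from typing import Sequence


def gini(values: Sequence[float]) -> float:
    """Gini coefficient of non-negative values (0 = perfectly equal)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted form: G = 2 * sum(i * x_i) / (n * total) - (n + 1) / n,
    # where x_i are sorted ascending and ranks i start at 1.
    weighted = sum(i * x for i, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1) / n


# Hypothetical per-protein annotation counts (illustration only).
evenly_annotated = [5, 5, 5, 5]
skewed = [0, 0, 0, 20]

print(gini(evenly_annotated))  # 0.0
print(gini(skewed))            # 0.75
```

In this toy case, all annotations concentrated on one of four proteins gives 0.75, while an even spread gives 0.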
Affiliation(s)
- An Phan
  - Program in Bioinformatics and Computational Biology, Iowa State University, Ames, IA, United States
  - Department of Mathematics, Iowa State University, Ames, IA, United States
- Parnal Joshi
  - Program in Bioinformatics and Computational Biology, Iowa State University, Ames, IA, United States
  - Department of Veterinary Microbiology and Preventive Medicine, Iowa State University, Ames, IA, United States
- Claus Kadelka
  - Department of Mathematics, Iowa State University, Ames, IA, United States
- Iddo Friedberg
  - Department of Veterinary Microbiology and Preventive Medicine, Iowa State University, Ames, IA, United States
2
Davies SR. Working in biocuration: contemporary experiences and perspectives. Database (Oxford) 2025; 2025:baaf003. [PMID: 39937660 PMCID: PMC11817794 DOI: 10.1093/database/baaf003] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/12/2024] [Revised: 01/02/2025] [Accepted: 01/08/2025] [Indexed: 02/14/2025]
Abstract
This perspective article synthesizes current knowledge about biocuration as a career and the challenges facing the field. It draws on existing literature and ongoing qualitative research to discuss the nature of biocuration, biocurators' career trajectories, key challenges that biocurators face, and strategies for overcoming these. Overall, biocurators express a high degree of satisfaction with their work and see it as central to the wider biosciences. The central challenges that they face relate to the underfunding and under-recognition of this work, meaning that there is minimal stable funding for the field and that the work of human biocurators is often invisible to those who use curated resources. The article closes by critically discussing existing and potential strategies for responding to these challenges.
Affiliation(s)
- Sarah R Davies
  - Department of Science and Technology Studies, University of Vienna, Universitätsstraße 7, 6. Stock (NIG), Vienna, 1010, Austria
3
Smith N, Yuan X, Melissinos C, Moghe G. FuncFetch: an LLM-assisted workflow enables mining thousands of enzyme-substrate interactions from published manuscripts. Bioinformatics 2024; 41:btae756. [PMID: 39718779 PMCID: PMC11734755 DOI: 10.1093/bioinformatics/btae756] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/22/2024] [Revised: 11/16/2024] [Accepted: 12/20/2024] [Indexed: 12/25/2024] Open
Abstract
MOTIVATION: Thousands of genomes are publicly available; however, most genes in those genomes have poorly defined functions. This is partly due to a gap between previously published, experimentally characterized protein activities and activities deposited in databases. This activity deposition is bottlenecked by the time-consuming biocuration process. The emergence of large language models presents an opportunity to speed up the text-mining of protein activities for biocuration. RESULTS: We developed FuncFetch, a workflow that integrates NCBI E-Utilities, OpenAI's GPT-4, and Zotero, to screen thousands of manuscripts and extract enzyme activities. Extensive validation revealed high precision and recall of GPT-4 in determining whether the abstract of a given paper indicates the presence of a characterized enzyme activity in that paper. Provided the manuscript, FuncFetch extracted data such as species information, enzyme names, sequence identifiers, substrates, and products, which were subjected to extensive quality analyses. Comparison of this workflow against a manually curated dataset of BAHD acyltransferase activities demonstrated a precision/recall of 0.86/0.64 in extracting substrates. We further deployed FuncFetch on nine large plant enzyme families. Screening 26 543 papers, FuncFetch retrieved 32 605 entries from 5459 selected papers. We also identified multiple extraction errors including incorrect associations, nontarget enzymes, and hallucinations, which highlight the need for further manual curation. The BAHD activities were verified, resulting in a comprehensive functional fingerprint of this family and revealing that ∼70% of the experimentally characterized enzymes are uncurated in the public domain. FuncFetch represents an advance in biocuration and lays the groundwork for predicting the functions of uncharacterized enzymes.
AVAILABILITY AND IMPLEMENTATION: Code and minimally curated activities are available at: https://github.com/moghelab/funcfetch and https://tools.moghelab.org/funczymedb.
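For readers less familiar with the precision/recall figures quoted in this abstract, the sketch below shows how such values are typically computed from counts of correct and incorrect extractions. It is a generic illustration and not the FuncFetch evaluation code. The function names and the counts are hypothetical, and the F1 value is derived here from the reported 0.86/0.64 by arithmetic; the abstract itself does not report it.

```python
from typing import Tuple


def precision_recall(tp: int, fp: int, fn: int) -> Tuple[float, float]:
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 8 correct extractions, 2 spurious, 4 missed.
p, r = precision_recall(tp=8, fp=2, fn=4)
print(p, r)

# F1 implied by the precision/recall quoted in the abstract.
print(round(f1_score(0.86, 0.64), 3))  # 0.734
```

A recall below precision, as reported here, means the workflow misses more true substrates than it adds wrong ones.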
Affiliation(s)
- Nathaniel Smith
  - Plant Biology Section, School of Integrative Plant Science, Cornell University, Ithaca, NY 14853, United States
- Xinyu Yuan
  - Plant Biology Section, School of Integrative Plant Science, Cornell University, Ithaca, NY 14853, United States
- Chesney Melissinos
  - Plant Biology Section, School of Integrative Plant Science, Cornell University, Ithaca, NY 14853, United States
- Gaurav Moghe
  - Plant Biology Section, School of Integrative Plant Science, Cornell University, Ithaca, NY 14853, United States
4
Scorza LC, Zieliński T, Kalita I, Lepore A, El Karoui M, Millar AJ. Daily life in the Open Biologist's second job, as a Data Curator. Wellcome Open Res 2024; 9:523. [PMID: 39360219 PMCID: PMC11445645 DOI: 10.12688/wellcomeopenres.22899.1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 12/03/2024] [Indexed: 10/04/2024] Open
Abstract
Background: Data reusability is the driving force of the research data life cycle. However, implementing strategies to generate reusable data from the data creation to the sharing stages is still a significant challenge. Even when datasets supporting a study are publicly shared, the outputs are often incomplete and/or not reusable. The FAIR (Findable, Accessible, Interoperable, Reusable) principles were published as a general guidance to promote data reusability in research, but the practical implementation of FAIR principles in research groups is still falling behind. In biology, the lack of standard practices for a large diversity of data types, data storage and preservation issues, and the lack of familiarity among researchers are some of the main impeding factors to achieve FAIR data. Past literature describes biological curation from the perspective of data resources that aggregate data, often from publications. Methods: Our team works alongside data-generating, experimental researchers so our perspective aligns with publication authors rather than aggregators. We detail the processes for organizing datasets for publication, showcasing practical examples from data curation to data sharing. We also recommend strategies, tools and web resources to maximize data reusability, while maintaining research productivity. Conclusion: We propose a simple approach to address research data management challenges for experimentalists, designed to promote FAIR data sharing. This strategy not only simplifies data management, but also enhances data visibility, recognition and impact, ultimately benefiting the entire scientific community.
Affiliation(s)
- Livia C.T. Scorza
  - Centre for Engineering Biology and School of Biological Sciences, University of Edinburgh, Edinburgh, Scotland, EH9 3BF, UK
- Tomasz Zieliński
  - Centre for Engineering Biology and School of Biological Sciences, University of Edinburgh, Edinburgh, Scotland, EH9 3BF, UK
- Irina Kalita
  - Centre for Engineering Biology and School of Biological Sciences, University of Edinburgh, Edinburgh, Scotland, EH9 3BF, UK
  - Institute of Cell Biology, School of Biological Sciences, University of Edinburgh, Edinburgh, Scotland, EH9 3JD, UK
  - Center for Synthetic Microbiology (SYNMIKRO), Max Planck Institute for Terrestrial Microbiology, Marburg, Germany
- Alessia Lepore
  - Centre for Engineering Biology and School of Biological Sciences, University of Edinburgh, Edinburgh, Scotland, EH9 3BF, UK
  - Institute of Cell Biology, School of Biological Sciences, University of Edinburgh, Edinburgh, Scotland, EH9 3JD, UK
  - Laboratory for Optics and Biosciences, École Polytechnique, Institut Polytechnique de Paris, Palaiseau, Île-de-France, France
- Meriem El Karoui
  - Centre for Engineering Biology and School of Biological Sciences, University of Edinburgh, Edinburgh, Scotland, EH9 3BF, UK
  - Institute of Cell Biology, School of Biological Sciences, University of Edinburgh, Edinburgh, Scotland, EH9 3JD, UK
  - Laboratoire de Biologie et Pharmacologie Appliquée (LBPA), ENS Paris-Saclay CNRS UMR 8113, Paris, Gif-sur-Yvette, France
- Andrew J. Millar
  - Centre for Engineering Biology and School of Biological Sciences, University of Edinburgh, Edinburgh, Scotland, EH9 3BF, UK
5
Bonnici V, Chicco D. Seven quick tips for gene-focused computational pangenomic analysis. BioData Min 2024; 17:28. [PMID: 39227987 PMCID: PMC11370085 DOI: 10.1186/s13040-024-00380-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/27/2024] [Accepted: 08/12/2024] [Indexed: 09/05/2024] Open
Abstract
Pangenomics is a relatively new scientific field which investigates the union of all the genomes of a clade. The word pan means everything in ancient Greek; the term pangenomics originally regarded genomes of bacteria and was later intended to refer to human genomes as well. Modern bioinformatics offers several tools to analyze pangenomics data, paving the way to an emerging field that we can call computational pangenomics. Current computational power available for the bioinformatics community has made computational pangenomic analyses easy to perform, but this higher accessibility to pangenomics analysis also increases the chances of making mistakes and producing misleading or inflated results, especially for beginners. To handle this problem, we present here a few quick tips for efficient and correct computational pangenomic analyses with a focus on bacterial pangenomics, by describing common mistakes to avoid and experienced best practices to follow in this field. We believe our recommendations can help readers perform more robust and sound pangenomic analyses and generate more reliable results.
Affiliation(s)
- Vincenzo Bonnici
  - Dipartimento di Scienze Matematiche Fisiche e Informatiche, Università di Parma, Parma, Italy
- Davide Chicco
  - Dipartimento di Informatica Sistemistica e Comunicazione, Università di Milano-Bicocca, Milan, Italy
  - Institute of Health Policy Management and Evaluation, University of Toronto, Toronto, Ontario, Canada
6
Abdill RJ, Talarico E, Grieneisen L. A how-to guide for code sharing in biology. PLoS Biol 2024; 22:e3002815. [PMID: 39255324 PMCID: PMC11414921 DOI: 10.1371/journal.pbio.3002815] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Revised: 09/20/2024] [Indexed: 09/12/2024] Open
Abstract
In 2024, all biology is computational biology. Computer-aided analysis continues to spread into new fields, becoming more accessible to researchers trained in the wet lab who are eager to take advantage of growing datasets, falling costs, and novel assays that present new opportunities for discovery. It is currently much easier to find guidance for implementing these techniques than for reporting their use, leaving biologists to guess which details and files are relevant. In this essay, we review existing literature on the topic, summarize common tips, and link to additional resources for training. Following this overview, we then provide a set of recommendations for sharing code, with an eye toward guiding those who are comparatively new to applying open science principles to their computational work. Taken together, we provide a guide for biologists who seek to follow code sharing best practices but are unsure where to start.
Affiliation(s)
- Richard J. Abdill
  - Section of Genetic Medicine, Department of Medicine, University of Chicago, Chicago, Illinois, United States of America
- Emma Talarico
  - Department of Biology, University of British Columbia—Okanagan Campus, Kelowna, British Columbia, Canada
- Laura Grieneisen
  - Department of Biology, University of British Columbia—Okanagan Campus, Kelowna, British Columbia, Canada
  - Okanagan Institute for Biodiversity, Resilience, and Ecosystem Services, University of British Columbia—Okanagan Campus, Kelowna, British Columbia, Canada
7
Arabi-Jeshvaghani F, Javadi-Zarnaghi F, Löchel HF, Martin R, Heider D. LAMPPrimerBank, a manually curated database of experimentally validated loop-mediated isothermal amplification primers for detection of respiratory pathogens. Infection 2023; 51:1809-1818. [PMID: 37828369 DOI: 10.1007/s15010-023-02100-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/21/2023] [Accepted: 09/13/2023] [Indexed: 10/14/2023]
Abstract
PURPOSE AND METHODS: The emergence of coronavirus disease 2019 (COVID-19) has once again affirmed the significant threat of respiratory infections to global public health and the utmost importance of prompt diagnosis in managing and mitigating any pandemic. The nucleic acid amplification test (NAAT) is the primary detection method for most pathogens. Loop-mediated isothermal amplification (LAMP) is a rapid, simple, sensitive, and specific epitome of isothermal NAAT performed using a set of four to six primers. Primer design is a fundamental step in LAMP assays, with several complexities and experimental screening requirements. To address this challenge, an online database is presented here. Its workflow comprises three steps: literature aggregation, data curation, and database and website implementation. RESULTS: LAMPPrimerBank (https://lampprimerbank.mathematik.uni-marburg.de) is a manually curated database dedicated to experimentally validated LAMP primers, their peculiarities of assays, and accompanying literature, with a primary emphasis on respiratory pathogens. LAMPPrimerBank, with its user-friendly web interface and an open application programming interface, enables the accelerated and facile exploration, comparison, and exportation of LAMP primer sequences and their respective information from the massively scattered literature. LAMPPrimerBank currently comprises LAMP primers for diagnosing viral, bacterial, and fungal respiratory pathogens. Additionally, to address the challenge of false-positive results generated by nonspecific amplifications, LAMPPrimerBank computationally predicted and visualized the sizes of LAMP products for recorded primer sets in the database. CONCLUSION: LAMPPrimerBank, as a pioneering database in the rapidly expanding field of isothermal NAAT, endeavors to confront the two challenges of LAMP: primer design and discrimination of false-positive results.
Affiliation(s)
- Fatemeh Arabi-Jeshvaghani
  - Department of Cell and Molecular Biology & Microbiology, Faculty of Biological Science and Technology, University of Isfahan, Isfahan, Iran
- Fatemeh Javadi-Zarnaghi
  - Department of Cell and Molecular Biology & Microbiology, Faculty of Biological Science and Technology, University of Isfahan, Isfahan, Iran
- Hannah Franziska Löchel
  - Department of Data Science in Biomedicine, Faculty of Mathematics and Computer Science, University of Marburg, Marburg, Germany
- Roman Martin
  - Department of Data Science in Biomedicine, Faculty of Mathematics and Computer Science, University of Marburg, Marburg, Germany
- Dominik Heider
  - Department of Data Science in Biomedicine, Faculty of Mathematics and Computer Science, University of Marburg, Marburg, Germany
8
Lubiana T, Lopes R, Medeiros P, Silva JC, Goncalves ANA, Maracaja-Coutinho V, Nakaya HI. Ten quick tips for harnessing the power of ChatGPT in computational biology. PLoS Comput Biol 2023; 19:e1011319. [PMID: 37561669 PMCID: PMC10414555 DOI: 10.1371/journal.pcbi.1011319] [Citation(s) in RCA: 19] [Impact Index Per Article: 9.5] [Indexed: 08/12/2023] Open
Affiliation(s)
- Tiago Lubiana
  - School of Pharmaceutical Sciences, University of São Paulo, São Paulo, Brazil
- Rafael Lopes
  - Department of Epidemiology of Microbial Diseases and Public Health Modeling Unit, Yale School of Public Health, New Haven, Connecticut, United States of America
- Juan Carlo Silva
  - School of Pharmaceutical Sciences, University of São Paulo, São Paulo, Brazil
- Vinicius Maracaja-Coutinho
  - Advanced Center for Chronic Diseases, Universidad de Chile, Santiago, Chile
  - Centro de Modelamiento Molecular, Biofísica y Bioinformática—CM2B2, Facultad de Ciencias Químicas y Farmacéuticas, Universidad de Chile, Santiago, Chile
  - ANID Anillo ACT210004 SYSTEMIX, Rancagua, Chile
  - Anillo Inflammation in HIV/AIDS—InflammAIDS, Santiago, Chile
  - Beagle Bioinformatics, São Paulo, Brasil & Santiago, Chile
- Helder I. Nakaya
  - School of Pharmaceutical Sciences, University of São Paulo, São Paulo, Brazil
  - Hospital Israelita Albert Einstein, São Paulo, Brazil
9
Mazein A, Acencio ML, Balaur I, Rougny A, Welter D, Niarakis A, Ramirez Ardila D, Dogrusoz U, Gawron P, Satagopam V, Gu W, Kremer A, Schneider R, Ostaszewski M. A guide for developing comprehensive systems biology maps of disease mechanisms: planning, construction and maintenance. Frontiers in Bioinformatics 2023; 3:1197310. [PMID: 37426048 PMCID: PMC10325725 DOI: 10.3389/fbinf.2023.1197310] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/30/2023] [Accepted: 06/09/2023] [Indexed: 07/11/2023] Open
Abstract
As a conceptual model of disease mechanisms, a disease map integrates available knowledge and is applied for data interpretation, predictions and hypothesis generation. It is possible to model disease mechanisms on different levels of granularity and adjust the approach to the goals of a particular project. This rich environment, together with requirements for high-quality network reconstruction, makes it challenging for new curators and groups to be quickly introduced to the development methods. In this review, we offer a step-by-step guide for developing a disease map within its mainstream pipeline that involves using the CellDesigner tool for creating and editing diagrams and the MINERVA Platform for online visualisation and exploration. We also describe how the Neo4j graph database environment can be used for efficiently managing and querying such a resource. For assessing interoperability and reproducibility, we apply FAIR principles.
Affiliation(s)
- Alexander Mazein
  - Luxembourg Centre for Systems Biomedicine (LCSB), University of Luxembourg, Esch-sur-Alzette, Luxembourg
- Marcio Luis Acencio
  - Luxembourg Centre for Systems Biomedicine (LCSB), University of Luxembourg, Esch-sur-Alzette, Luxembourg
- Irina Balaur
  - Luxembourg Centre for Systems Biomedicine (LCSB), University of Luxembourg, Esch-sur-Alzette, Luxembourg
- Danielle Welter
  - Luxembourg Centre for Systems Biomedicine (LCSB), University of Luxembourg, Esch-sur-Alzette, Luxembourg
- Anna Niarakis
  - Université Paris-Saclay, Laboratoire Européen de Recherche Pour la Polyarthrite Rhumatoïde–Genhotel, University Evry, Evry, France
  - Lifeware Group, Inria Saclay-Ile de France, Palaiseau, France
- Diana Ramirez Ardila
  - ITTM Information Technology for Translational Medicine, Esch-sur-Alzette, Luxemburg
- Ugur Dogrusoz
  - Computer Engineering Department, Bilkent University, Ankara, Türkiye
- Piotr Gawron
  - Luxembourg Centre for Systems Biomedicine (LCSB), University of Luxembourg, Esch-sur-Alzette, Luxembourg
- Venkata Satagopam
  - Luxembourg Centre for Systems Biomedicine (LCSB), University of Luxembourg, Esch-sur-Alzette, Luxembourg
  - ELIXIR Luxembourg, Belvaux, Luxembourg
- Wei Gu
  - Luxembourg Centre for Systems Biomedicine (LCSB), University of Luxembourg, Esch-sur-Alzette, Luxembourg
  - ELIXIR Luxembourg, Belvaux, Luxembourg
- Andreas Kremer
  - ITTM Information Technology for Translational Medicine, Esch-sur-Alzette, Luxemburg
- Reinhard Schneider
  - Luxembourg Centre for Systems Biomedicine (LCSB), University of Luxembourg, Esch-sur-Alzette, Luxembourg
  - ELIXIR Luxembourg, Belvaux, Luxembourg
- Marek Ostaszewski
  - Luxembourg Centre for Systems Biomedicine (LCSB), University of Luxembourg, Esch-sur-Alzette, Luxembourg
  - ELIXIR Luxembourg, Belvaux, Luxembourg
10
Gavgani HN, Grotewold E, Gray J. Methodology for Constructing a Knowledgebase for Plant Gene Regulation Information. Methods Mol Biol 2023; 2698:277-300. [PMID: 37682481 DOI: 10.1007/978-1-0716-3354-0_17] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/09/2023]
Abstract
The amount of biological data is growing at a rapid pace as many high-throughput omics technologies and data pipelines are developed. This is resulting in the growth of databases for DNA and protein sequences, gene expression, protein accumulation, structural, and localization information. The diversity and multi-omics nature of such bioinformatic data require well-designed databases for flexible organization and presentation. Besides general-purpose online bioinformatic databases, users need narrowly focused online databases to quickly access a meaningful collection of related data for their research. Here, we describe the methodology used to implement a plant gene regulatory knowledgebase, with data, query, and tool features, as well as the ability to expand to accommodate future datasets. We exemplify this methodology for the GRASSIUS knowledgebase, but it is applicable to developing and updating similar plant gene regulatory knowledgebases. GRASSIUS organizes and presents gene regulatory data from grass species with a central focus on maize (Zea mays). The main classes of data presented include not only the families of transcription factors (TFs) and co-regulators (CRs) but also protein-DNA interaction data, where available.
Affiliation(s)
- Hadi Nayebi Gavgani
  - Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, MI, USA
  - Dandelions Therapeutics Inc., San Francisco, CA, USA
- Erich Grotewold
  - Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, MI, USA
- John Gray
  - Department of Biological Sciences, University of Toledo, Toledo, OH, USA
11
Abstract
Applying computational statistics or machine learning methods to data is a key component of many scientific studies, in any field, but alone might not be sufficient to generate robust and reliable outcomes and results. Before applying any discovery method, preprocessing steps are necessary to prepare the data for the computational analysis. In this framework, data cleaning and feature engineering are key pillars of any scientific study involving data analysis and should be adequately designed and performed from the first phases of the project. We call "feature" a variable describing a particular trait of a person or an observation, recorded usually as a column in a dataset. Even if pivotal, these data cleaning and feature engineering steps sometimes are done poorly or inefficiently, especially by beginners and inexperienced researchers. For this reason, we propose here our quick tips for data cleaning and feature engineering on how to carry out these important preprocessing steps correctly, avoiding common mistakes and pitfalls. Although we designed these guidelines with bioinformatics and health informatics scenarios in mind, we believe they can more generally be applied to any scientific area. We therefore target these guidelines to any researcher or practitioner wanting to perform data cleaning or feature engineering. We believe our simple recommendations can help researchers and scholars perform better computational analyses that can lead, in turn, to more solid outcomes and more reliable discoveries.
Affiliation(s)
- Davide Chicco
  - Institute of Health Policy Management and Evaluation, University of Toronto, Toronto, Ontario, Canada
- Luca Oneto
  - Dipartimento di Informatica Bioingegneria Robotica e Ingegneria dei Sistemi, Università di Genova, Genoa, Italy
  - ZenaByte S.r.l., Genoa, Italy
- Erica Tavazzi
  - Dipartimento di Ingegneria dell’Informazione, Università di Padova, Padua, Italy
12
Yoon A, Kim J, Donaldson DR. Big data curation framework: Curation actions and challenges. J Inf Sci 2022. [DOI: 10.1177/01655515221133528] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/13/2022]
Abstract
Big data curation represents an emerging topic of inquiry but is still in an early phase along its adoption curve. The term big data itself is a nebulous concept, and the differences between small data curation and big data curation are nuanced. The goal of this research is to provide a theoretical framework that identifies big data curation actions and associated curation challenges. This study is based on the practices of big data research and data curation, identified by systematically examining the literature. The outcome of the study includes the big data curation framework, which provides an overview of curation activities and the concerns that are essential to performing such activities. The study also provides practical implications for libraries, archives, data repositories and other information organisations that are concerned with the issue of big data curation, as big data presents a multidimensional array of exigencies in relation to the mission of those organisations.
Affiliation(s)
- Ayoung Yoon
  - Department of Library and Information Science, School of Informatics and Computing, Indiana University–Purdue University Indianapolis (IUPUI), USA
- Jihyun Kim
  - Department of Library & Information Science, Ewha Womans University, South Korea
- Devan Ray Donaldson
  - Department of Information and Library Science, Luddy School of Informatics, Computing, and Engineering, Indiana University Bloomington, USA
13
Matentzoglu N, Goutte-Gattat D, Tan SZK, Balhoff JP, Carbon S, Caron AR, Duncan WD, Flack JE, Haendel M, Harris NL, Hogan WR, Hoyt CT, Jackson RC, Kim H, Kir H, Larralde M, McMurry JA, Overton JA, Peters B, Pilgrim C, Stefancsik R, Robb SMC, Toro S, Vasilevsky NA, Walls R, Mungall CJ, Osumi-Sutherland D. Ontology Development Kit: a toolkit for building, maintaining and standardizing biomedical ontologies. Database (Oxford) 2022; 2022:6754192. [PMID: 36208225 PMCID: PMC9547537 DOI: 10.1093/database/baac087] [Citation(s) in RCA: 19] [Impact Index Per Article: 6.3] [Received: 07/07/2022] [Revised: 08/19/2022] [Accepted: 09/23/2022] [Indexed: 11/21/2022]
Abstract
Similar to managing software packages, managing the ontology life cycle involves multiple complex workflows such as preparing releases, continuous quality control checking and dependency management. To manage these processes, a diverse set of tools is required, from command-line utilities to powerful ontology-engineering environments. Particularly in the biomedical domain, which has developed a set of highly diverse yet inter-dependent ontologies, standardizing release practices and metadata and establishing shared quality standards are crucial to enable interoperability. The Ontology Development Kit (ODK) provides a set of standardized, customizable and automatically executable workflows, and packages all required tooling in a single Docker image. In this paper, we provide an overview of how the ODK works, show how it is used in practice and describe how we envision it driving standardization efforts in our community. Database URL: https://github.com/INCATools/ontology-development-kit.
Collapse
Affiliation(s)
- Damien Goutte-Gattat
- Department of Physiology, Development and Neuroscience, University of Cambridge, Downing Street, Cambridge, CB2 3DY, UK
- Shawn Zheng Kai Tan
- Samples Phenotypes and Ontologies Team (SPOT), European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire, CB10 1SD, UK
- James P Balhoff
- RENCI, University of North Carolina, Chapel Hill, NC 27517, USA
- Seth Carbon
- Berkeley Bioinformatics Open-source Projects (BBOP), Lawrence Berkeley National Laboratory (LBNL), 1 Cyclotron Road, Mailstop 977-0257, Berkeley, CA 94720, USA
- Anita R Caron
- Samples Phenotypes and Ontologies Team (SPOT), European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire, CB10 1SD, UK
- William D Duncan
- Berkeley Bioinformatics Open-source Projects (BBOP), Lawrence Berkeley National Laboratory (LBNL), 1 Cyclotron Road, Mailstop 977-0257, Berkeley, CA 94720, USA; College of Dentistry; Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, William D. Duncan: 1395 Center Dr, Gainesville, William R. Hogan: 1600 SW Archer Rd, Gainesville, FL 32610, USA
- Joe E Flack
- School of Medicine, Johns Hopkins University, 733 N Broadway, Baltimore, MD 21205, USA
- Melissa Haendel
- University of Colorado Anschutz Medical Campus, 13001 E 17th Pl, Aurora, CO 80045, USA
- Nomi L Harris
- Berkeley Bioinformatics Open-source Projects (BBOP), Lawrence Berkeley National Laboratory (LBNL), 1 Cyclotron Road, Mailstop 977-0257, Berkeley, CA 94720, USA
- William R Hogan
- College of Dentistry; Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, William D. Duncan: 1395 Center Dr, Gainesville, William R. Hogan: 1600 SW Archer Rd, Gainesville, FL 32610, USA
- Charles Tapley Hoyt
- Laboratory of Systems Pharmacology, Harvard Medical School, 200 Longwood Avenue Armenise Building Room 109, Boston, MA 02115, USA
- Rebecca C Jackson
- Bend Informatics LLC, 5305 River Rd North, Ste B, Keizer, OR 97303, USA
- Huseyin Kir
- Samples Phenotypes and Ontologies Team (SPOT), European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire, CB10 1SD, UK
- Martin Larralde
- Structural and Computational Biology Unit, European Molecular Biology Laboratory, Meyerhofstraße 1, Heidelberg 69117, Germany
- Julie A McMurry
- University of Colorado Anschutz Medical Campus, 13001 E 17th Pl, Aurora, CO 80045, USA
- Bjoern Peters
- Institute for Allergy & Immunology, La Jolla Institute for Immunology, 9420 Athena Circle, La Jolla, CA 92037, USA
- Clare Pilgrim
- Department of Physiology, Development and Neuroscience, University of Cambridge, Downing Street, Cambridge, CB2 3DY, UK
- Ray Stefancsik
- Samples Phenotypes and Ontologies Team (SPOT), European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire, CB10 1SD, UK
- Sofia MC Robb
- Stowers Institute for Medical Research, 1000 E. 50th St., Kansas City, MO 64110, USA
- Sabrina Toro
- University of Colorado Anschutz Medical Campus, 13001 E 17th Pl, Aurora, CO 80045, USA
- Nicole A Vasilevsky
- University of Colorado Anschutz Medical Campus, 13001 E 17th Pl, Aurora, CO 80045, USA
- Ramona Walls
- Critical Path Institute, 1730 E River Road, Tucson, AZ 85718, USA
- Christopher J Mungall
- Berkeley Bioinformatics Open-source Projects (BBOP), Lawrence Berkeley National Laboratory (LBNL), 1 Cyclotron Road, Mailstop 977-0257, Berkeley, CA 94720, USA
|
14
|
Hemedan AA, Niarakis A, Schneider R, Ostaszewski M. Boolean modelling as a logic-based dynamic approach in systems medicine. Comput Struct Biotechnol J 2022; 20:3161-3172. [PMID: 35782730 PMCID: PMC9234349 DOI: 10.1016/j.csbj.2022.06.035] [Citation(s) in RCA: 25] [Impact Index Per Article: 8.3] [Received: 03/15/2022] [Revised: 06/14/2022] [Accepted: 06/14/2022] [Indexed: 11/17/2022] Open
Abstract
Molecular mechanisms of health and disease are often represented as systems biology diagrams, and the coverage of such representation constantly increases. These static diagrams can be transformed into dynamic models, allowing for in silico simulations and predictions. Boolean modelling is an approach based on an abstract representation of the system. It emphasises the qualitative modelling of biological systems in which each biomolecule can take two possible values: zero for absent or inactive, one for present or active. Because of this approximation, Boolean modelling is applicable to large diagrams, allowing their dynamic properties to be captured. We review Boolean models of disease mechanisms and compare a range of methods and tools used for analysis processes. We explain the methodology of Boolean analysis focusing on its application in disease modelling. Finally, we discuss its practical application in analysing signal transduction and gene regulatory pathways in health and disease.
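To make the two-valued abstraction concrete, here is a minimal Python sketch of a Boolean network. It is an illustration only and is not taken from the reviewed paper: the three nodes `A`, `B`, `C`, their update rules, and the function names are hypothetical, and synchronous updating (all nodes updated at once) is assumed as one of several possible update schemes.

```python
from itertools import product

# Hypothetical three-node network (not from the paper):
# A is a constant input, A activates B, C inhibits B, B activates C.
RULES = {
    "A": lambda s: s["A"],
    "B": lambda s: s["A"] and not s["C"],
    "C": lambda s: s["B"],
}


def step(state):
    """Synchronous update: every node is recomputed from the previous state."""
    return {node: bool(rule(state)) for node, rule in RULES.items()}


def trajectory(state, max_steps=20):
    """Iterate until a state repeats; return visited states and the repeated one."""
    seen = []
    while state not in seen and len(seen) < max_steps:
        seen.append(state)
        state = step(state)
    return seen, state


def fixed_points():
    """Enumerate all 2^n states and keep those that map to themselves."""
    nodes = sorted(RULES)
    found = []
    for values in product([False, True], repeat=len(nodes)):
        state = dict(zip(nodes, values))
        if step(state) == state:
            found.append(state)
    return found


if __name__ == "__main__":
    path, repeat = trajectory({"A": True, "B": False, "C": False})
    print(len(path), "states visited before returning to", repeat)
    print("fixed points:", fixed_points())
```

With the input `A` switched on, the negative feedback between `B` and `C` in this toy network produces a cycle rather than a steady state; with `A` off, the only fixed point is the all-inactive state. Exhaustive enumeration as in `fixed_points` grows as 2^n, so it is practical only for small networks.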
Affiliation(s)
- Ahmed Abdelmonem Hemedan
- Luxembourg Centre for Systems Biomedicine, University of Luxembourg, Esch-sur-Alzette, Luxembourg
- Anna Niarakis
- Université Paris-Saclay, Laboratoire Européen de Recherche pour la Polyarthrite rhumatoïde – Genhotel, Univ Evry, Evry, France
- Lifeware Group, Inria, Saclay-île de France, 91120 Palaiseau, France
- Reinhard Schneider
- Luxembourg Centre for Systems Biomedicine, University of Luxembourg, Esch-sur-Alzette, Luxembourg
- Marek Ostaszewski
- Luxembourg Centre for Systems Biomedicine, University of Luxembourg, Esch-sur-Alzette, Luxembourg
|
15
|
Fitzpatrick R, Stefan MI. Validation Through Collaboration: Encouraging Team Efforts to Ensure Internal and External Validity of Computational Models of Biochemical Pathways. Neuroinformatics 2022; 20:277-284. [PMID: 35543917 PMCID: PMC9537119 DOI: 10.1007/s12021-022-09584-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 03/17/2022] [Indexed: 01/09/2023]
Abstract
Computational modelling of biochemical reaction pathways is an increasingly important part of neuroscience research. In order to be useful, computational models need to be valid in two senses: First, they need to be consistent with experimental data and able to make testable predictions (external validity). Second, they need to be internally consistent and independently reproducible (internal validity). Here, we discuss both types of validity and provide a brief overview of tools and technologies used to ensure they are met. We also suggest the introduction of new collaborative technologies to ensure model validity: an incentivised experimental database for external validity and reproducibility audits for internal validity. Both rely on FAIR principles and on collaborative science practices.
Affiliation(s)
- Richard Fitzpatrick
- Centre for Discovery Brain Sciences, University of Edinburgh, Edinburgh, UK; School of Biological Sciences, University of Edinburgh, Edinburgh, UK
- Melanie I. Stefan
- Centre for Discovery Brain Sciences, University of Edinburgh, Edinburgh, UK; ZJU-UoE Institute, Zhejiang University, Haining, China
|
16
|
Hatos A, Quaglia F, Piovesan D, Tosatto SCE. APICURON: a database to credit and acknowledge the work of biocurators. Database (Oxford) 2021; 2021:baab019. [PMID: 33882120 PMCID: PMC8060004 DOI: 10.1093/database/baab019] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 02/02/2021] [Revised: 03/12/2021] [Accepted: 04/12/2021] [Indexed: 11/14/2022]
Abstract
APICURON is an open and freely accessible resource that tracks and credits the work of biocurators across multiple participating knowledgebases. Biocuration is essential to extract knowledge from research data and make it available in a structured and standardized way to the scientific community. However, processing biological data, mainly from literature, requires a huge effort that is difficult to attribute and quantify. APICURON collects biocuration events from third-party resources and aggregates this information, spotlighting biocurator contributions. APICURON promotes biocurator engagement by implementing gamification concepts like badges, medals and leaderboards, and at the same time provides a monitoring service for registered resources and for biocurators themselves. APICURON adopts a data model that is flexible enough to represent and track the majority of biocuration activities. Biocurators are identified through their Open Researcher and Contributor ID. The definition of curation events, scoring systems and rules for assigning badges and medals are resource-specific and easily customizable. Registered resources can transfer curation activities on the fly through a secure and robust Application Programming Interface (API). Here, we show how simple and effective it is to connect a resource to APICURON, describing the DisProt database of intrinsically disordered proteins as a use case. We believe APICURON will provide biological knowledgebases with a service to recognize and credit the effort of their biocurators, monitor their activity and promote curator engagement. Database URL: https://apicuron.org.
Affiliation(s)
- András Hatos
- Department of Biomedical Sciences, University of Padua, Via Ugo Bassi 58/B, Padova 35131, Italy
- Federica Quaglia
- Department of Biomedical Sciences, University of Padua, Via Ugo Bassi 58/B, Padova 35131, Italy
- Damiano Piovesan
- Department of Biomedical Sciences, University of Padua, Via Ugo Bassi 58/B, Padova 35131, Italy
- Silvio C E Tosatto
- Department of Biomedical Sciences, University of Padua, Via Ugo Bassi 58/B, Padova 35131, Italy
|
17
|
Langenstein M, Hermjakob H, Llinares MB. A decoupled, modular and scriptable architecture for tools to curate data platforms. Bioinformatics 2021; 37:3693-3694. [PMID: 33830216 PMCID: PMC8545344 DOI: 10.1093/bioinformatics/btab233] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/06/2020] [Revised: 03/12/2021] [Accepted: 04/07/2021] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION: Curation is essential for any data platform to maintain the quality of the data it provides. Today, more effective curation tools are often vital to keep up with the rapid growth of existing, maintenance-requiring databases and the amount of newly published information that needs to be surveyed. However, curation interfaces are often complex and challenging to develop further. Therefore, opportunities for experimentation with curation workflows may be lost due to a lack of development resources or a reluctance to change sensitive production systems. RESULTS: We propose a decoupled, modular and scriptable architecture to build new curation tools on top of existing platforms. Our architecture treats the existing platform as a black box. It therefore only relies on its public application programming interfaces (APIs) and web application instead of requiring any changes to the existing infrastructure. As a case study, we have implemented this architecture in cmd-iaso, a curation tool for the identifiers.org registry. With cmd-iaso, we also show that the proposed design's flexibility can be utilised to streamline and enhance the curator's workflow with the platform's existing web interface. AVAILABILITY: The cmd-iaso curation tool is implemented in Python 3.7+ and supports Linux, macOS and Windows. Its source code and documentation are freely available from https://github.com/identifiers-org/cmd-iaso. It is also published as a Docker container at https://hub.docker.com/r/identifiersorg/cmd-iaso. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Momo Langenstein
- European Bioinformatics Institute (EMBL-EBI), European Molecular Biology Laboratory, Wellcome Genome Campus, Cambridge, UK CB10 1SD
- Henning Hermjakob
- European Bioinformatics Institute (EMBL-EBI), European Molecular Biology Laboratory, Wellcome Genome Campus, Cambridge, UK CB10 1SD
- Manuel Bernal Llinares
- European Bioinformatics Institute (EMBL-EBI), European Molecular Biology Laboratory, Wellcome Genome Campus, Cambridge, UK CB10 1SD
|
18
|
Thessen AE, Bogdan P, Patterson DJ, Casey TM, Hinojo-Hinojo C, de Lange O, Haendel MA. From Reductionism to Reintegration: Solving society's most pressing problems requires building bridges between data types across the life sciences. PLoS Biol 2021; 19:e3001129. [PMID: 33770077 PMCID: PMC7997011 DOI: 10.1371/journal.pbio.3001129] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Indexed: 11/18/2022] Open
Abstract
Decades of reductionist approaches in biology have achieved spectacular progress, but the proliferation of subdisciplines, each with its own technical and social practices regarding data, impedes the growth of the multidisciplinary and interdisciplinary approaches now needed to address pressing societal challenges. Data integration is key to a reintegrated biology able to address global issues such as climate change, biodiversity loss, and sustainable ecosystem management. We identify major challenges to data integration and present a vision for a "Data as a Service"-oriented architecture to promote reuse of data for discovery. The proposed architecture includes standards development, new tools and services, and strategies for career-development and sustainability.
Affiliation(s)
- Anne E. Thessen
- Department of Environmental and Molecular Toxicology, Oregon State University, Corvallis, Oregon, United States of America
- Paul Bogdan
- Ming Hsieh Department of Electrical and Computer Engineering, Viterbi School of Engineering, University of Southern California, Los Angeles, California, United States of America
- Theresa M. Casey
- Department of Animal Sciences, Purdue University, West Lafayette, Indiana, United States of America
- César Hinojo-Hinojo
- Department of Earth System Science, University of California, Irvine, California, United States of America
- Orlando de Lange
- Department of Electrical Engineering, University of Washington, Seattle, Washington, United States of America
- Melissa A. Haendel
- Department of Environmental and Molecular Toxicology, Oregon State University, Corvallis, Oregon, United States of America
|
19
|
Bastian FB, Roux J, Niknejad A, Comte A, Fonseca Costa SS, de Farias TM, Moretti S, Parmentier G, de Laval VR, Rosikiewicz M, Wollbrett J, Echchiki A, Escoriza A, Gharib WH, Gonzales-Porta M, Jarosz Y, Laurenczy B, Moret P, Person E, Roelli P, Sanjeev K, Seppey M, Robinson-Rechavi M. The Bgee suite: integrated curated expression atlas and comparative transcriptomics in animals. Nucleic Acids Res 2021; 49:D831-D847. [PMID: 33037820 PMCID: PMC7778977 DOI: 10.1093/nar/gkaa793] [Citation(s) in RCA: 115] [Impact Index Per Article: 28.8] [Received: 07/14/2020] [Revised: 08/24/2020] [Accepted: 09/15/2020] [Indexed: 01/24/2023] Open
Abstract
Bgee is a database to retrieve and compare gene expression patterns in multiple animal species, produced by integrating multiple data types (RNA-Seq, Affymetrix, in situ hybridization, and EST data). It is based exclusively on curated healthy wild-type expression data (e.g., no gene knock-out, no treatment, no disease), to provide a comparable reference of normal gene expression. Curation includes very large datasets such as GTEx (re-annotation of samples as ‘healthy’ or not) as well as many small ones. Data are integrated and made comparable between species thanks to consistent data annotation and processing, and to calls of presence/absence of expression, along with expression scores. As a result, Bgee is capable of detecting the conditions of expression of any single gene, accommodating any data type and species. Bgee provides several tools for analyses, allowing, e.g., automated comparisons of gene expression patterns within and between species, retrieval of the preferred conditions of expression of any gene, or enrichment analyses of conditions with expression of sets of genes. Bgee release 14.1 includes 29 animal species, and is available at https://bgee.org/ and through its Bioconductor R package BgeeDB.
Affiliation(s)
- Frederic B Bastian
- Department of Ecology and Evolution, University of Lausanne, 1015 Lausanne, Switzerland; SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
- Julien Roux
- Department of Ecology and Evolution, University of Lausanne, 1015 Lausanne, Switzerland; SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
- Anne Niknejad
- Department of Ecology and Evolution, University of Lausanne, 1015 Lausanne, Switzerland; SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
- Aurélie Comte
- Department of Ecology and Evolution, University of Lausanne, 1015 Lausanne, Switzerland; SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
- Sara S Fonseca Costa
- Department of Ecology and Evolution, University of Lausanne, 1015 Lausanne, Switzerland; SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
- Tarcisio Mendes de Farias
- Department of Ecology and Evolution, University of Lausanne, 1015 Lausanne, Switzerland; SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
- Sébastien Moretti
- Department of Ecology and Evolution, University of Lausanne, 1015 Lausanne, Switzerland; SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
- Gilles Parmentier
- Department of Ecology and Evolution, University of Lausanne, 1015 Lausanne, Switzerland; SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
- Valentine Rech de Laval
- Department of Ecology and Evolution, University of Lausanne, 1015 Lausanne, Switzerland; SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
- Marta Rosikiewicz
- Department of Ecology and Evolution, University of Lausanne, 1015 Lausanne, Switzerland; SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
- Julien Wollbrett
- Department of Ecology and Evolution, University of Lausanne, 1015 Lausanne, Switzerland; SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
- Amina Echchiki
- Department of Ecology and Evolution, University of Lausanne, 1015 Lausanne, Switzerland; SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
- Angélique Escoriza
- Department of Ecology and Evolution, University of Lausanne, 1015 Lausanne, Switzerland; SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
- Walid H Gharib
- Department of Ecology and Evolution, University of Lausanne, 1015 Lausanne, Switzerland; SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
- Mar Gonzales-Porta
- Department of Ecology and Evolution, University of Lausanne, 1015 Lausanne, Switzerland; SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
- Yohan Jarosz
- Department of Ecology and Evolution, University of Lausanne, 1015 Lausanne, Switzerland; SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
- Balazs Laurenczy
- Department of Ecology and Evolution, University of Lausanne, 1015 Lausanne, Switzerland; SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
- Philippe Moret
- Department of Ecology and Evolution, University of Lausanne, 1015 Lausanne, Switzerland; SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
- Emilie Person
- Department of Ecology and Evolution, University of Lausanne, 1015 Lausanne, Switzerland; SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
- Patrick Roelli
- Department of Ecology and Evolution, University of Lausanne, 1015 Lausanne, Switzerland; SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
- Komal Sanjeev
- Department of Ecology and Evolution, University of Lausanne, 1015 Lausanne, Switzerland; SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
- Mathieu Seppey
- Department of Ecology and Evolution, University of Lausanne, 1015 Lausanne, Switzerland; SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
- Marc Robinson-Rechavi
- Department of Ecology and Evolution, University of Lausanne, 1015 Lausanne, Switzerland; SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
|
20
|
Carey MA, Dräger A, Beber ME, Papin JA, Yurkovich JT. Community standards to facilitate development and address challenges in metabolic modeling. Mol Syst Biol 2020; 16:e9235. [PMID: 32845080 PMCID: PMC8411906 DOI: 10.15252/msb.20199235] [Citation(s) in RCA: 31] [Impact Index Per Article: 6.2] [Indexed: 01/09/2023] Open
Abstract
Standardization of data and models facilitates effective communication, especially in computational systems biology. However, both the development and consistent use of standards and resources remain challenging. As a result, the amount, quality, and format of the information contained within systems biology models are not consistent and therefore present challenges for widespread use and communication. Here, we focused on these standards, resources, and challenges in the field of constraint-based metabolic modeling by conducting a community-wide survey. We used this feedback to (i) outline the major challenges that our field faces and to propose solutions and (ii) identify a set of features that defines what a "gold standard" metabolic network reconstruction looks like concerning content, annotation, and simulation capabilities. We anticipate that this community-driven outline will help the long-term development of community-inspired resources as well as produce high-quality, accessible models within our field. More broadly, we hope that these efforts can serve as blueprints for other computational modeling communities to ensure the continued development of both practical, usable standards and reproducible, knowledge-rich models.
Affiliation(s)
- Maureen A Carey
- Division of Infectious Diseases and International Health, Department of Medicine, University of Virginia, Charlottesville, VA, USA
- Andreas Dräger
- Computational Systems Biology of Infection and Antimicrobial-Resistant Pathogens, Institute for Biomedical Informatics (IBMI), University of Tübingen, Tübingen, Germany
- Department of Computer Science, University of Tübingen, Tübingen, Germany
- German Center for Infection Research (DZIF), partner site Tübingen, Tübingen, Germany
- Moritz E Beber
- Novo Nordisk Foundation Center for Biosustainability, Technical University of Denmark, Kemitorvet, Denmark
- Jason A Papin
- Division of Infectious Diseases and International Health, Department of Medicine, University of Virginia, Charlottesville, VA, USA
- Department of Biomedical Engineering, University of Virginia, Charlottesville, VA, USA
|