1
|
Duo H, Li Y, Lan Y, Tao J, Yang Q, Xiao Y, Sun J, Li L, Nie X, Zhang X, Liang G, Liu M, Hao Y, Li B. Systematic evaluation with practical guidelines for single-cell and spatially resolved transcriptomics data simulation under multiple scenarios. Genome Biol 2024; 25:145. [PMID: 38831386 PMCID: PMC11149245 DOI: 10.1186/s13059-024-03290-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/27/2023] [Accepted: 05/28/2024] [Indexed: 06/05/2024] Open
Abstract
BACKGROUND Single-cell RNA sequencing (scRNA-seq) and spatially resolved transcriptomics (SRT) have led to groundbreaking advancements in life sciences. To develop bioinformatics tools for scRNA-seq and SRT data and perform unbiased benchmarks, data simulation has been widely adopted by providing explicit ground truth and generating customized datasets. However, the performance of simulation methods under multiple scenarios has not been comprehensively assessed, making it challenging to choose suitable methods without practical guidelines. RESULTS We systematically evaluated 49 simulation methods developed for scRNA-seq and/or SRT data in terms of accuracy, functionality, scalability, and usability using 152 reference datasets derived from 24 platforms. SRTsim, scDesign3, ZINB-WaVE, and scDesign2 have the best accuracy performance across various platforms. Unexpectedly, some methods tailored to scRNA-seq data have potential compatibility for simulating SRT data. Lun, SPARSim, and scDesign3-tree outperform other methods under corresponding simulation scenarios. Phenopath, Lun, Simple, and MFA yield high scalability scores but they cannot generate realistic simulated data. Users should consider the trade-offs between method accuracy and scalability (or functionality) when making decisions. Additionally, execution errors are mainly caused by failed parameter estimations and appearance of missing or infinite values in calculations. We provide practical guidelines for method selection, a standard pipeline Simpipe ( https://github.com/duohongrui/simpipe ; https://doi.org/10.5281/zenodo.11178409 ), and an online tool Simsite ( https://www.ciblab.net/software/simshiny/ ) for data simulation. CONCLUSIONS No method performs best on all criteria, thus a good-yet-not-the-best method is recommended if it solves problems effectively and reasonably. Our comprehensive work provides crucial insights for developers on modeling gene expression data and fosters the simulation process for users.
Collapse
Affiliation(s)
- Hongrui Duo
- College of Life Sciences, Chongqing Normal University, Chongqing, 401331, People's Republic of China
| | - Yinghong Li
- Chongqing Key Laboratory of Big Data for Bio Intelligence, Chongqing University of Posts and Telecommunications, Chongqing, 400065, People's Republic of China
| | - Yang Lan
- Institute of Pathology and Southwest Cancer Center, Southwest Hospital, Army Medical University, Chongqing, 400038, People's Republic of China
| | - Jingxin Tao
- College of Life Sciences, Chongqing Normal University, Chongqing, 401331, People's Republic of China
| | - Qingxia Yang
- Zhejiang Provincial Key Laboratory of Precision Diagnosis and Therapy for Major Gynecological Diseases, Women's Hospital, Zhejiang University School of Medicine, Hangzhou, 310058, People's Republic of China
| | - Yingxue Xiao
- College of Life Sciences, Chongqing Normal University, Chongqing, 401331, People's Republic of China
| | - Jing Sun
- College of Life Sciences, Chongqing Normal University, Chongqing, 401331, People's Republic of China
| | - Lei Li
- College of Life Sciences, Chongqing Normal University, Chongqing, 401331, People's Republic of China
| | - Xiner Nie
- Key Laboratory of Biorheological Science and Technology, Ministry of Education, Bioengineering College, Chongqing University, Chongqing, 400044, People's Republic of China
| | - Xiaoxi Zhang
- College of Life Sciences, Chongqing Normal University, Chongqing, 401331, People's Republic of China
| | - Guizhao Liang
- Key Laboratory of Biorheological Science and Technology, Ministry of Education, Bioengineering College, Chongqing University, Chongqing, 400044, People's Republic of China
| | - Mingwei Liu
- Key Laboratory of Clinical Laboratory Diagnostics, College of Laboratory Medicine, Chongqing Medical University, Chongqing, 400016, People's Republic of China
| | - Youjin Hao
- College of Life Sciences, Chongqing Normal University, Chongqing, 401331, People's Republic of China.
| | - Bo Li
- College of Life Sciences, Chongqing Normal University, Chongqing, 401331, People's Republic of China.
| |
Collapse
|
2
|
Bai J, Kamatchinathan S, Kundu DJ, Bandla C, Vizcaíno JA, Perez-Riverol Y. Open-source large language models in action: A bioinformatics chatbot for PRIDE database. Proteomics 2024:e2400005. [PMID: 38556628 DOI: 10.1002/pmic.202400005] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/04/2024] [Revised: 03/08/2024] [Accepted: 03/20/2024] [Indexed: 04/02/2024]
Abstract
We here present a chatbot assistant infrastructure (https://www.ebi.ac.uk/pride/chatbot/) that simplifies user interactions with the PRIDE database's documentation and dataset search functionality. The framework utilizes multiple Large Language Models (LLM): llama2, chatglm, mixtral (mistral), and openhermes. It also includes a web service API (Application Programming Interface), web interface, and components for indexing and managing vector databases. An Elo-ranking system-based benchmark component is included in the framework as well, which allows for evaluating the performance of each LLM and for improving PRIDE documentation. The chatbot not only allows users to interact with PRIDE documentation but can also be used to search and find PRIDE datasets using an LLM-based recommendation system, enabling dataset discoverability. Importantly, while our infrastructure is exemplified through its application in the PRIDE database context, the modular and adaptable nature of our approach positions it as a valuable tool for improving user experiences across a spectrum of bioinformatics and proteomics tools and resources, among other domains. The integration of advanced LLMs, innovative vector-based construction, the benchmarking framework, and optimized documentation collectively form a robust and transferable chatbot assistant infrastructure. The framework is open-source (https://github.com/PRIDE-Archive/pride-chatbot).
Collapse
Affiliation(s)
- Jingwen Bai
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, UK
| | - Selvakumar Kamatchinathan
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, UK
| | - Deepti J Kundu
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, UK
| | - Chakradhar Bandla
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, UK
| | - Juan Antonio Vizcaíno
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, UK
| | - Yasset Perez-Riverol
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, UK
| |
Collapse
|
3
|
Loreto ELS, Melo ESD, Wallau GL, Gomes TMFF. The good, the bad and the ugly of transposable elements annotation tools. Genet Mol Biol 2024; 46:e20230138. [PMID: 38373163 PMCID: PMC10876081 DOI: 10.1590/1678-4685-gmb-2023-0138] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/09/2023] [Accepted: 11/26/2023] [Indexed: 02/21/2024] Open
Abstract
Transposable elements are repetitive and mobile DNA segments that can be found in virtually all organisms investigated to date. Their complex structure and variable nature are particularly challenging from the genomic annotation point of view. Many softwares have been developed to automate and facilitate TEs annotation at the genomic level, but they are highly heterogeneous regarding documentation, usability and methods. In this review, we revisited the existing software for TE genomic annotation, concentrating on the most often used ones, the methodologies they apply, and usability. Building on the state of the art of TE annotation software we propose best practices and highlight the strengths and weaknesses from the available solutions.
Collapse
Affiliation(s)
- Elgion L S Loreto
- Universidade Federal do Rio Grande do Sul, Programa de Pós-Graduação em Genética e Biologia Molecular, Porto Alegre, RS, Brazil
- Universidade Federal de Santa Maria, Departamento de Bioquímica e Biologia Molecular, Santa Maria, RS, Brazil
| | - Elverson S de Melo
- Fundação Oswaldo Cruz, Instituto Aggeu Magalhães, Departamento de Entomologia, Recife, PE, Brazil
| | - Gabriel L Wallau
- Fundação Oswaldo Cruz, Instituto Aggeu Magalhães, Departamento de Entomologia, Recife, PE, Brazil
| | - Tiago M F F Gomes
- Universidade Federal do Rio Grande do Sul, Programa de Pós-Graduação em Genética e Biologia Molecular, Porto Alegre, RS, Brazil
| |
Collapse
|
4
|
Chicco D, Spolaor S, Nobile MS. Ten quick tips for fuzzy logic modeling of biomedical systems. PLoS Comput Biol 2023; 19:e1011700. [PMID: 38127800 PMCID: PMC10734980 DOI: 10.1371/journal.pcbi.1011700] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/23/2023] Open
Abstract
Fuzzy logic is useful tool to describe and represent biological or medical scenarios, where often states and outcomes are not only completely true or completely false, but rather partially true or partially false. Despite its usefulness and spread, fuzzy logic modeling might easily be done in the wrong way, especially by beginners and unexperienced researchers, who might overlook some important aspects or might make common mistakes. Malpractices and pitfalls, in turn, can lead to wrong or overoptimistic, inflated results, with negative consequences to the biomedical research community trying to comprehend a particular phenomenon, or even to patients suffering from the investigated disease. To avoid common mistakes, we present here a list of quick tips for fuzzy logic modeling any biomedical scenario: some guidelines which should be taken into account by any fuzzy logic practitioner, including experts. We believe our best practices can have a strong impact in the scientific community, allowing researchers who follow them to obtain better, more reliable results and outcomes in biomedical contexts.
Collapse
Affiliation(s)
- Davide Chicco
- Institute of Health Policy Management and Evaluation, University of Toronto, Toronto, Ontario, Canada
- Dipartimento di Informatica Sistemistica e Comunicazione, Università di Milano-Bicocca, Milan, Italy
| | - Simone Spolaor
- Microsystems, Eindhoven University of Technology, Eindhoven, the Netherlands
| | - Marco S. Nobile
- Department of Environmental Sciences Informatics and Statistics, Ca’ Foscari University of Venice, Venice, Italy
| |
Collapse
|
5
|
Lubiana T, Lopes R, Medeiros P, Silva JC, Goncalves ANA, Maracaja-Coutinho V, Nakaya HI. Ten quick tips for harnessing the power of ChatGPT in computational biology. PLoS Comput Biol 2023; 19:e1011319. [PMID: 37561669 PMCID: PMC10414555 DOI: 10.1371/journal.pcbi.1011319] [Citation(s) in RCA: 11] [Impact Index Per Article: 11.0] [Reference Citation Analysis] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 08/12/2023] Open
Affiliation(s)
- Tiago Lubiana
- School of Pharmaceutical Sciences, University of São Paulo, São Paulo, Brazil
| | - Rafael Lopes
- Department of Epidemiology of Microbial Diseases and Public Health Modeling Unit, Yale School of Public Health, New Haven, Connecticut, United States of America
| | | | - Juan Carlo Silva
- School of Pharmaceutical Sciences, University of São Paulo, São Paulo, Brazil
| | | | - Vinicius Maracaja-Coutinho
- Advanced Center for Chronic Diseases, Universidad de Chile, Santiago, Chile
- Centro de Modelamiento Molecular, Biofísica y Bioinformática—CM2B2, Facultad de Ciencias Químicas y Farmacéuticas, Universidad de Chile, Santiago, Chile
- ANID Anillo ACT210004 SYSTEMIX, Rancagua, Chile
- Anillo Inflammation in HIV/AIDS—InflammAIDS, Santiago, Chile
- Beagle Bioinformatics, São Paulo, Brasil & Santiago, Chile
| | - Helder I. Nakaya
- School of Pharmaceutical Sciences, University of São Paulo, São Paulo, Brazil
- Hospital Israelita Albert Einstein, São Paulo, Brazil
| |
Collapse
|
6
|
Chicco D, Cumbo F, Angione C. Ten quick tips for avoiding pitfalls in multi-omics data integration analyses. PLoS Comput Biol 2023; 19:e1011224. [PMID: 37410704 DOI: 10.1371/journal.pcbi.1011224] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 07/08/2023] Open
Abstract
Data are the most important elements of bioinformatics: Computational analysis of bioinformatics data, in fact, can help researchers infer new knowledge about biology, chemistry, biophysics, and sometimes even medicine, influencing treatments and therapies for patients. Bioinformatics and high-throughput biological data coming from different sources can even be more helpful, because each of these different data chunks can provide alternative, complementary information about a specific biological phenomenon, similar to multiple photos of the same subject taken from different angles. In this context, the integration of bioinformatics and high-throughput biological data gets a pivotal role in running a successful bioinformatics study. In the last decades, data originating from proteomics, metabolomics, metagenomics, phenomics, transcriptomics, and epigenomics have been labelled -omics data, as a unique name to refer to them, and the integration of these omics data has gained importance in all biological areas. Even if this omics data integration is useful and relevant, due to its heterogeneity, it is not uncommon to make mistakes during the integration phases. We therefore decided to present these ten quick tips to perform an omics data integration correctly, avoiding common mistakes we experienced or noticed in published studies in the past. Even if we designed our ten guidelines for beginners, by using a simple language that (we hope) can be understood by anyone, we believe our ten recommendations should be taken into account by all the bioinformaticians performing omics data integration, including experts.
Collapse
Affiliation(s)
- Davide Chicco
- Institute of Health Policy Management and Evaluation, University of Toronto, Toronto, Ontario, Canada
| | - Fabio Cumbo
- Genomic Medicine Institute, Lerner Research Institute, Cleveland Clinic, Cleveland, Ohio, United States of America
| | - Claudio Angione
- School of Computing Engineering and Digital Technologies, Teesside University, Middlesbrough, United Kingdom
| |
Collapse
|
7
|
Chicco D, Ferraro Petrillo U, Cattaneo G. Ten quick tips for bioinformatics analyses using an Apache Spark distributed computing environment. PLoS Comput Biol 2023; 19:e1011272. [PMID: 37471333 PMCID: PMC10358940 DOI: 10.1371/journal.pcbi.1011272] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 07/22/2023] Open
Abstract
Some scientific studies involve huge amounts of bioinformatics data that cannot be analyzed on personal computers usually employed by researchers for day-to-day activities but rather necessitate effective computational infrastructures that can work in a distributed way. For this purpose, distributed computing systems have become useful tools to analyze large amounts of bioinformatics data and to generate relevant results on virtual environments, where software can be executed for hours or even days without affecting the personal computer or laptop of a researcher. Even if distributed computing resources have become pivotal in multiple bioinformatics laboratories, often researchers and students use them in the wrong ways, making mistakes that can cause the distributed computers to underperform or that can even generate wrong outcomes. In this context, we present here ten quick tips for the usage of Apache Spark distributed computing systems for bioinformatics analyses: ten simple guidelines that, if taken into account, can help users avoid common mistakes and can help them run their bioinformatics analyses smoothly. Even if we designed our recommendations for beginners and students, they should be followed by experts too. We think our quick tips can help anyone make use of Apache Spark distributed computing systems more efficiently and ultimately help generate better, more reliable scientific results.
Collapse
Affiliation(s)
- Davide Chicco
- Institute of Health Policy Management and Evaluation, University of Toronto, Toronto, Ontario, Canada
| | | | - Giuseppe Cattaneo
- Dipartimento di Informatica, Università di Salerno, Fisciano (Salerno), Italy
| |
Collapse
|
8
|
Greenberg ZF, Graim KS, He M. Towards artificial intelligence-enabled extracellular vesicle precision drug delivery. Adv Drug Deliv Rev 2023:114974. [PMID: 37356623 DOI: 10.1016/j.addr.2023.114974] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/06/2023] [Revised: 06/21/2023] [Accepted: 06/22/2023] [Indexed: 06/27/2023]
Abstract
Extracellular Vesicles (EVs), particularly exosomes, recently exploded into nanomedicine as an emerging drug delivery approach due to their superior biocompatibility, circulating stability, and bioavailability in vivo. However, EV heterogeneity makes molecular targeting precision a critical challenge. Deciphering key molecular drivers for controlling EV tissue targeting specificity is in great need. Artificial intelligence (AI) brings powerful prediction ability for guiding the rational design of engineered EVs in precision control for drug delivery. This review focuses on cutting-edge nano-delivery via integrating large-scale EV data with AI to develop AI-directed EV therapies and illuminate the clinical translation potential. We briefly review the current status of EVs in drug delivery, including the current frontier, limitations, and considerations to advance the field. Subsequently, we detail the future of AI in drug delivery and its impact on precision EV delivery. Our review discusses the current universal challenge of standardization and critical considerations when using AI combined with EVs for precision drug delivery. Finally, we will conclude this review with a perspective on future clinical translation led by a combined effort of AI and EV research.
Collapse
Affiliation(s)
- Zachary F Greenberg
- Department of Pharmaceutics, College of Pharmacy, University of Florida, Gainesville, Florida, 32610, USA
| | - Kiley S Graim
- Department of Computer & Information Science & Engineering, Herbert Wertheim College of Engineering, University of Florida, Gainesville, Florida, 32610, USA
| | - Mei He
- Department of Pharmaceutics, College of Pharmacy, University of Florida, Gainesville, Florida, 32610, USA.
| |
Collapse
|
9
|
Du X, Dastmalchi F, Ye H, Garrett TJ, Diller MA, Liu M, Hogan WR, Brochhausen M, Lemas DJ. Evaluating LC-HRMS metabolomics data processing software using FAIR principles for research software. Metabolomics 2023; 19:11. [PMID: 36745241 DOI: 10.1007/s11306-023-01974-3] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 12/08/2022] [Accepted: 01/20/2023] [Indexed: 02/07/2023]
Abstract
BACKGROUND Liquid chromatography-high resolution mass spectrometry (LC-HRMS) is a popular approach for metabolomics data acquisition and requires many data processing software tools. The FAIR Principles - Findability, Accessibility, Interoperability, and Reusability - were proposed to promote open science and reusable data management, and to maximize the benefit obtained from contemporary and formal scholarly digital publishing. More recently, the FAIR principles were extended to include Research Software (FAIR4RS). AIM OF REVIEW This study facilitates open science in metabolomics by providing an implementation solution for adopting FAIR4RS in the LC-HRMS metabolomics data processing software. We believe our evaluation guidelines and results can help improve the FAIRness of research software. KEY SCIENTIFIC CONCEPTS OF REVIEW We evaluated 124 LC-HRMS metabolomics data processing software obtained from a systematic review and selected 61 software for detailed evaluation using FAIR4RS-related criteria, which were extracted from the literature along with internal discussions. We assigned each criterion one or more FAIR4RS categories through discussion. The minimum, median, and maximum percentages of criteria fulfillment of software were 21.6%, 47.7%, and 71.8%. Statistical analysis revealed no significant improvement in FAIRness over time. We identified four criteria covering multiple FAIR4RS categories but had a low %fulfillment: (1) No software had semantic annotation of key information; (2) only 6.3% of evaluated software were registered to Zenodo and received DOIs; (3) only 14.5% of selected software had official software containerization or virtual machine; (4) only 16.7% of evaluated software had a fully documented functions in code. According to the results, we discussed improvement strategies and future directions.
Collapse
Affiliation(s)
- Xinsong Du
- Department of Health Outcomes and Biomedical Informatics, University of Florida College of Medicine, Gainesville, FL, USA
| | - Farhad Dastmalchi
- Department of Health Outcomes and Biomedical Informatics, University of Florida College of Medicine, Gainesville, FL, USA
| | - Hao Ye
- Health Science Center Libraries, University of Florida, Florida, USA
| | - Timothy J Garrett
- Department of Pathology, Immunology and Laboratory Medicine, College of Medicine, University of Florida, Florida, USA
| | - Matthew A Diller
- Department of Health Outcomes and Biomedical Informatics, University of Florida College of Medicine, Gainesville, FL, USA
| | - Mei Liu
- Department of Health Outcomes and Biomedical Informatics, University of Florida College of Medicine, Gainesville, FL, USA
| | - William R Hogan
- Department of Health Outcomes and Biomedical Informatics, University of Florida College of Medicine, Gainesville, FL, USA
| | - Mathias Brochhausen
- Department of Biomedical Informatics, College of Medicine, University of Arkansas for Medical Sciences, Little Rock, USA
| | - Dominick J Lemas
- Department of Health Outcomes and Biomedical Informatics, University of Florida College of Medicine, Gainesville, FL, USA.
- Department of Obstetrics and Gynecology, University of Florida College of Medicine, Florida, Gainesville, United States.
- Center for Perinatal Outcomes Research, University of Florida College of Medicine, Gainesville, United States.
| |
Collapse
|
10
|
Abstract
Applying computational statistics or machine learning methods to data is a key component of many scientific studies, in any field, but alone might not be sufficient to generate robust and reliable outcomes and results. Before applying any discovery method, preprocessing steps are necessary to prepare the data to the computational analysis. In this framework, data cleaning and feature engineering are key pillars of any scientific study involving data analysis and that should be adequately designed and performed since the first phases of the project. We call "feature" a variable describing a particular trait of a person or an observation, recorded usually as a column in a dataset. Even if pivotal, these data cleaning and feature engineering steps sometimes are done poorly or inefficiently, especially by beginners and unexperienced researchers. For this reason, we propose here our quick tips for data cleaning and feature engineering on how to carry out these important preprocessing steps correctly avoiding common mistakes and pitfalls. Although we designed these guidelines with bioinformatics and health informatics scenarios in mind, we believe they can more in general be applied to any scientific area. We therefore target these guidelines to any researcher or practitioners wanting to perform data cleaning or feature engineering. We believe our simple recommendations can help researchers and scholars perform better computational analyses that can lead, in turn, to more solid outcomes and more reliable discoveries.
Collapse
Affiliation(s)
- Davide Chicco
- Institute of Health Policy Management and Evaluation, University of Toronto, Toronto, Ontario, Canada
- * E-mail:
| | - Luca Oneto
- Dipartimento di Informatica Bioingegneria Robotica e Ingegneria dei Sistemi, Università di Genova, Genoa, Italy
- ZenaByte S.r.l., Genoa, Italy
| | - Erica Tavazzi
- Dipartimento di Ingegneria dell’Informazione, Università di Padova, Padua, Italy
| |
Collapse
|
11
|
Abstract
Pathway enrichment analysis (PEA) is a computational biology method that identifies biological functions that are overrepresented in a group of genes more than would be expected by chance and ranks these functions by relevance. The relative abundance of genes pertinent to specific pathways is measured through statistical methods, and associated functional pathways are retrieved from online bioinformatics databases. In the last decade, along with the spread of the internet, higher availability of computational resources made PEA software tools easy to access and to use for bioinformatics practitioners worldwide. Although it became easier to use these tools, it also became easier to make mistakes that could generate inflated or misleading results, especially for beginners and inexperienced computational biologists. With this article, we propose nine quick tips to avoid common mistakes and to out a complete, sound, thorough PEA, which can produce relevant and robust results. We describe our nine guidelines in a simple way, so that they can be understood and used by anyone, including students and beginners. Some tips explain what to do before starting a PEA, others are suggestions of how to correctly generate meaningful results, and some final guidelines indicate some useful steps to properly interpret PEA results. Our nine tips can help users perform better pathway enrichment analyses and eventually contribute to a better understanding of current biology.
Collapse
|
12
|
Usability evaluation of circRNA identification tools: Development of a heuristic-based framework and analysis. Comput Biol Med 2022; 147:105785. [PMID: 35780604 DOI: 10.1016/j.compbiomed.2022.105785] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/25/2022] [Revised: 05/23/2022] [Accepted: 06/26/2022] [Indexed: 11/21/2022]
Abstract
BACKGROUND AND OBJECTIVE Circular RNAs (circRNAs) are endogenous molecules of non-coding RNA that form a covalently closed loop at the 3' and 5' ends. Recently, the role of these molecules in the regulation of gene expression and their involvement in several human pathologies has gained notoriety. The identification of circRNAs is highly dependent on computational methods for analyzing RNA sequencing data. However, bioinformatics software is known to be problematic in terms of usability. Evidence points out that tools for identifying circRNAs can have such problems, negatively impacting researchers in this field. Here we present a heuristic-based framework for evaluating the usability of command-line circRNA identification software. METHODS We used heuristics evaluation to comprehensively identify the usability issues in a sample of circRNA identification tools. RESULTS We identified 46 usability issues presented individually in four tools. Most of the issues had cosmetic or minor severity. These are unlikely to challenge experienced users but may cause inconvenience for novice users. We also identified severe issues with the potential to harm users regardless of their experience. The areas most affected were the documentation and the installability of the tools. CONCLUSIONS With the proposed framework, we formally describe, for the first time, the usability problems that can affect users in this area of circRNA research. We hope that our framework can help researchers evaluate their software's usability during development.
Collapse
|
13
|
Niu YN, Roberts EG, Denisko D, Hoffman MM. Assessing and assuring interoperability of a genomics file format. Bioinformatics 2022; 38:3327-3336. [PMID: 35575355 PMCID: PMC9237710 DOI: 10.1093/bioinformatics/btac327] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/07/2022] [Revised: 03/30/2022] [Accepted: 05/11/2022] [Indexed: 12/01/2022] Open
Abstract
Motivation Bioinformatics software tools operate largely through the use of specialized genomics file formats. Often these formats lack formal specification, making it difficult or impossible for the creators of these tools to robustly test them for correct handling of input and output. This causes problems in interoperability between different tools that, at best, wastes time and frustrates users. At worst, interoperability issues could lead to undetected errors in scientific results. Results We developed a new verification system, Acidbio, which tests for correct behavior in bioinformatics software packages. We crafted tests to unify correct behavior when tools encounter various edge cases—potentially unexpected inputs that exemplify the limits of the format. To analyze the performance of existing software, we tested the input validation of 80 Bioconda packages that parsed the Browser Extensible Data (BED) format. We also used a fuzzing approach to automatically perform additional testing. Of 80 software packages examined, 75 achieved less than 70% correctness on our test suite. We categorized multiple root causes for the poor performance of different types of software. Fuzzing detected other errors that the manually designed test suite could not. We also created a badge system that developers can use to indicate more precisely which BED variants their software accepts and to advertise the software’s performance on the test suite. Availability and implementation Acidbio is available at https://github.com/hoffmangroup/acidbio. Supplementary information Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Yi Nian Niu
- Princess Margaret Cancer Centre University Health Network, Toronto, ON, M5G 2C1, Canada
| | - Eric G Roberts
- Princess Margaret Cancer Centre University Health Network, Toronto, ON, M5G 2C1, Canada
| | - Danielle Denisko
- Princess Margaret Cancer Centre University Health Network, Toronto, ON, M5G 2C1, Canada.,Department of Medical Biophysics, University of Toronto, Toronto, ON, M5G 1L7, Canada
| | - Michael M Hoffman
- Princess Margaret Cancer Centre University Health Network, Toronto, ON, M5G 2C1, Canada.,Department of Medical Biophysics, University of Toronto, Toronto, ON, M5G 1L7, Canada.,Department of Computer Science, University of Toronto, Toronto, ON, M5S 2E4, Canada.,Vector Institute, Toronto, ON, M5G 1M1, Canada
| |
Collapse
|
14
|
Hermann S, Fehr J. Documenting research software in engineering science. Sci Rep 2022; 12:6567. [PMID: 35449149 PMCID: PMC9023583 DOI: 10.1038/s41598-022-10376-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/07/2022] [Accepted: 04/05/2022] [Indexed: 11/09/2022] Open
Abstract
The reuse of research software needs good documentation, however, the documentation in particular is often criticized. Especially in non-IT specific disciplines, the lack of documentation is attributed to the lack of training, the lack of time or missing rewards. This article addresses the hypothesis that scientists do document but do not know exactly what they need to document, why, and for whom. In order to evaluate the actual documentation practice of research software, we examined existing recommendations, and we evaluated their implementation in everyday practice using a concrete example from the engineering sciences and compared the findings with best practice examples. To get a broad overview of what documentation of research software entailed, we defined categories and used them to conduct the research. Our results show that the big picture of what documentation of research software means is missing. Recommendations do not consider the important role of researchers, who write research software, whose documentation takes mainly place in their research articles. Moreover, we show that research software always has a history that influences the documentation.
Collapse
Affiliation(s)
- Sibylle Hermann
- Institute of Engineering and Computational Mechanics, University of Stuttgart, 70569, Stuttgart, Germany. .,Cluster of Excellence SimTech, University of Stuttgart, 70569, Stuttgart, Germany.
| | - Jörg Fehr
- Institute of Engineering and Computational Mechanics, University of Stuttgart, 70569, Stuttgart, Germany
| |
Collapse
|
15
|
van der Putten BCL, Mendes CI, Talbot BM, de Korne-Elenbaas J, Mamede R, Vila-Cerqueira P, Coelho LP, Gulvik CA, Katz LS, The Asm Ngs Hackathon Participants. Software testing in microbial bioinformatics: a call to action. Microb Genom 2022; 8. [PMID: 35259087 PMCID: PMC9176277 DOI: 10.1099/mgen.0.000790] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022] Open
Abstract
Computational algorithms have become an essential component of research, with great efforts by the scientific community to raise standards on development and distribution of code. Despite these efforts, sustainability and reproducibility are major issues since continued validation through software testing is still not a widely adopted practice. Here, we report seven recommendations that help researchers implement software testing in microbial bioinformatics. We have developed these recommendations based on our experience from a collaborative hackathon organised prior to the American Society for Microbiology Next Generation Sequencing (ASM NGS) 2020 conference. We also present a repository hosting examples and guidelines for testing, available from https://github.com/microbinfie-hackathon2020/CSIS.
Collapse
Affiliation(s)
- Boas C L van der Putten
- Department of Medical Microbiology, Amsterdam UMC, University of Amsterdam, the Netherlands.,Department of Global Health, Amsterdam Institute for Global Health and Development, Amsterdam UMC, University of Amsterdam, the Netherlands
| | - C I Mendes
- Instituto de Microbiologia, Instituto de Medicina Molecular, Faculdade de Medicina, Universidade de Lisboa, Lisboa, Portugal
| | - Brooke M Talbot
- Department of Biological and Biomedical Sciences, Emory University, Atlanta, GA, USA
| | - Jolinda de Korne-Elenbaas
- Department of Medical Microbiology, Amsterdam UMC, University of Amsterdam, the Netherlands.,Department of Infectious Diseases, Public Health Laboratory, Public Health Service of Amsterdam, the Netherlands
| | - Rafael Mamede
- Instituto de Microbiologia, Instituto de Medicina Molecular, Faculdade de Medicina, Universidade de Lisboa, Lisboa, Portugal
| | - Pedro Vila-Cerqueira
- Instituto de Microbiologia, Instituto de Medicina Molecular, Faculdade de Medicina, Universidade de Lisboa, Lisboa, Portugal
| | - Luis Pedro Coelho
- Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, PR China.,Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence, PR China
| | - Christopher A Gulvik
- Bacterial Special Pathogens Branch, Division of High-Consequence Pathogens and Pathology, Centers for Disease Control and Prevention, Atlanta, GA, USA
| | - Lee S Katz
- Center for Food Safety, University of Georgia, Griffin, GA, USA.,Enteric Diseases Laboratory Branch, Division of Foodborne, Waterborne, and Environmental Diseases, Centers for Disease Control and Prevention, Atlanta, GA, USA
| | | |
Collapse
|
16
|
Noor A. Improving bioinformatics software quality through incorporation of software engineering practices. PeerJ Comput Sci 2022; 8:e839. [PMID: 35111923 PMCID: PMC8771759 DOI: 10.7717/peerj-cs.839] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/01/2021] [Accepted: 12/13/2021] [Indexed: 06/14/2023]
Abstract
BACKGROUND Bioinformatics software is developed for collecting, analyzing, integrating, and interpreting life science datasets that are often enormous. Bioinformatics engineers often lack the software engineering skills necessary for developing robust, maintainable, reusable software. This study presents review and discussion of the findings and efforts made to improve the quality of bioinformatics software. METHODOLOGY A systematic review was conducted of related literature that identifies core software engineering concepts for improving bioinformatics software development: requirements gathering, documentation, testing, and integration. The findings are presented with the aim of illuminating trends within the research that could lead to viable solutions to the struggles faced by bioinformatics engineers when developing scientific software. RESULTS The findings suggest that bioinformatics engineers could significantly benefit from the incorporation of software engineering principles into their development efforts. This leads to suggestion of both cultural changes within bioinformatics research communities as well as adoption of software engineering disciplines into the formal education of bioinformatics engineers. Open management of scientific bioinformatics development projects can result in improved software quality through collaboration amongst both bioinformatics engineers and software engineers. CONCLUSIONS While strides have been made both in identification and solution of issues of particular import to bioinformatics software development, there is still room for improvement in terms of shifts in both the formal education of bioinformatics engineers as well as the culture and approaches of managing scientific bioinformatics research and development efforts.
Collapse
|
17
|
Heil BJ, Hoffman MM, Markowetz F, Lee SI, Greene CS, Hicks SC. Reproducibility standards for machine learning in the life sciences. Nat Methods 2021; 18:1132-1135. [PMID: 34462593 PMCID: PMC9131851 DOI: 10.1038/s41592-021-01256-7] [Citation(s) in RCA: 60] [Impact Index Per Article: 20.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/08/2023]
Abstract
To make machine learning analyses in the life sciences more computationally reproducible, we propose standards based on data, model, and code publication, programming best practices, and workflow automation. By meeting these standards, the community of researchers applying machine learning methods in the life sciences can ensure that their analyses are worthy of trust. this article has been peer reviewed.
Collapse
Affiliation(s)
- Benjamin J Heil
- Genomics and Computational Biology Graduate Group, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
| | - Michael M Hoffman
- Princess Margaret Cancer Centre, University Health Network, Toronto, Ontario, Canada
- Department of Medical Biophysics, University of Toronto, Toronto, Ontario, Canada
- Department of Computer Science, University of Toronto, Toronto, Ontario, Canada
- Vector Institute, Toronto, Ontario, Canada
| | - Florian Markowetz
- Cancer Research UK Cambridge Institute, University of Cambridge, Cambridge, UK
| | - Su-In Lee
- Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, WA, USA
| | - Casey S Greene
- Department of Biochemistry and Molecular Genetics, University of Colorado School of Medicine, Aurora, CO, USA.
- Center for Health AI, University of Colorado School of Medicine, Aurora, CO, USA.
| | - Stephanie C Hicks
- Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA.
| |
Collapse
|
18
|
Wratten L, Wilm A, Göke J. Reproducible, scalable, and shareable analysis pipelines with bioinformatics workflow managers. Nat Methods 2021; 18:1161-1168. [PMID: 34556866 DOI: 10.1038/s41592-021-01254-9] [Citation(s) in RCA: 39] [Impact Index Per Article: 13.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/20/2020] [Accepted: 07/29/2021] [Indexed: 02/08/2023]
Abstract
The rapid growth of high-throughput technologies has transformed biomedical research. With the increasing amount and complexity of data, scalability and reproducibility have become essential not just for experiments, but also for computational analysis. However, transforming data into information involves running a large number of tools, optimizing parameters, and integrating dynamically changing reference data. Workflow managers were developed in response to such challenges. They simplify pipeline development, optimize resource usage, handle software installation and versions, and run on different compute platforms, enabling workflow portability and sharing. In this Perspective, we highlight key features of workflow managers, compare commonly used approaches for bioinformatics workflows, and provide a guide for computational and noncomputational users. We outline community-curated pipeline initiatives that enable novice and experienced users to perform complex, best-practice analyses without having to manually assemble workflows. In sum, we illustrate how workflow managers contribute to making computational analysis in biomedical research shareable, scalable, and reproducible.
Collapse
Affiliation(s)
| | | | - Jonathan Göke
- Genome Institute of Singapore, Singapore, Singapore.
| |
Collapse
|
19
|
Chang HY, Colby SM, Du X, Gomez JD, Helf MJ, Kechris K, Kirkpatrick CR, Li S, Patti GJ, Renslow RS, Subramaniam S, Verma M, Xia J, Young JD. A Practical Guide to Metabolomics Software Development. Anal Chem 2021; 93:1912-1923. [PMID: 33467846 PMCID: PMC7859930 DOI: 10.1021/acs.analchem.0c03581] [Citation(s) in RCA: 24] [Impact Index Per Article: 8.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/22/2022]
Abstract
![]()
A growing number
of software tools have been developed for metabolomics
data processing and analysis. Many new tools are contributed by metabolomics
practitioners who have limited prior experience with software development,
and the tools are subsequently implemented by users with expertise
that ranges from basic point-and-click data analysis to advanced coding.
This Perspective is intended to introduce metabolomics software users
and developers to important considerations that determine the overall
impact of a publicly available tool within the scientific community.
The recommendations reflect the collective experience of an NIH-sponsored
Metabolomics Consortium working group that was formed with the goal
of researching guidelines and best practices for metabolomics tool
development. The recommendations are aimed at metabolomics researchers
with little formal background in programming and are organized into
three stages: (i) preparation, (ii) tool development, and (iii) distribution
and maintenance.
Collapse
Affiliation(s)
- Hui-Yin Chang
- Department of Pathology, University of Michigan, 1301 Catherine Street, Ann Arbor, Michigan 48109, United States.,Department of Biomedical Sciences and Engineering, National Central University, No. 300, Zhongda Road, Zhongli District, Taoyuan City 320, Taiwan
| | - Sean M Colby
- Biological Sciences Division, Pacific Northwest National Laboratory, P.O. Box 999, MSIN: K8-98, Richland, Washington 99352, United States
| | - Xiuxia Du
- Department of Bioinformatics & Genomics, University of North Carolina at Charlotte, 9201 University City Boulevard, Charlotte, North Carolina 28223, United States
| | - Javier D Gomez
- Department of Chemical and Biomolecular Engineering, Vanderbilt University, PMB 351604, 2301 Vanderbilt Place, Nashville, Tennessee 37235, United States
| | - Maximilian J Helf
- Boyce Thompson Institute and Department of Chemistry and Chemical Biology, Cornell University, 533 Tower Road, Ithaca, New York 14853, United States
| | - Katerina Kechris
- Department of Biostatistics and Informatics, University of Colorado Anschutz Medical Campus, 13001 East 17th Place B119, Aurora, Colorado 80045, United States
| | - Christine R Kirkpatrick
- San Diego Supercomputer Center, University of California San Diego, MC 0505, 9500 Gilman Drive, La Jolla, California 92093, United States
| | - Shuzhao Li
- The Jackson Laboratory for Genomic Medicine, 10 Discovery Drive, Farmington, Connecticut 06032, United States
| | - Gary J Patti
- Department of Chemistry, Department of Medicine, and Siteman Cancer Center, Washington University in St. Louis, CB 1134, One Brookings Drive, St. Louis, Missouri 63130, United States
| | - Ryan S Renslow
- Biological Sciences Division, Pacific Northwest National Laboratory, P.O. Box 999, MSIN: K8-98, Richland, Washington 99352, United States.,Gene and Linda Voiland School of Chemical Engineering and Bioengineering, Washington State University, P.O. Box 646515, Pullman, Washington 99164, United States
| | - Shankar Subramaniam
- San Diego Supercomputer Center, University of California San Diego, MC 0505, 9500 Gilman Drive, La Jolla, California 92093, United States.,Department of Bioengineering, Department of Computer Science and Engineering, Department of Cellular and Molecular Medicine, and Department of Chemistry and Biochemistry, University of California San Diego, 9500 Gilman Drive #0412, La Jolla, California 92093, United States
| | - Mukesh Verma
- Epidemiology and Genomics Research Program, National Cancer Institute, National Institutes of Health, Suite 4E102, 9609 Medical Center Drive, MSC 9763, Rockville, Maryland 20850, United States
| | - Jianguo Xia
- Faculty of Agricultural and Environmental Sciences, McGill University, 21111 Lakeshore Road, Ste. Anne de Bellevue, Quebec H9X 3 V9, Canada
| | - Jamey D Young
- Department of Chemical and Biomolecular Engineering, Vanderbilt University, PMB 351604, 2301 Vanderbilt Place, Nashville, Tennessee 37235, United States.,Department of Molecular Physiology and Biophysics, Vanderbilt University, PMB 351604, 2301 Vanderbilt Place, Nashville, Tennessee 37235, United States
| |
Collapse
|
20
|
Maguire F, Jia B, Gray KL, Lau WYV, Beiko RG, Brinkman FSL. Metagenome-assembled genome binning methods with short reads disproportionately fail for plasmids and genomic Islands. Microb Genom 2020; 6:mgen000436. [PMID: 33001022 PMCID: PMC7660262 DOI: 10.1099/mgen.0.000436] [Citation(s) in RCA: 45] [Impact Index Per Article: 11.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/23/2020] [Accepted: 09/04/2020] [Indexed: 12/12/2022] Open
Abstract
Metagenomic methods enable the simultaneous characterization of microbial communities without time-consuming and bias-inducing culturing. Metagenome-assembled genome (MAG) binning methods aim to reassemble individual genomes from this data. However, the recovery of mobile genetic elements (MGEs), such as plasmids and genomic islands (GIs), by binning has not been well characterized. Given the association of antimicrobial resistance (AMR) genes and virulence factor (VF) genes with MGEs, studying their transmission is a public-health priority. The variable copy number and sequence composition of MGEs makes them potentially problematic for MAG binning methods. To systematically investigate this issue, we simulated a low-complexity metagenome comprising 30 GI-rich and plasmid-containing bacterial genomes. MAGs were then recovered using 12 current prediction pipelines and evaluated. While 82-94 % of chromosomes could be correctly recovered and binned, only 38-44 % of GIs and 1-29 % of plasmid sequences were found. Strikingly, no plasmid-borne VF nor AMR genes were recovered, and only 0-45 % of AMR or VF genes within GIs. We conclude that short-read MAG approaches, without further optimization, are largely ineffective for the analysis of mobile genes, including those of public-health importance, such as AMR and VF genes. We propose that researchers should explore developing methods that optimize for this issue and consider also using unassembled short reads and/or long-read approaches to more fully characterize metagenomic data.
Collapse
Affiliation(s)
- Finlay Maguire
- Faculty of Computer Science, Dalhousie University, 6050 University Avenue, Halifax, Nova Scotia, B3H 4R2, Canada
| | - Baofeng Jia
- Department of Molecular Biology and Biochemistry, Simon Fraser University, 8888 University Drive, Burnaby, BC V5A 1S6, Canada
| | - Kristen L. Gray
- Department of Molecular Biology and Biochemistry, Simon Fraser University, 8888 University Drive, Burnaby, BC V5A 1S6, Canada
| | - Wing Yin Venus Lau
- Department of Molecular Biology and Biochemistry, Simon Fraser University, 8888 University Drive, Burnaby, BC V5A 1S6, Canada
| | - Robert G. Beiko
- Faculty of Computer Science, Dalhousie University, 6050 University Avenue, Halifax, Nova Scotia, B3H 4R2, Canada
| | - Fiona S. L. Brinkman
- Department of Molecular Biology and Biochemistry, Simon Fraser University, 8888 University Drive, Burnaby, BC V5A 1S6, Canada
| |
Collapse
|
21
|
Carper DL, Lawrence TJ, Carrell AA, Pelletier DA, Weston DJ. DISCo-microbe: design of an identifiable synthetic community of microbes. PeerJ 2020; 8:e8534. [PMID: 32149021 PMCID: PMC7049465 DOI: 10.7717/peerj.8534] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/13/2019] [Accepted: 01/08/2020] [Indexed: 12/21/2022] Open
Abstract
BACKGROUND Microbiomes are extremely important for their host organisms, providing many vital functions and extending their hosts' phenotypes. Natural studies of host-associated microbiomes can be difficult to interpret due to the high complexity of microbial communities, which hinders our ability to track and identify individual members along with the many factors that structure or perturb those communities. For this reason, researchers have turned to synthetic or constructed communities in which the identities of all members are known. However, due to the lack of tracking methods and the difficulty of creating a more diverse and identifiable community that can be distinguished through next-generation sequencing, most such in vivo studies have used only a few strains. RESULTS To address this issue, we developed DISCo-microbe, a program for the design of an identifiable synthetic community of microbes for use in in vivo experimentation. The program is composed of two modules; (1) create, which allows the user to generate a highly diverse community list from an input DNA sequence alignment using a custom nucleotide distance algorithm, and (2) subsample, which subsamples the community list to either represent a number of grouping variables, including taxonomic proportions, or to reach a user-specified maximum number of community members. As an example, we demonstrate the generation of a synthetic microbial community that can be distinguished through amplicon sequencing. The synthetic microbial community in this example consisted of 2,122 members from a starting DNA sequence alignment of 10,000 16S rRNA sequences from the Ribosomal Database Project. We generated simulated Illumina sequencing data from the constructed community and demonstrate that DISCo-microbe is capable of designing diverse communities with members distinguishable by amplicon sequencing. Using the simulated data we were able to recover sequences from between 97-100% of community members using two different post-processing workflows. Furthermore, 97-99% of sequences were assigned to a community member with zero sequences being misidentified. We then subsampled the community list using taxonomic proportions to mimic a natural plant host-associated microbiome, ultimately yielding a diverse community of 784 members. CONCLUSIONS DISCo-microbe can create a highly diverse community list of microbes that can be distinguished through 16S rRNA gene sequencing, and has the ability to subsample (i.e., design) the community for the desired number of members and taxonomic proportions. Although developed for bacteria, the program allows for any alignment input from any taxonomic group, making it broadly applicable. The software and data are freely available from GitHub (https://github.com/dlcarper/DISCo-microbe) and Python Package Index (PYPI).
Collapse
Affiliation(s)
- Dana L. Carper
- Biosciences Division, Oak Ridge National Laboratory, Oak Ridge, TN, United States of America
| | - Travis J. Lawrence
- Biosciences Division, Oak Ridge National Laboratory, Oak Ridge, TN, United States of America
| | - Alyssa A. Carrell
- Biosciences Division, Oak Ridge National Laboratory, Oak Ridge, TN, United States of America
- Bredesen Center for Interdisciplinary Research and Graduate Education, University of Tennessee—Knoxville, Knoxville, TN, United States of America
| | - Dale A. Pelletier
- Biosciences Division, Oak Ridge National Laboratory, Oak Ridge, TN, United States of America
| | - David J. Weston
- Biosciences Division, Oak Ridge National Laboratory, Oak Ridge, TN, United States of America
| |
Collapse
|
22
|
Georgeson P, Syme A, Sloggett C, Chung J, Dashnow H, Milton M, Lonsdale A, Powell D, Seemann T, Pope B. Bionitio: demonstrating and facilitating best practices for bioinformatics command-line software. Gigascience 2019; 8:giz109. [PMID: 31544213 PMCID: PMC6755254 DOI: 10.1093/gigascience/giz109] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/29/2019] [Revised: 07/16/2019] [Accepted: 08/13/2019] [Indexed: 11/14/2022] Open
Abstract
BACKGROUND Bioinformatics software tools are often created ad hoc, frequently by people without extensive training in software development. In particular, for beginners, the barrier to entry in bioinformatics software development is high, especially if they want to adopt good programming practices. Even experienced developers do not always follow best practices. This results in the proliferation of poorer-quality bioinformatics software, leading to limited scalability and inefficient use of resources; lack of reproducibility, usability, adaptability, and interoperability; and erroneous or inaccurate results. FINDINGS We have developed Bionitio, a tool that automates the process of starting new bioinformatics software projects following recommended best practices. With a single command, the user can create a new well-structured project in 1 of 12 programming languages. The resulting software is functional, carrying out a prototypical bioinformatics task, and thus serves as both a working example and a template for building new tools. Key features include command-line argument parsing, error handling, progress logging, defined exit status values, a test suite, a version number, standardized building and packaging, user documentation, code documentation, a standard open source software license, software revision control, and containerization. CONCLUSIONS Bionitio serves as a learning aid for beginner-to-intermediate bioinformatics programmers and provides an excellent starting point for new projects. This helps developers adopt good programming practices from the beginning of a project and encourages high-quality tools to be developed more rapidly. This also benefits users because tools are more easily installed and consistent in their usage. Bionitio is released as open source software under the MIT License and is available at https://github.com/bionitio-team/bionitio.
Collapse
Affiliation(s)
- Peter Georgeson
- Melbourne Bioinformatics, The University of Melbourne, 187 Grattan Street, Carlton, Victoria, Australia 3053
- Colorectal Oncogenomics Group, Department of Clinical Pathology, The University of Melbourne, Victorian Comprehensive Cancer Centre, 305 Grattan Street, Melbourne, Victoria, Australia 3000
| | - Anna Syme
- Melbourne Bioinformatics, The University of Melbourne, 187 Grattan Street, Carlton, Victoria, Australia 3053
- Royal Botanic Gardens Victoria, Birdwood Avenue, Melbourne, Victoria, Australia 3004
| | - Clare Sloggett
- Melbourne Bioinformatics, The University of Melbourne, 187 Grattan Street, Carlton, Victoria, Australia 3053
| | - Jessica Chung
- Melbourne Bioinformatics, The University of Melbourne, 187 Grattan Street, Carlton, Victoria, Australia 3053
| | - Harriet Dashnow
- Bioinformatics, Murdoch Children's Research Institute, Royal Children's Hospital, Flemington Road, Parkville, Victoria, Australia 3052
- School of BioSciences, The University of Melbourne, Royal Parade, Parkville, Victoria, Australia 3052
| | - Michael Milton
- Melbourne Bioinformatics, The University of Melbourne, 187 Grattan Street, Carlton, Victoria, Australia 3053
- Melbourne Genomics Health Alliance, Walter and Eliza Hall Institute, 1G Royal Parade, Parkville, Victoria, Australia 3052
| | - Andrew Lonsdale
- Bioinformatics, Murdoch Children's Research Institute, Royal Children's Hospital, Flemington Road, Parkville, Victoria, Australia 3052
- ARC Centre of Excellence in Plant Cell Walls, School of BioSciences, The University of Melbourne, Royal Parade, Parkville, Victoria, Australia 3052
| | - David Powell
- Monash Bioinformatics Platform, Biomedicine Discovery Institute, Faculty of Medicine, Nursing and Health Sciences, 15 Innovation Walk, Monash University, Clayton, Victoria, Australia 3800
| | - Torsten Seemann
- Melbourne Bioinformatics, The University of Melbourne, 187 Grattan Street, Carlton, Victoria, Australia 3053
- Department of Microbiology and Immunology, Doherty Institute for Infection and Immunity, The University of Melbourne, 792 Elizabeth Street Melbourne, Victoria, Australia 3000
| | - Bernard Pope
- Melbourne Bioinformatics, The University of Melbourne, 187 Grattan Street, Carlton, Victoria, Australia 3053
- Colorectal Oncogenomics Group, Department of Clinical Pathology, The University of Melbourne, Victorian Comprehensive Cancer Centre, 305 Grattan Street, Melbourne, Victoria, Australia 3000
- Department of Medicine, Central Clinical School, Monash University, Clayton, Victoria, Australia 3800
| |
Collapse
|
23
|
Sumonja N, Gemovic B, Veljkovic N, Perovic V. Automated feature engineering improves prediction of protein-protein interactions. Amino Acids 2019; 51:1187-1200. [PMID: 31278492 DOI: 10.1007/s00726-019-02756-9] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/05/2019] [Accepted: 06/26/2019] [Indexed: 10/26/2022]
Abstract
Over the last decade, various machine learning (ML) and statistical approaches for protein-protein interaction (PPI) predictions have been developed to help annotating functional interactions among proteins, essential for our system-level understanding of life. Efficient ML approaches require informative and non-redundant features. In this paper, we introduce novel types of expert-crafted sequence, evolutionary and graph features and apply automatic feature engineering to further expand feature space to improve predictive modeling. The two-step automatic feature-engineering process encompasses the hybrid method for feature generation and unsupervised feature selection, followed by supervised feature selection through a genetic algorithm (GA). The optimization of both steps allows the feature-engineering procedure to operate on a large transformed feature space with no considerable computational cost and to efficiently provide newly engineered features. Based on GA and correlation filtering, we developed a stacking algorithm GA-STACK for automatic ensembling of different ML algorithms to improve prediction performance. We introduced a unified method, HP-GAS, for the prediction of human PPIs, which incorporates GA-STACK and rests on both expert-crafted and 40% of newly engineered features. The extensive cross validation and comparison with the state-of-the-art methods showed that HP-GAS represents currently the most efficient method for proteome-wide forecasting of protein interactions, with prediction efficacy of 0.93 AUC and 0.85 accuracy. We implemented the HP-GAS method as a free standalone application which is a time-efficient and easy-to-use tool. HP-GAS software with supplementary data can be downloaded from: http://www.vinca.rs/180/tools/HP-GAS.php .
Collapse
Affiliation(s)
- Neven Sumonja
- Laboratory for Bioinformatics and Computational Chemistry, Vinca Institute of Nuclear Sciences, University of Belgrade, Mike Petrovica Alasa 12-14, Vinca, Belgrade, 11351, Serbia
| | - Branislava Gemovic
- Laboratory for Bioinformatics and Computational Chemistry, Vinca Institute of Nuclear Sciences, University of Belgrade, Mike Petrovica Alasa 12-14, Vinca, Belgrade, 11351, Serbia
| | - Nevena Veljkovic
- Laboratory for Bioinformatics and Computational Chemistry, Vinca Institute of Nuclear Sciences, University of Belgrade, Mike Petrovica Alasa 12-14, Vinca, Belgrade, 11351, Serbia
| | - Vladimir Perovic
- Laboratory for Bioinformatics and Computational Chemistry, Vinca Institute of Nuclear Sciences, University of Belgrade, Mike Petrovica Alasa 12-14, Vinca, Belgrade, 11351, Serbia.
| |
Collapse
|
24
|
Mangul S, Mosqueiro T, Abdill RJ, Duong D, Mitchell K, Sarwal V, Hill B, Brito J, Littman RJ, Statz B, Lam AKM, Dayama G, Grieneisen L, Martin LS, Flint J, Eskin E, Blekhman R. Challenges and recommendations to improve the installability and archival stability of omics computational tools. PLoS Biol 2019; 17:e3000333. [PMID: 31220077 PMCID: PMC6605654 DOI: 10.1371/journal.pbio.3000333] [Citation(s) in RCA: 33] [Impact Index Per Article: 6.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Revised: 07/02/2019] [Indexed: 01/07/2023] Open
Abstract
Developing new software tools for analysis of large-scale biological data is a key component of advancing modern biomedical research. Scientific reproduction of published findings requires running computational tools on data generated by such studies, yet little attention is presently allocated to the installability and archival stability of computational software tools. Scientific journals require data and code sharing, but none currently require authors to guarantee the continuing functionality of newly published tools. We have estimated the archival stability of computational biology software tools by performing an empirical analysis of the internet presence for 36,702 omics software resources published from 2005 to 2017. We found that almost 28% of all resources are currently not accessible through uniform resource locators (URLs) published in the paper they first appeared in. Among the 98 software tools selected for our installability test, 51% were deemed "easy to install," and 28% of the tools failed to be installed at all because of problems in the implementation. Moreover, for papers introducing new software, we found that the number of citations significantly increased when authors provided an easy installation process. We propose for incorporation into journal policy several practical solutions for increasing the widespread installability and archival stability of published bioinformatics software.
Collapse
Affiliation(s)
- Serghei Mangul
- Department of Computer Science, University of California Los Angeles, Los Angeles, California, United States of America
- Institute for Quantitative and Computational Biosciences, University of California Los Angeles, Los Angeles, California, United States of America
| | - Thiago Mosqueiro
- Institute for Quantitative and Computational Biosciences, University of California Los Angeles, Los Angeles, California, United States of America
| | - Richard J. Abdill
- Department of Genetics, Cell Biology, and Development, University of Minnesota, Minneapolis, Minnesota, United States of America
| | - Dat Duong
- Department of Computer Science, University of California Los Angeles, Los Angeles, California, United States of America
| | - Keith Mitchell
- Department of Computer Science, University of California Los Angeles, Los Angeles, California, United States of America
| | - Varuni Sarwal
- Indian Institute of Technology Delhi, Hauz Khas, New Delhi, India
| | - Brian Hill
- Department of Computer Science, University of California Los Angeles, Los Angeles, California, United States of America
| | - Jaqueline Brito
- Institute of Mathematics and Computer Science, University of São Paulo, São Paulo, Brazil
| | - Russell Jared Littman
- Department of Computer Science, University of California Los Angeles, Los Angeles, California, United States of America
| | - Benjamin Statz
- Department of Computer Science, University of California Los Angeles, Los Angeles, California, United States of America
| | - Angela Ka-Mei Lam
- Department of Computer Science, University of California Los Angeles, Los Angeles, California, United States of America
| | - Gargi Dayama
- Department of Genetics, Cell Biology, and Development, University of Minnesota, Minneapolis, Minnesota, United States of America
| | - Laura Grieneisen
- Department of Genetics, Cell Biology, and Development, University of Minnesota, Minneapolis, Minnesota, United States of America
| | - Lana S. Martin
- Institute for Quantitative and Computational Biosciences, University of California Los Angeles, Los Angeles, California, United States of America
| | - Jonathan Flint
- Center for Neurobehavioral Genetics, Semel Institute for Neuroscience and Human Behavior, University of California Los Angeles, Los Angeles, California, United States of America
| | - Eleazar Eskin
- Department of Computer Science, University of California Los Angeles, Los Angeles, California, United States of America
- Department of Human Genetics, University of California Los Angeles, Los Angeles, California, United States of America
| | - Ran Blekhman
- Department of Genetics, Cell Biology, and Development, University of Minnesota, Minneapolis, Minnesota, United States of America
- Department of Ecology, Evolution, and Behavior, University of Minnesota, Minnesota, United States of America
| |
Collapse
|
25
|
Hart SN. Will Digital Pathology be as Disruptive as Genomics? J Pathol Inform 2018; 9:27. [PMID: 30167342 PMCID: PMC6106127 DOI: 10.4103/jpi.jpi_25_18] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/09/2018] [Accepted: 06/20/2018] [Indexed: 11/23/2022] Open
Abstract
Digital pathology is the science of performing traditional pathological assessment in a digital environment. A digital transition is long overdue since histochemical analysis such as hematoxylin and eosin staining has remained unchanged in over 100 years. Importantly, the digitization of whole slide images further lends itself to advances in computational pathology and artificial intelligence to transform qualitative assessment into quantitative assessment. The impact of this transition from a computational infrastructure perspective is reminiscent of a similar transition in the field of genomics. In this article, I describe some of the similarities between genomics and digital pathology as well as highlight some key lessons learned to prevent the same mistakes and delays that slowed the genomics revolution.
Collapse
Affiliation(s)
- Steven N Hart
- Department of Health Sciences Research, Mayo Clinic, Rochester, Minnesota, USA
| |
Collapse
|
26
|
Chicco D. Ten quick tips for machine learning in computational biology. BioData Min 2017; 10:35. [PMID: 29234465 PMCID: PMC5721660 DOI: 10.1186/s13040-017-0155-3] [Citation(s) in RCA: 319] [Impact Index Per Article: 45.6] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/31/2017] [Accepted: 11/08/2017] [Indexed: 11/12/2022] Open
Abstract
Machine learning has become a pivotal tool for many projects in computational biology, bioinformatics, and health informatics. Nevertheless, beginners and biomedical researchers often do not have enough experience to run a data mining project effectively, and therefore can follow incorrect practices, that may lead to common mistakes or over-optimistic results. With this review, we present ten quick tips to take advantage of machine learning in any computational biology context, by avoiding some common errors that we observed hundreds of times in multiple bioinformatics projects. We believe our ten suggestions can strongly help any machine learning practitioner to carry on a successful project in computational biology and related sciences.
Collapse
Affiliation(s)
- Davide Chicco
- Princess Margaret Cancer Centre, PMCR Tower 11-401, 101 College Street, Toronto, Ontario, M5G 1L7 Canada
| |
Collapse
|
27
|
Abstract
Software produced for research, published and otherwise, suffers from a number of common problems that make it difficult or impossible to run outside the original institution or even off the primary developer's computer. We present ten simple rules to make such software robust enough to be run by anyone, anywhere, and thereby delight your users and collaborators.
Collapse
Affiliation(s)
- Morgan Taschuk
- Genome Sequence Informatics, Ontario Institute for Cancer Research, Toronto, Ontario, Canada
| | - Greg Wilson
- Software Carpentry Foundation, Austin, Texas, United States of America
| |
Collapse
|