1
Ryzhkov FV, Ryzhkova YE, Elinson MN. Machine learning: Python tools for studying biomolecules and drug design. Mol Divers 2025:10.1007/s11030-025-11199-2. [PMID: 40301135 DOI: 10.1007/s11030-025-11199-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/06/2025] [Accepted: 04/13/2025] [Indexed: 05/01/2025]
Abstract
The increasing adoption of computational methods and artificial intelligence in scientific research has led to a growing interest in versatile tools like Python. In the fields of medical chemistry, biochemistry, and bioinformatics, Python has emerged as a key language for tackling complex challenges. It is used to solve various tasks, such as drug discovery, high-throughput and virtual screening, protein and genome analysis, and predicting drug efficacy. This review presents a list of tools for these tasks, including scripts, libraries, and ready-made programs, and serves as a starting point for scientists wishing to apply automation or optimization to routine tasks in medical chemistry and bioinformatics.
Affiliation(s)
- Fedor V Ryzhkov
  - N. D. Zelinsky Institute of Organic Chemistry, Russian Academy of Sciences, 47 Leninsky Prospekt, 119991, Moscow, Russia
- Yuliya E Ryzhkova
  - N. D. Zelinsky Institute of Organic Chemistry, Russian Academy of Sciences, 47 Leninsky Prospekt, 119991, Moscow, Russia
- Michail N Elinson
  - N. D. Zelinsky Institute of Organic Chemistry, Russian Academy of Sciences, 47 Leninsky Prospekt, 119991, Moscow, Russia
2
Pickard J, Sturgess VE, McDonald KO, Rossiter N, Arnold KB, Shah YM, Rajapakse I, Beard DA. A Hands-On Introduction to Data Analytics for Biomedical Research. FUNCTION 2025; 6:zqaf015. [PMID: 40199731 PMCID: PMC11999024 DOI: 10.1093/function/zqaf015] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/06/2024] [Revised: 03/07/2025] [Accepted: 03/12/2025] [Indexed: 04/10/2025] Open
Abstract
Artificial intelligence (AI) applications are having increasing impacts in the biomedical sciences. Modern AI tools enable uncovering hidden patterns in large datasets, forecasting outcomes, and numerous other applications. Despite the availability and power of these tools, the rapid expansion and complexity of AI applications can be daunting, and there is a conspicuous absence of consensus on their ethical and responsible use. Misapplication of AI can result in invalid, unclear, or biased outcomes, exacerbated by the unfamiliarity of many biomedical researchers with the underlying mathematical and computational principles. To address these challenges, this review and tutorial paper aims to achieve three primary objectives: (1) highlight prevalent data science applications in biomedical research, including data visualization, dimensionality reduction, missing data imputation, and predictive model training and evaluation; (2) provide comprehensible explanations of the mathematical foundations underpinning these methodologies; and (3) guide readers on the effective use and interpretation of software tools for implementing these methods in biomedical contexts. While introductory, this guide covers core principles essential for understanding advanced applications, empowering readers to critically interpret results, assess tools, and explore the potential and limitations of machine learning in their research. Ultimately, this paper serves as a practical foundation for biomedical researchers to confidently navigate the growing intersection of AI and biomedicine.
Affiliation(s)
- Joshua Pickard
  - Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48105, USA
- Victoria E Sturgess
  - Department of Biomedical Engineering, University of Michigan, Ann Arbor, MI 48105, USA
- Katherine O McDonald
  - Department of Molecular and Integrative Physiology, University of Michigan, Ann Arbor, MI 48105, USA
- Nicholas Rossiter
  - Cellular and Molecular Biology Program, University of Michigan, Ann Arbor, MI 48105, USA
- Kelly B Arnold
  - Department of Biomedical Engineering, University of Michigan, Ann Arbor, MI 48105, USA
- Yatrik M Shah
  - Department of Molecular and Integrative Physiology, University of Michigan, Ann Arbor, MI 48105, USA
- Indika Rajapakse
  - Department of Molecular and Integrative Physiology, University of Michigan, Ann Arbor, MI 48105, USA
- Daniel A Beard
  - Department of Molecular and Integrative Physiology, University of Michigan, Ann Arbor, MI 48105, USA
3
Forero DA, Bonilla DA, González-Giraldo Y, Patrinos GP. An overview of key online resources for human genomics: a powerful and open toolbox for in silico research. Brief Funct Genomics 2024; 23:754-764. [PMID: 38993146 DOI: 10.1093/bfgp/elae029] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/15/2024] [Revised: 06/19/2024] [Accepted: 06/25/2024] [Indexed: 07/13/2024] Open
Abstract
Recent advances in high-throughput molecular methods have led to an extraordinary volume of genomics data. Simultaneously, the progress in the computational implementation of novel algorithms has facilitated the creation of hundreds of freely available online tools for their advanced analyses. However, a general overview of the most commonly used tools for the in silico analysis of genomics data is still missing. In the current article, we present an overview of commonly used online resources for genomics research, including over 50 tools. This selection will be helpful for scientists with basic or intermediate skills in the in silico analyses of genomics data, such as researchers and students from wet labs seeking to strengthen their computational competencies. In addition, we discuss current needs and future perspectives within this field.
Affiliation(s)
- Diego A Forero
  - School of Health and Sport Sciences, Fundación Universitaria del Área Andina, Bogotá, Colombia
- Diego A Bonilla
  - Research Division, Dynamical Business & Science Society - DBSS International SAS, Bogotá, Colombia
  - Hologenomiks Research Group, Department of Genetics, Physical Anthropology and Animal Physiology, Faculty of Science and Technology, University of the Basque Country (UPV/EHU), Leioa, Spain
- Yeimy González-Giraldo
  - Departamento de Nutrición y Bioquímica, Facultad de Ciencias, Pontificia Universidad Javeriana, Bogotá, Colombia
- George P Patrinos
  - Laboratory of Pharmacogenomics and Individualized Therapy, Department of Pharmacy, School of Health Science, University of Patras, Patras, Greece
  - Clinical Bioinformatics Unit, Department of Pathology, School of Medicine and Health Sciences, Erasmus University Medical Center, Rotterdam, The Netherlands
  - Department of Genetics and Genomics, College of Medicine and Health Sciences, United Arab Emirates University, Al-Ain, Abu Dhabi, United Arab Emirates
  - Zayed Center for Health Sciences, United Arab Emirates University, Al-Ain, Abu Dhabi, United Arab Emirates
4
Sengupta P, Dutta S, Liew F, Samrot A, Dasgupta S, Rajput MA, Slama P, Kolesarova A, Roychoudhury S. Reproductomics: Exploring the Applications and Advancements of Computational Tools. Physiol Res 2024; 73:687-702. [PMID: 39530905 PMCID: PMC11629954 DOI: 10.33549/physiolres.935389] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/17/2024] [Accepted: 06/25/2024] [Indexed: 12/13/2024] Open
Abstract
Over recent decades, advancements in omics technologies, such as proteomics, genomics, epigenomics, metabolomics, transcriptomics, and microbiomics, have significantly enhanced our understanding of the molecular mechanisms underlying various physiological and pathological processes. Nonetheless, the analysis and interpretation of vast omics data concerning reproductive diseases are complicated by the cyclic regulation of hormones and multiple other factors, which, in conjunction with an individual's genetic makeup, lead to diverse biological responses. Reproductomics investigates the interplay between an individual's hormonal regulation, environmental factors, genetic predisposition (DNA composition and epigenome), health effects, and resulting biological outcomes. It is a rapidly emerging field that utilizes computational tools to analyze and interpret reproductive data, with the aim of improving reproductive health outcomes. It is time to explore the applications of reproductomics in understanding the molecular mechanisms underlying infertility, identifying potential biomarkers for diagnosis and treatment, and improving assisted reproductive technologies (ARTs). Reproductomics tools include machine learning algorithms for predicting fertility outcomes, gene editing technologies for correcting genetic abnormalities, and single cell sequencing techniques for analyzing gene expression patterns at the individual cell level. However, the use of reproductomics involves several challenges, limitations, and ethical issues, such as the application of gene editing technologies and their potential impact on future generations, which are also discussed. The review comprehensively covers the applications and advancements of reproductomics, highlighting its potential to improve reproductive health outcomes and deepen our understanding of reproductive molecular mechanisms.
Affiliation(s)
- P Sengupta
  - Department of Biomedical Sciences, College of Medicine, Gulf Medical University, Ajman, UAE; Department of Life Science and Bioinformatics, Assam University, Silchar, India
5
Spahiu E, Kastrati E, Amrute-Nayak M. PyChelator: a Python-based Colab and web application for metal chelator calculations. BMC Bioinformatics 2024; 25:239. [PMID: 39014298 PMCID: PMC11253343 DOI: 10.1186/s12859-024-05858-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/18/2024] [Accepted: 07/09/2024] [Indexed: 07/18/2024] Open
Abstract
BACKGROUND Metal ions play vital roles in regulating various biological systems, making it essential to control the concentration of free metal ions in solutions during experimental procedures. Several software applications exist for estimating the concentration of free metals in the presence of chelators, with MaxChelator being the easily accessible choice in this domain. This work aimed at developing a Python version of the software with arbitrary precision calculations, extensive new features, and a user-friendly interface to calculate the free metal ions. RESULTS We introduce the open-source PyChelator web application and the Python-based Google Colaboratory notebook, PyChelator Colab. Key features aim to improve the user experience of metal chelator calculations, including input in smaller units, selection among stability constants, input of user-defined constants, and convenient download of all results in Excel format. These features were implemented in the Python language by employing Google Colab, facilitating the incorporation of the calculator into other Python-based pipelines and inviting contributions from the community of Python-using scientists for further enhancements. Arbitrary-precision arithmetic was employed by using the built-in Decimal module to obtain the most accurate results and to avoid rounding errors. No notable differences were observed compared to the results obtained from the PyChelator web application. However, comparison of different sources of stability constants showed substantial differences among them. CONCLUSIONS PyChelator is a user-friendly metal and chelator calculator that provides a platform for further development. It is provided as an interactive web application, freely available for use at https://amrutelab.github.io/PyChelator, and as a Python-based Google Colaboratory notebook at https://colab.research.google.com/github/AmruteLab/PyChelator/blob/main/PyChelator_Colab.ipynb.
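To make the arbitrary-precision idea in this abstract concrete, the following is a minimal sketch of a free-metal calculation with Python's built-in `decimal` module. It is not PyChelator's code: it assumes the simplest case of one metal and one chelator binding 1:1 with a single dissociation constant, and the function name `free_metal`, its parameters, and the example concentrations are all hypothetical.

```python
from decimal import Decimal, localcontext


def free_metal(total_metal, total_chelator, kd, digits=50):
    """Free metal concentration for 1:1 binding M + L <-> ML (all in mol/L).

    Hypothetical helper, not part of PyChelator. Inputs are strings so that
    Decimal parses them exactly instead of inheriting binary float error.
    """
    with localcontext() as ctx:
        ctx.prec = digits  # significant digits used for every operation below
        mt, lt, k = Decimal(total_metal), Decimal(total_chelator), Decimal(kd)
        # Mass balance plus Kd = [M][L]/[ML] gives a quadratic in [ML]:
        #   [ML]^2 - (Mt + Lt + Kd)[ML] + Mt*Lt = 0
        b = mt + lt + k
        # The smaller root is the physical one (it cannot exceed Mt or Lt).
        bound = (b - (b * b - 4 * mt * lt).sqrt()) / 2
        return mt - bound


# Illustrative values only: 1 mM metal, 2 mM chelator, Kd = 0.1 uM.
free = free_metal("1e-3", "2e-3", "1e-7")
print(free)
```

The subtraction `mt - bound` is where ordinary floating point tends to lose digits when nearly all of the metal is bound, which is one reason a higher working precision can matter for this kind of calculation.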
Affiliation(s)
- Emrulla Spahiu
  - Institute of Molecular and Cell Physiology, Hannover Medical School, Carl-Neuberg-Str. 1, 30625, Hannover, Germany
- Esra Kastrati
  - Lassonde School of Engineering, York University, Toronto, M3J 1P3, Canada
- Mamta Amrute-Nayak
  - Institute of Molecular and Cell Physiology, Hannover Medical School, Carl-Neuberg-Str. 1, 30625, Hannover, Germany
6
Koniaris D, Suciu C, Nica S. Flight to Recovery: Impact of a Rooftop Helipad Air Ambulance Service at the Emergency University Hospital of Bucharest - A Caseload Analysis of the First 3 Years After Its Implementation. Air Med J 2024; 43:321-327. [PMID: 38897695 DOI: 10.1016/j.amj.2024.03.002] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/14/2024] [Revised: 03/03/2024] [Accepted: 03/07/2024] [Indexed: 06/21/2024]
Abstract
OBJECTIVE This observational study provides an overview of the implementation and impact of the helipad at the Bucharest Emergency University Hospital, Romania. The helipad, established in April 2019, is the only rooftop medical helipad in Bucharest authorized for day and night flights. Its influence extends beyond the local region, enabling the hospital to receive patients from various cities across Romania. The helipad has particularly strengthened the hospital's capabilities in cardiology, neurovascular emergencies, and neonatal care. Patients with acute myocardial infarctions or strokes can now be swiftly transported to the hospital for immediate intervention, whereas critically ill newborns can receive specialized care at the earliest stages of their lives. The objective of this article was to present a comprehensive timeline of the helipad's implementation and to demonstrate its transformative role in improving patient transportation, enhancing medical interventions, and elevating the overall efficiency of the health care facility. METHODS The study is a retrospective regional caseload analysis based on data gathered from the Emergency Department of the University Emergency Hospital of Bucharest database. We included all 215 air transfer missions registered between December 2019 and December 2022, exactly 3 years apart from the beginning of the program. RESULTS The findings provide valuable insights into patient demographics, case distribution, and trends, highlighting the importance of specialized medical interventions at the University Emergency Hospital of Bucharest. In particular, the mean age of patients treated at the hospital was 55.9 years, with a higher representation of males (156) than females (59). The average duration of hospitalization was 10.68 days. The study also examined transportation statistics, showing a decrease in the average number of transports per month over the years. Cardiologic cases accounted for the highest frequency (62.8%) among the analyzed categories, followed by neurosurgery (8.8%) and neurologic cases (8.4%). CONCLUSION The analysis provides important insights into patient demographics, case distribution, and trends. The findings highlight the significance of specialized medical interventions, particularly in cardiology and neurosurgery, which accounted for the majority of the cases. The implementation of the helipad has greatly improved patient transportation and facilitated timely medical assistance.
Affiliation(s)
- Constantin Suciu
  - University of Medicine and Pharmacy Carol Davila, Bucharest, Romania; Department of Emergency Medicine, Emergency University Hospital of Bucharest, Bucharest, Romania
- Silvia Nica
  - University of Medicine and Pharmacy Carol Davila, Bucharest, Romania; Department of Emergency Medicine, Emergency University Hospital of Bucharest, Bucharest, Romania
7
Rahman CR, Wong L. How much can ChatGPT really help computational biologists in programming? J Bioinform Comput Biol 2024; 22:2471001. [PMID: 38779779 DOI: 10.1142/s021972002471001x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/25/2024]
Abstract
ChatGPT, a recently developed product by OpenAI, is successfully leaving its mark as a multi-purpose natural-language-based chatbot. In this paper, we analyze its potential in the field of computational biology. A major share of the work done by computational biologists these days involves coding up bioinformatics algorithms, analyzing data, creating pipelining scripts, and even machine learning modeling and feature extraction. This paper focuses on the potential influence (both positive and negative) of ChatGPT on these aspects, with illustrative examples from different perspectives. Compared to other fields of computer science, computational biology has (1) fewer coding resources, (2) more sensitivity and bias issues (it deals with medical data), and (3) a greater need for coding assistance (people from diverse backgrounds come to this field). Keeping such issues in mind, we cover use cases such as code writing, reviewing, debugging, converting, refactoring, and pipelining using ChatGPT from the perspective of computational biologists.
Affiliation(s)
- Limsoon Wong
  - School of Computing, National University of Singapore, 13 Computing Drive, Singapore 117417
8
Zhang S, Li H, Jing Q, Shen W, Luo W, Dai R. Anesthesia decision analysis using a cloud-based big data platform. Eur J Med Res 2024; 29:201. [PMID: 38528564 DOI: 10.1186/s40001-024-01764-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/26/2023] [Accepted: 03/01/2024] [Indexed: 03/27/2024] Open
Abstract
Big data technologies have proliferated since the dawn of the cloud-computing era. Traditional data storage, extraction, transformation, and analysis technologies have thus become unsuitable for the large volume, diversity, high processing speed, and low value density of big data in medical strategies, which require the development of novel big data application technologies. In this regard, we investigated the most recent big data platform breakthroughs in anesthesiology and designed an anesthesia decision model based on a cloud system for storing and analyzing massive amounts of data from anesthetic records. The presented Anesthesia Decision Analysis Platform performs distributed computing on medical records via several programming tools, and provides services such as keyword search, data filtering, and basic statistics to reduce inaccurate and subjective judgments by decision-makers. Importantly, it can potentially improve anesthetic strategy and create individualized anesthesia decisions, lowering the likelihood of perioperative complications.
Affiliation(s)
- Shuiting Zhang
  - Department of Anesthesiology, The Second Xiangya Hospital, Central South University, Changsha, 410008, Hunan, China
  - Anesthesia Medical Research Center, Central South University, Changsha, 410008, Hunan, China
- Hui Li
  - Department of Anesthesiology, The Second Xiangya Hospital, Central South University, Changsha, 410008, Hunan, China
  - Anesthesia Medical Research Center, Central South University, Changsha, 410008, Hunan, China
- Qiancheng Jing
  - Department of Otolaryngology Head and Neck Surgery, Hengyang Medical School, The Affiliated Changsha Central Hospital, University of South China, Changsha, 410000, Hunan, China
- Weiyun Shen
  - Department of Anesthesiology, The Second Xiangya Hospital, Central South University, Changsha, 410008, Hunan, China
  - Anesthesia Medical Research Center, Central South University, Changsha, 410008, Hunan, China
- Wei Luo
  - Department of Anesthesiology, The Second Xiangya Hospital, Central South University, Changsha, 410008, Hunan, China
  - Anesthesia Medical Research Center, Central South University, Changsha, 410008, Hunan, China
- Ruping Dai
  - Department of Anesthesiology, The Second Xiangya Hospital, Central South University, Changsha, 410008, Hunan, China
  - Anesthesia Medical Research Center, Central South University, Changsha, 410008, Hunan, China
9
Maurer JJ, Cheng Y, Pedroso A, Thompson KK, Akter S, Kwan T, Morota G, Kinstler S, Porwollik S, McClelland M, Escalante-Semerena JC, Lee MD. Peeling back the many layers of competitive exclusion. Front Microbiol 2024; 15:1342887. [PMID: 38591029 PMCID: PMC11000858 DOI: 10.3389/fmicb.2024.1342887] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/22/2023] [Accepted: 02/19/2024] [Indexed: 04/10/2024] Open
Abstract
Baby chicks administered a fecal transplant from adult chickens are resistant to Salmonella colonization by competitive exclusion. A two-pronged approach was used to investigate the mechanism of this process. First, Salmonella response to an exclusive (Salmonella competitive exclusion product, Aviguard®) or permissive microbial community (chicken cecal contents from colonized birds containing 7.85 Log10 Salmonella genomes/gram) was assessed ex vivo using a S. typhimurium reporter strain with fluorescent YFP and CFP gene fusions to rrn and hilA operon, respectively. Second, cecal transcriptome analysis was used to assess the cecal communities' response to Salmonella in chickens with low (≤5.85 Log10 genomes/g) or high (≥6.00 Log10 genomes/g) Salmonella colonization. The ex vivo experiment revealed a reduction in Salmonella growth and hilA expression following co-culture with the exclusive community. The exclusive community also repressed Salmonella's SPI-1 virulence genes and LPS modification, while the anti-virulence/inflammatory gene avrA was upregulated. Salmonella transcriptome analysis revealed significant metabolic disparities in Salmonella grown with the two different communities. Propanediol utilization and vitamin B12 synthesis were central to Salmonella metabolism co-cultured with either community, and mutations in propanediol and vitamin B12 metabolism altered Salmonella growth in the exclusive community. There were significant differences in the cecal community's stress response to Salmonella colonization. Cecal community transcripts indicated that antimicrobials were central to the type of stress response detected in the low Salmonella abundance community, suggesting antagonism involved in Salmonella exclusion. This study indicates complex community interactions that modulate Salmonella metabolism and pathogenic behavior and reduce growth through antagonism may be key to exclusion.
Affiliation(s)
- John J. Maurer
  - School of Animal Sciences, College of Veterinary Medicine, Virginia Polytechnic Institute and State University, Blacksburg, VA, United States
- Ying Cheng
  - Department of Population Health, University of Georgia, Athens, GA, United States
- Adriana Pedroso
  - Department of Population Health, University of Georgia, Athens, GA, United States
- Kasey K. Thompson
  - Department of Population Health, University of Georgia, Athens, GA, United States
- Shamima Akter
  - Department of Biomedical Sciences and Pathobiology, College of Veterinary Medicine, Virginia Polytechnic Institute and State University, Blacksburg, VA, United States
- Tiffany Kwan
  - Department of Population Health, University of Georgia, Athens, GA, United States
- Gota Morota
  - School of Animal Sciences, College of Veterinary Medicine, Virginia Polytechnic Institute and State University, Blacksburg, VA, United States
- Sydney Kinstler
  - School of Animal Sciences, College of Veterinary Medicine, Virginia Polytechnic Institute and State University, Blacksburg, VA, United States
- Steffen Porwollik
  - Department of Microbiology and Molecular Genetics, University of California, Irvine, Irvine, CA, United States
- Michael McClelland
  - Department of Microbiology and Molecular Genetics, University of California, Irvine, Irvine, CA, United States
- Margie D. Lee
  - Department of Biomedical Sciences and Pathobiology, College of Veterinary Medicine, Virginia Polytechnic Institute and State University, Blacksburg, VA, United States
10
Mullie L, Afilalo J, Archambault P, Bouchakri R, Brown K, Buckeridge DL, Cavayas YA, Turgeon AF, Martineau D, Lamontagne F, Lebrasseur M, Lemieux R, Li J, Sauthier M, St-Onge P, Tang A, Witteman W, Chassé M. CODA: an open-source platform for federated analysis and machine learning on distributed healthcare data. J Am Med Inform Assoc 2024; 31:651-665. [PMID: 38128123 PMCID: PMC10873779 DOI: 10.1093/jamia/ocad235] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 06/26/2023] [Revised: 10/28/2023] [Accepted: 12/02/2023] [Indexed: 12/23/2023] Open
Abstract
OBJECTIVES Distributed computations facilitate multi-institutional data analysis while avoiding the costs and complexity of data pooling. Existing approaches lack crucial features, such as built-in medical standards and terminologies, no-code data visualizations, explicit disclosure control mechanisms, and support for basic statistical computations, in addition to gradient-based optimization capabilities. MATERIALS AND METHODS We describe the development of the Collaborative Data Analysis (CODA) platform, and the design choices undertaken to address the key needs identified during our survey of stakeholders. We use a public dataset (MIMIC-IV) to demonstrate end-to-end multi-modal FL using CODA. We assessed the technical feasibility of deploying the CODA platform at 9 hospitals in Canada, describe implementation challenges, and evaluate its scalability on large patient populations. RESULTS The CODA platform was designed, developed, and deployed between January 2020 and January 2023. Software code, documentation, and technical documents were released under an open-source license. Multi-modal federated averaging is illustrated using the MIMIC-IV and MIMIC-CXR datasets. To date, 8 out of the 9 participating sites have successfully deployed the platform, with a total enrolment of >1M patients. Mapping data from legacy systems to FHIR was the biggest barrier to implementation. DISCUSSION AND CONCLUSION The CODA platform was developed and successfully deployed in a public healthcare setting in Canada, with heterogeneous information technology systems and capabilities. Ongoing efforts will use the platform to develop and prospectively validate models for risk assessment, proactive monitoring, and resource usage. Further work will also make tools available to facilitate migration from legacy formats to FHIR and DICOM.
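The federated averaging mentioned in this abstract can be illustrated with a short sketch of its aggregation step, in which a coordinator combines model parameters from several sites without seeing their patient records. This is a generic, plain-Python illustration and not CODA's API: the function name `federated_average`, the flat list-of-floats parameter format, and the example numbers are assumptions made for the example.

```python
def federated_average(site_params, site_sizes):
    """Weighted mean of per-site parameter vectors.

    site_params: one list of floats per site (locally trained parameters).
    site_sizes:  number of local training samples at each site.
    Hypothetical helper for illustration; only parameters and counts are
    shared, never the underlying records.
    """
    if len(site_params) != len(site_sizes):
        raise ValueError("need exactly one sample count per site")
    total = sum(site_sizes)
    n_params = len(site_params[0])
    # Each site's contribution is weighted by its share of the total samples.
    return [
        sum(params[i] * n for params, n in zip(site_params, site_sizes)) / total
        for i in range(n_params)
    ]


# Two hypothetical hospitals: the larger one pulls the average toward itself.
global_params = federated_average([[1.0, 2.0], [3.0, 4.0]], [100, 300])
print(global_params)  # [2.5, 3.5]
```

In a full training loop this step would alternate with local training rounds at each site; the sketch covers only the averaging.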
Affiliation(s)
- Louis Mullie
  - Department of Medicine, Centre Hospitalier de l'Université de Montréal, Montréal, H2X 3E4, Canada
  - Faculty of Medicine, Université de Montréal, Montréal, H3C 3J7, Canada
  - Mila Quebec Artificial Intelligence Institute, Montréal, H2S 3H1, Canada
- Jonathan Afilalo
  - Department of Medicine, Jewish General Hospital, Montréal, H3T 1E4, Canada
- Patrick Archambault
  - Department of Emergency Medicine and Family Medicine, Université Laval, Québec, G1V 0A6, Canada
  - Department of Anesthesiology and Critical Care Medicine, Université Laval, Québec, G1V 0A6, Canada
  - Centre de Recherche Intégré pour un Système Apprenant en santé et Services Sociaux, Centre intégré de santé et de Services Sociaux de Chaudière-Appalaches, Lévis, G6V 3Z1, Canada
- Rima Bouchakri
  - Centre de Recherche du Centre Hospitalier de l'Université de Montréal, Université de Montréal, Montréal, H2X 0A9, Canada
- Kip Brown
  - Centre de Recherche du Centre Hospitalier de l'Université de Montréal, Université de Montréal, Montréal, H2X 0A9, Canada
- David L Buckeridge
  - Mila Quebec Artificial Intelligence Institute, Montréal, H2S 3H1, Canada
  - Department of Epidemiology and Biostatistics, School of Population and Global Health, McGill University Health Centre, Montréal, H3A 1G1, Canada
- Alexis F Turgeon
  - Department of Anesthesiology and Critical Care Medicine, Université Laval, Québec, G1V 0A6, Canada
  - Centre de recherche du CHU de Québec-Université Laval, Université Laval, Québec, G1V 4G2, Canada
- Denis Martineau
  - Centre de recherche du CHU de Québec-Université Laval, Université Laval, Québec, G1V 4G2, Canada
- François Lamontagne
  - Centre de recherche du CHUS, Centre Hospitalier Universitaire de Sherbrooke, Sherbrooke, J1G 2E8, Canada
- Martine Lebrasseur
  - Centre de Recherche du Centre Hospitalier de l'Université de Montréal, Université de Montréal, Montréal, H2X 0A9, Canada
- Renald Lemieux
  - Centre de recherche du CHUS, Centre Hospitalier Universitaire de Sherbrooke, Sherbrooke, J1G 2E8, Canada
- Jeffrey Li
  - Centre de Recherche du Centre Hospitalier de l'Université de Montréal, Université de Montréal, Montréal, H2X 0A9, Canada
- Michaël Sauthier
  - Faculty of Medicine, Université de Montréal, Montréal, H3C 3J7, Canada
  - Department of Pediatrics, Université de Montréal and CHU Sainte-Justine Research Centre, Montréal, H3C 3J7, Canada
- Pascal St-Onge
  - Centre de Recherche du Centre Hospitalier de l'Université de Montréal, Université de Montréal, Montréal, H2X 0A9, Canada
- An Tang
  - Faculty of Medicine, Université de Montréal, Montréal, H3C 3J7, Canada
  - Department of Radiology, Centre Hospitalier de l'Université de Montréal, Montréal, H2X 3E4, Canada
- William Witteman
  - Centre de Recherche Intégré pour un Système Apprenant en santé et Services Sociaux, Centre intégré de santé et de Services Sociaux de Chaudière-Appalaches, Lévis, G6V 3Z1, Canada
- Michaël Chassé
  - Department of Medicine, Centre Hospitalier de l'Université de Montréal, Montréal, H2X 3E4, Canada
  - Faculty of Medicine, Université de Montréal, Montréal, H3C 3J7, Canada
11
Piccolo SR, Denny P, Luxton-Reilly A, Payne SH, Ridge PG. Evaluating a large language model's ability to solve programming exercises from an introductory bioinformatics course. PLoS Comput Biol 2023; 19:e1011511. [PMID: 37769024 PMCID: PMC10564134 DOI: 10.1371/journal.pcbi.1011511] [Citation(s) in RCA: 11] [Impact Index Per Article: 5.5] [Received: 04/02/2023] [Revised: 10/10/2023] [Accepted: 09/12/2023] [Indexed: 09/30/2023] Open
Abstract
Computer programming is a fundamental tool for life scientists, allowing them to carry out essential research tasks. However, despite various educational efforts, learning to write code can be a challenging endeavor for students and researchers in life-sciences disciplines. Recent advances in artificial intelligence have made it possible to translate human-language prompts to functional code, raising questions about whether these technologies can aid (or replace) life scientists' efforts to write code. Using 184 programming exercises from an introductory-bioinformatics course, we evaluated the extent to which one such tool, OpenAI's ChatGPT, could successfully complete programming tasks. ChatGPT solved 139 (75.5%) of the exercises on its first attempt. For the remaining exercises, we provided natural-language feedback to the model, prompting it to try different approaches. Within 7 or fewer attempts, ChatGPT solved 179 (97.3%) of the exercises. These findings have implications for life-sciences education and research. Instructors may need to adapt their pedagogical approaches and assessment techniques to account for these new capabilities that are available to the general public. For some programming tasks, researchers may be able to work in collaboration with machine-learning models to produce functional code.
Affiliation(s)
- Stephen R. Piccolo
- Department of Biology, Brigham Young University, Provo, Utah, United States of America
| | - Paul Denny
- School of Computer Science, The University of Auckland, Auckland, New Zealand
| | | | - Samuel H. Payne
- Department of Biology, Brigham Young University, Provo, Utah, United States of America
| | - Perry G. Ridge
- Department of Biology, Brigham Young University, Provo, Utah, United States of America
| |
|
12
|
Adenaike O, Olabanjo OE, Adedeji AA. Integrating computational skills in undergraduate Microbiology curricula in developing countries. Biol Methods Protoc 2023; 8:bpad008. [PMID: 37396465 PMCID: PMC10310463 DOI: 10.1093/biomethods/bpad008] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/08/2023] [Revised: 05/19/2023] [Accepted: 05/21/2023] [Indexed: 07/04/2023] Open
Abstract
The employability of young graduates has gained increasing significance in the labour market of the 21st century. Universities turn out millions of graduates annually, but at the same time, employers highlight their lack of the requisite skills for sustainable employment. We live today in a world of data, and therefore courses that feature numerical and computational tools to gather and analyse data are to be sourced for and integrated into life sciences' curricula as they provide a number of benefits for both the students and faculty members that are engaged in teaching the courses. The lack of this teaching in undergraduate Microbiology curricula is devastating and leaves a knowledge gap in the graduates that are turned out. This results in an inability of the emerging graduates to compete favourably with their counterparts from other parts of the world. There is a necessity on the part of life science educators to adapt their teaching strategies to best support students' curricula that prepare them for careers in science. Bioinformatics, Statistics and Programming are key computational skills to embrace by life scientists and the need for training beginning at undergraduate level cannot be overemphasized. This article reviews the need to integrate computational skills in undergraduate Microbiology curricula in developing countries with emphasis on Nigeria.
Affiliation(s)
- Omolara Adenaike
- Correspondence address. Department of Biological Sciences (Microbiology Unit), Oduduwa University, Ipetumodu, Nigeria. Tel: +2348061278100; E-mail:
| | | | | |
|
13
|
Zhang P, Wang M, Zhou T, Chen D. SeqWiz: a modularized toolkit for next-generation protein sequence database management and analysis. BMC Bioinformatics 2023; 24:201. [PMID: 37194023 DOI: 10.1186/s12859-023-05334-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/07/2021] [Accepted: 05/11/2023] [Indexed: 05/18/2023] Open
Abstract
BACKGROUND: Current proteomic technologies are fast-evolving to uncover the complex features of sequence processes, variations and modifications. Thus, protein sequence databases and the corresponding software should also be improved to solve this issue. RESULTS: We developed a state-of-the-art toolkit (SeqWiz) for constructing next-generation sequence databases and performing proteomic-centric sequence analyses. First, we proposed two derived data formats: SQPD (a well-structured and high-performance local sequence database based on SQLite), and SET (an associated list of selected entries based on JSON). The SQPD format follows the basic standards of the emerging PEFF format, which also aims to facilitate the search of complex proteoforms. The SET format is designed for generating subsets with high efficiency. These formats are shown to greatly outperform the conventional FASTA or PEFF formats in time and resource consumption. Then, we mainly focused on the UniProt knowledgebase and developed a collection of open-source tools and basic modules for retrieving species-specific databases, format conversion, sequence generation, sequence filtering, and sequence analysis. These tools are implemented in the Python language and licensed under the GNU General Public Licence V3. The source code and distributions are freely available at GitHub ( https://github.com/fountao/protwiz/tree/main/seqwiz ). CONCLUSIONS: SeqWiz is designed to be a collection of modularized tools, which is friendly to both end-users for preparing easy-to-use sequence databases as well as bioinformaticians for performing downstream sequence analysis. Besides the novel formats, it also provides compatible functions for handling the traditional text-based FASTA or PEFF formats. We believe that SeqWiz will promote the implementation of complementary proteomics for data renewal and proteoform analysis to achieve precision proteomics. Additionally, it can also drive the improvement of proteomic standardization and the development of next-generation proteomic software.
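The idea of keeping sequences in a local SQLite database rather than a flat FASTA file can be illustrated with a minimal sketch. It uses only the Python standard library and does not reproduce the SQPD schema or any SeqWiz API: the table layout, the function names `parse_fasta` and `build_database`, and the toy records are all assumptions made for illustration.

```python
import sqlite3

# Hypothetical toy records; real UniProt entries carry far more annotation.
FASTA_TEXT = """>P1 toy protein one
MKTAYIAKQR
QISFVKSHFS
>P2 toy protein two
MADEEKLPPG
"""

def parse_fasta(text):
    """Yield (identifier, description, sequence) tuples from FASTA text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield (header[0], header[1], "".join(chunks))
            ident, _, desc = line[1:].partition(" ")
            header, chunks = (ident, desc), []
        else:
            chunks.append(line)
    if header is not None:
        yield (header[0], header[1], "".join(chunks))

def build_database(records):
    """Load records into an in-memory SQLite table keyed by identifier."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE entries (id TEXT PRIMARY KEY, description TEXT, sequence TEXT)"
    )
    conn.executemany("INSERT INTO entries VALUES (?, ?, ?)", records)
    conn.commit()
    return conn

conn = build_database(parse_fasta(FASTA_TEXT))
# A keyed lookup uses the primary-key index instead of scanning a whole file.
row = conn.execute("SELECT sequence FROM entries WHERE id = ?", ("P2",)).fetchone()
print(row[0])
```

The design point is that a keyed lookup avoids rereading and reparsing the flat file for every query, which is the kind of saving the abstract attributes to a database-backed format.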
Affiliation(s)
- Ping Zhang
- Research Institute for Reproductive Medicine and Genetic Diseases, The Affiliated Wuxi Maternity and Child Health Care Hospital of Nanjing Medical University, Wuxi, 214002, China
| | - Min Wang
- Research Institute for Reproductive Medicine and Genetic Diseases, The Affiliated Wuxi Maternity and Child Health Care Hospital of Nanjing Medical University, Wuxi, 214002, China
| | - Tao Zhou
- Research Institute for Reproductive Medicine and Genetic Diseases, The Affiliated Wuxi Maternity and Child Health Care Hospital of Nanjing Medical University, Wuxi, 214002, China.
- Wuxi Maternity and Child Health Care Hospital, Wuxi School of Medicine, Jiangnan University, Wuxi, China.
| | - Daozhen Chen
- Research Institute for Reproductive Medicine and Genetic Diseases, The Affiliated Wuxi Maternity and Child Health Care Hospital of Nanjing Medical University, Wuxi, 214002, China.
| |
|
14
|
Roesch E, Greener JG, MacLean AL, Nassar H, Rackauckas C, Holy TE, Stumpf MPH. Julia for biologists. Nat Methods 2023; 20:655-664. [PMID: 37024649 PMCID: PMC10216852 DOI: 10.1038/s41592-023-01832-z] [Citation(s) in RCA: 17] [Impact Index Per Article: 8.5] [Received: 09/21/2021] [Accepted: 02/27/2023] [Indexed: 04/08/2023]
Abstract
Major computational challenges exist in relation to the collection, curation, processing and analysis of large genomic and imaging datasets, as well as the simulation of larger and more realistic models in systems biology. Here we discuss how a relative newcomer among programming languages, Julia, is poised to meet the current and emerging demands in the computational biosciences and beyond. Speed, flexibility, a thriving package ecosystem and readability are major factors that make high-performance computing and data analysis available to an unprecedented degree. We highlight how Julia's design is already enabling new ways of analyzing biological data and systems, and we provide a list of resources that can facilitate the transition into Julian computing.
Affiliation(s)
- Elisabeth Roesch
- School of Mathematics and Statistics, University of Melbourne, Melbourne, Victoria, Australia
- Melbourne Integrative Genomics, University of Melbourne, Melbourne, Victoria, Australia
- JuliaHub, Somerville, MA, USA
| | - Joe G Greener
- Medical Research Council Laboratory of Molecular Biology, Cambridge, UK
| | - Adam L MacLean
- Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA
| | | | - Christopher Rackauckas
- JuliaHub, Somerville, MA, USA
- Department of Mathematics, Massachusetts Institute of Technology, Cambridge, MA, USA
- Pumas-AI, Centreville, VA, USA
| | - Timothy E Holy
- Departments of Neuroscience and Biomedical Engineering, Washington University in St. Louis, St. Louis, MO, USA
| | - Michael P H Stumpf
- School of Mathematics and Statistics, University of Melbourne, Melbourne, Victoria, Australia.
- Melbourne Integrative Genomics, University of Melbourne, Melbourne, Victoria, Australia.
- School of BioSciences, The University of Melbourne, Melbourne, Victoria, Australia.
- ARC Centre of Excellence for the Mathematical Analysis of Cellular Systems, Melbourne, Victoria, Australia.
| |
|
15
|
Rather MA, Agarwal D, Bhat TA, Khan IA, Zafar I, Kumar S, Amin A, Sundaray JK, Qadri T. Bioinformatics approaches and big data analytics opportunities in improving fisheries and aquaculture. Int J Biol Macromol 2023; 233:123549. [PMID: 36740117 DOI: 10.1016/j.ijbiomac.2023.123549] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 07/15/2022] [Revised: 01/30/2023] [Accepted: 01/31/2023] [Indexed: 02/05/2023]
Abstract
Aquaculture has witnessed an excellent growth rate during the last two decades and offers huge potential to provide nutritional as well as livelihood security. Genomic research has contributed significantly toward the development of beneficial technologies for aquaculture. Existing high-throughput technologies, such as next-generation sequencing, generate oceanic data that require extensive analysis using appropriate tools. Bioinformatics is a rapidly evolving science that involves integrating gene-based information and computational technology to produce new knowledge for the benefit of aquaculture. Bioinformatics provides new opportunities as well as challenges for information and data processing in new-generation aquaculture. Rapid technical advancements have opened up a world of possibilities for using current genomics to improve aquaculture performance. Understanding the genes that govern economically relevant characteristics necessitates a significant amount of additional research. The various dimensions of data sources include next-generation DNA sequencing, protein sequencing, RNA sequencing, gene expression profiles, metabolic pathways, molecular markers, and so on. Appropriate bioinformatics tools are developed to mine the biologically relevant and commercially useful results. The purpose of this scoping review is to present various arms of diverse bioinformatics tools with special emphasis on practical translation to the aquaculture industry.
Affiliation(s)
- Mohd Ashraf Rather
- Division of Fish Genetics and Biotechnology, Faculty of Fisheries Ganderbal, Sher-e-Kashmir University of Agricultural Science and Technology, Kashmir, India.
| | - Deepak Agarwal
- Institute of Fisheries Post Graduation Studies OMR Campus, Vaniyanchavadi, Chennai, India
| | | | - Irfan Ahamd Khan
- Division of Fish Genetics and Biotechnology, Faculty of Fisheries Ganderbal, Sher-e-Kashmir University of Agricultural Science and Technology, Kashmir, India
| | - Imran Zafar
- Department of Bioinformatics and Computational Biology, Virtual University Punjab, Pakistan
| | - Sujit Kumar
- Department of Bioinformatics and Computational Biology, Virtual University Punjab, Pakistan
| | - Adnan Amin
- Postgraduate Institute of Fisheries Education and Research Kamdhenu University, Gandhinagar-India University of Kurasthra, India; Department of Aquatic Environmental Management, Faculty of Fisheries Rangil- Ganderbel -SKUAST-K, India
| | - Jitendra Kumar Sundaray
- ICAR-Central Institute of Freshwater Aquaculture, Kausalyaganga, Bhubaneswar, Odisha 751002, India
| | - Tahiya Qadri
- Division of Food Science and Technology, SKUAST-K, Shalimar, India
| |
|
16
|
Prediction and Modeling of Protein–Protein Interactions Using “Spotted” Peptides with a Template-Based Approach. Biomolecules 2022; 12:biom12020201. [PMID: 35204702 PMCID: PMC8961654 DOI: 10.3390/biom12020201] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 12/29/2021] [Revised: 01/20/2022] [Accepted: 01/22/2022] [Indexed: 12/10/2022] Open
Abstract
Protein–peptide interactions (PpIs) are a subset of the overall protein–protein interaction (PPI) network in the living cell and are pivotal for the majority of cell processes and functions. High-throughput methods to detect PpIs and PPIs usually require time and costs that are not always affordable. Therefore, reliable in silico predictions represent a valid and effective alternative. In this work, a new algorithm is described, implemented in a freely available tool, i.e., “PepThreader”, to carry out PPI and PpI prediction and analysis. PepThreader threads multiple fragments derived from a full-length protein sequence (or from a peptide library) onto a second template peptide, in complex with a protein target, “spotting” the potential binding peptides and ranking them according to a sequence-based and structure-based threading score. The threading algorithm first makes use of a scoring function that is based on peptide sequence similarity. Then, the initial hits are reranked according to structure-based scoring functions. PepThreader has been benchmarked on a dataset of 292 protein–peptide complexes that were collected from existing databases of experimentally determined protein–peptide interactions. An accuracy of 80% was achieved when considering the top 25 predicted hits, which is comparable with other state-of-the-art tools in PPI and PpI modeling. Nonetheless, PepThreader is unique in that it is able at the same time to spot a binding peptide within a full-length sequence involved in a PPI and model its structure within the receptor. Therefore, PepThreader adds to the already-available tools supporting experimental PPI and PpI identification and characterization.
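The sequence-based first pass described in the abstract (cutting a full-length sequence into fragments and ranking them against a template peptide) can be sketched as follows. This is not PepThreader's scoring function: plain residue identity stands in for the unspecified sequence-similarity score, the structure-based rerank is omitted, and the function names and both sequences are hypothetical.

```python
def fragments(sequence, length):
    """Yield (start, fragment) for every window of the given length."""
    for start in range(len(sequence) - length + 1):
        yield start, sequence[start:start + length]

def identity_score(fragment, template):
    """Count positions where fragment and template share the same residue."""
    return sum(a == b for a, b in zip(fragment, template))

def rank_fragments(sequence, template, top_n=3):
    """Return the top_n (score, start, fragment) hits, best first."""
    hits = [
        (identity_score(frag, template), start, frag)
        for start, frag in fragments(sequence, len(template))
    ]
    # Sort by descending score, then by ascending start for a stable order.
    hits.sort(key=lambda hit: (-hit[0], hit[1]))
    return hits[:top_n]

# Hypothetical inputs, chosen only to make the ranking easy to follow.
protein = "MKPLSRPPLPAKTDE"
template = "RPPLPPK"
top_hits = rank_fragments(protein, template)
print(top_hits[0])  # best-scoring window: (score, start, fragment)
```

A real scoring scheme would normally use a substitution matrix rather than identity, so that conservative replacements are not penalised as heavily as unrelated residues.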
|
17
|
Prasai R, Schwertner TW, Mainali K, Mathewson H, Kafley H, Thapa S, Adhikari D, Medley P, Drake J. Application of Google Earth Engine Python API and NAIP imagery for land use and land cover classification: A case study in Florida, USA. Ecol Inform 2021. [DOI: 10.1016/j.ecoinf.2021.101474] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Indexed: 11/30/2022]
|
18
|
Zuvanov L, Basso Garcia AL, Correr FH, Bizarria R, Filho APDC, da Costa AH, Thomaz AT, Pinheiro ALM, Riaño-Pachón DM, Winck FV, Esteves FG, Margarido GRA, Casagrande GMS, Frajacomo HC, Martins L, Cavalheiro MF, Grachet NG, da Silva RGC, Cerri R, Ramos RTJ, de Medeiros SDS, Tavares TV, Corrêa dos Santos RA. The experience of teaching introductory programming skills to bioscientists in Brazil. PLoS Comput Biol 2021; 17:e1009534. [PMID: 34762646 PMCID: PMC8584955 DOI: 10.1371/journal.pcbi.1009534] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Indexed: 11/19/2022] Open
Abstract
Computational biology has gained traction as an independent scientific discipline over the last years in South America. However, there is still a growing need for bioscientists, from different backgrounds, with different levels, to acquire programming skills, which could reduce the time from data to insights and bridge communication between life scientists and computer scientists. Python is a programming language extensively used in bioinformatics and data science, which is particularly suitable for beginners. Here, we describe the conception, organization, and implementation of the Brazilian Python Workshop for Biological Data. This workshop has been organized by graduate and undergraduate students and supported, mostly in administrative matters, by experienced faculty members since 2017. The workshop was conceived for teaching bioscientists, mainly students in Brazil, on how to program in a biological context. The goal of this article was to share our experience with the 2020 edition of the workshop in its virtual format due to the Coronavirus Disease 2019 (COVID-19) pandemic and to compare and contrast this year's experience with the previous in-person editions. We described a hands-on and live coding workshop model for teaching introductory Python programming. We also highlighted the adaptations made from in-person to online format in 2020, the participants' assessment of learning progression, and general workshop management. Lastly, we provided a summary and reflections from our personal experiences from the workshops of the last 4 years. Our takeaways included the benefits of the learning from learners' feedback (LLF) that allowed us to improve the workshop in real time, in the short, and likely in the long term. We concluded that the Brazilian Python Workshop for Biological Data is a highly effective workshop model for teaching a programming language that allows bioscientists to go beyond an initial exploration of programming skills for data analysis in the medium to long term.
Affiliation(s)
- Luíza Zuvanov
- São Carlos Institute of Physics, University of São Paulo, São Carlos, Brazil
| | - Ana Letycia Basso Garcia
- Department of Genetics, Luiz de Queiroz College of Agriculture, University of São Paulo, Piracicaba, Brazil
| | - Fernando Henrique Correr
- Department of Genetics, Luiz de Queiroz College of Agriculture, University of São Paulo, Piracicaba, Brazil
| | - Rodolfo Bizarria
- Department of General and Applied Biology, São Paulo State University, Rio Claro, Brazil
- Center of the Study of Social Insects, Department of General and Applied Biology, Institute of Biosciences of Rio Claro, São Paulo State University, Rio Claro, Brazil
| | | | | | - Andréa T. Thomaz
- School of Natural Sciences, Universidad del Rosario, Bogotá, Colombia
| | - Ana Lucia Mendes Pinheiro
- Department of Genetics, Luiz de Queiroz College of Agriculture, University of São Paulo, Piracicaba, Brazil
| | - Diego Mauricio Riaño-Pachón
- Computational, Evolutionary and Systems Biology Lab, Center for Nuclear Energy in Agriculture, University of São Paulo, Piracicaba, Brazil
| | - Flavia Vischi Winck
- Regulatory Systems Biology Lab, Center for Nuclear Energy in Agriculture, University of São Paulo, Piracicaba, Brazil
| | - Franciele Grego Esteves
- Center of the Study of Social Insects, Department of General and Applied Biology, Institute of Biosciences of Rio Claro, São Paulo State University, Rio Claro, Brazil
| | | | | | | | - Leonardo Martins
- Paulista School of Medicine, Federal University of São Paulo, São Paulo, Brazil
| | - Mariana Feitosa Cavalheiro
- Department of Genetics, Evolution, Microbiology and Immunology, Institute of Biology, University of Campinas, Campinas, Brazil
- Genomics for Climate Change Research Center, University of Campinas, Campinas, Brazil
| | | | - Raniere Gaia Costa da Silva
- Department of Infectious Diseases and Public Health, Jockey Club College of Veterinary Medicine and Life Sciences, City University of Hong Kong, Hong Kong, Special Administrative Region, People’s Republic of China
| | - Ricardo Cerri
- Department of Computer Science, Federal University of São Carlos, São Carlos, Brazil
| | | | | | - Thayana Vieira Tavares
- Department of Genetics and Evolution, Federal University of São Carlos, São Carlos, Brazil
| | - Renato Augusto Corrêa dos Santos
- School of Pharmaceutical Sciences of Ribeirao Preto, University of São Paulo, Ribeirão Preto, Brazil
- Institute of Biology, State University of Campinas, Campinas, Brazil
- * E-mail:
| |
|
19
|
Allbee Q, Barber R. Writing python programs to map alleles related to genetic disease. Biochem Mol Biol Educ 2021; 49:677-678. [PMID: 33991167 DOI: 10.1002/bmb.21528] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/06/2020] [Accepted: 05/06/2021] [Indexed: 06/12/2023]
Abstract
Biology is a data-driven discipline facilitated greatly by computer programming skills. This article describes an introductory experiential programming activity that can be integrated into distance learning environments. Students are asked to develop their own Python programs to identify the nature of alleles linked to disease. This activity effectively engages students in a problem solving exercise that provides an opportunity for application of basic programming skills as well as understanding eukaryotic gene structure. We provide sets of mapped alleles for two well-known genes, CFTR and HFE, as well as a suite of relevant Python programs to achieve these outcomes or allow subsequent exercise modifications.
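An exercise of this kind might reduce to asking where a variant falls relative to a gene's exon structure. The sketch below is only an illustration of that idea and is not taken from the article's program suite: the exon coordinates, variant names and positions are invented and are not real CFTR or HFE data.

```python
# Hypothetical exon coordinates (1-based, inclusive) for an imaginary gene;
# these are NOT real CFTR or HFE coordinates.
EXONS = [(101, 250), (401, 520), (901, 1100)]

def classify_position(position, exons):
    """Label a position as upstream, downstream, a numbered exon, or intron."""
    if position < exons[0][0]:
        return "upstream"
    if position > exons[-1][1]:
        return "downstream"
    for number, (start, end) in enumerate(exons, start=1):
        if start <= position <= end:
            return f"exon {number}"
    # Inside the gene span but outside every exon.
    return "intron"

# Hypothetical allele positions to classify.
alleles = {"variant_a": 150, "variant_b": 300, "variant_c": 1000, "variant_d": 50}
labels = {name: classify_position(pos, EXONS) for name, pos in alleles.items()}
for name, label in labels.items():
    print(name, label)
```

The function assumes the exon list is sorted and non-overlapping, which a student version would need to check before trusting the labels.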
Affiliation(s)
- Quinn Allbee
- University of Wisconsin-Parkside, Kenosha, Wisconsin, USA
| | - Robert Barber
- University of Wisconsin-Parkside, Kenosha, Wisconsin, USA
| |
|
20
|
Abstract
Cell imaging has entered the 'Big Data' era. New technologies in light microscopy and molecular biology have led to an explosion in high-content, dynamic and multidimensional imaging data. Similar to the 'omics' fields two decades ago, our current ability to process, visualize, integrate and mine this new generation of cell imaging data is becoming a critical bottleneck in advancing cell biology. Computation, traditionally used to quantitatively test specific hypotheses, must now also enable iterative hypothesis generation and testing by deciphering hidden biologically meaningful patterns in complex, dynamic or high-dimensional cell image data. Data science is uniquely positioned to aid in this process. In this Perspective, we survey the rapidly expanding new field of data science in cell imaging. Specifically, we highlight how data science tools are used within current image analysis pipelines, propose a computation-first approach to derive new hypotheses from cell image data, identify challenges and describe the next frontiers where we believe data science will make an impact. We also outline steps to ensure broad access to these powerful tools - democratizing infrastructure availability, developing sensitive, robust and usable tools, and promoting interdisciplinary training to both familiarize biologists with data science and expose data scientists to cell imaging.
Affiliation(s)
- Meghan K Driscoll
- Department of Bioinformatics, UT Southwestern Medical Center, Dallas, TX 75390, USA
| | - Assaf Zaritsky
- Department of Software and Information Systems Engineering, Ben-Gurion University of the Negev, Beer-Sheva 84105, Israel
| |
|
21
|
Elghafari A, Finkelstein J. Automated Identification of Common Disease-Specific Outcomes for Comparative Effectiveness Research Using ClinicalTrials.gov: Algorithm Development and Validation Study. JMIR Med Inform 2021; 9:e18298. [PMID: 33460388 PMCID: PMC7899806 DOI: 10.2196/18298] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 02/17/2020] [Revised: 08/30/2020] [Accepted: 01/17/2021] [Indexed: 01/02/2023] Open
Abstract
Background: Common disease-specific outcomes are vital for ensuring comparability of clinical trial data and enabling meta analyses and interstudy comparisons. Traditionally, the process of deciding which outcomes should be recommended as common for a particular disease relied on assembling and surveying panels of subject-matter experts. This is usually a time-consuming and laborious process. Objective: The objectives of this work were to develop and evaluate a generalized pipeline that can automatically identify common outcomes specific to any given disease by finding, downloading, and analyzing data of previous clinical trials relevant to that disease. Methods: An automated pipeline to interface with ClinicalTrials.gov’s application programming interface and download the relevant trials for the input condition was designed. The primary and secondary outcomes of those trials were parsed and grouped based on text similarity and ranked based on frequency. The quality and usefulness of the pipeline’s output were assessed by comparing the top outcomes identified by it for chronic obstructive pulmonary disease (COPD) to a list of 80 outcomes manually abstracted from the most frequently cited and comprehensive reviews delineating clinical outcomes for COPD. Results: The common disease-specific outcome pipeline successfully downloaded and processed 3876 studies related to COPD. Manual verification indicated that the pipeline was downloading and processing the same number of trials as were obtained from the self-service ClinicalTrials.gov portal. Evaluating the automatically identified outcomes against the manually abstracted ones showed that the pipeline achieved a recall of 92% and precision of 79%. The precision number indicated that the pipeline was identifying many outcomes that were not covered in the literature reviews. Assessment of those outcomes indicated that they are relevant to COPD and could be considered in future research. Conclusions: An automated evidence-based pipeline can identify common clinical trial outcomes of comparable breadth and quality as the outcomes identified in comprehensive literature reviews. Moreover, such an approach can highlight relevant outcomes for further consideration.
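The grouping and ranking step, and the precision and recall figures used to evaluate it, can be illustrated with a small sketch. The abstract does not say which text-similarity measure the authors used, so the standard-library `difflib.SequenceMatcher` ratio, the 0.8 threshold, the greedy grouping strategy and every outcome string below are assumptions for illustration only.

```python
from difflib import SequenceMatcher

def normalize(text):
    """Lower-case the text and collapse runs of whitespace."""
    return " ".join(text.lower().split())

def group_outcomes(outcomes, threshold=0.8):
    """Greedily attach each outcome to the first sufficiently similar group."""
    groups = []  # list of (representative, members)
    for raw in outcomes:
        text = normalize(raw)
        for representative, members in groups:
            if SequenceMatcher(None, representative, text).ratio() >= threshold:
                members.append(raw)
                break
        else:
            groups.append((text, [raw]))
    # Rank groups by how often their outcome was reported.
    groups.sort(key=lambda group: len(group[1]), reverse=True)
    return groups

def precision_recall(predicted, reference):
    """Return (precision, recall) of predicted items against a reference list."""
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted & reference)
    return true_positives / len(predicted), true_positives / len(reference)

# Hypothetical outcome strings, not taken from ClinicalTrials.gov.
outcomes = [
    "Change in FEV1", "change in  FEV1", "Changes in FEV1",
    "Number of exacerbations", "Number of exacerbations",
    "Six-minute walk distance",
]
groups = group_outcomes(outcomes)
for representative, members in groups:
    print(len(members), representative)

# Hypothetical evaluation lists, for the metric definitions only.
precision, recall = precision_recall(
    ["fev1", "exacerbations", "walk distance", "sleep quality"],
    ["fev1", "exacerbations", "mortality"],
)
```

Read this way, a precision below 100% means some identified outcomes are absent from the reference list, which is why the abstract treats them as candidates for further consideration rather than simply as errors.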
Affiliation(s)
- Anas Elghafari
- Center for Biomedical and Population Health Informatics, Icahn School of Medicine at Mount Sinai, New York, NY, United States
| | - Joseph Finkelstein
- Center for Biomedical and Population Health Informatics, Icahn School of Medicine at Mount Sinai, New York, NY, United States
| |
|
22
|
Haiman ZB, Zielinski DC, Koike Y, Yurkovich JT, Palsson BO. MASSpy: Building, simulating, and visualizing dynamic biological models in Python using mass action kinetics. PLoS Comput Biol 2021; 17:e1008208. [PMID: 33507922 PMCID: PMC7872247 DOI: 10.1371/journal.pcbi.1008208] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.8] [Received: 07/27/2020] [Revised: 02/09/2021] [Accepted: 12/21/2020] [Indexed: 01/01/2023] Open
Abstract
Mathematical models of metabolic networks utilize simulation to study system-level mechanisms and functions. Various approaches have been used to model the steady state behavior of metabolic networks using genome-scale reconstructions, but formulating dynamic models from such reconstructions continues to be a key challenge. Here, we present the Mass Action Stoichiometric Simulation Python (MASSpy) package, an open-source computational framework for dynamic modeling of metabolism. MASSpy utilizes mass action kinetics and detailed chemical mechanisms to build dynamic models of complex biological processes. MASSpy adds dynamic modeling tools to the COnstraint-Based Reconstruction and Analysis Python (COBRApy) package to provide a unified framework for constraint-based and kinetic modeling of metabolic networks. MASSpy supports high-performance dynamic simulation through its implementation of libRoadRunner: the Systems Biology Markup Language (SBML) simulation engine. Three examples are provided to demonstrate how to use MASSpy: (1) a validation of the MASSpy modeling tool through dynamic simulation of detailed mechanisms of enzyme regulation; (2) a feature demonstration using a workflow for generating an ensemble of kinetic models using Monte Carlo sampling to approximate missing numerical values of parameters and to quantify biological uncertainty; and (3) a case study in which MASSpy is utilized to overcome issues that arise when integrating experimental data with the computation of functional states of detailed biological mechanisms. MASSpy represents a powerful tool to address challenges that arise in dynamic modeling of metabolic networks, both at small and large scales.
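Mass action kinetics, the rate law the package is built on, can be shown on the smallest possible case: a single reversible reaction A <-> B with net flux v = kf[A] - kr[B]. The sketch below is plain Python with a fixed-step Euler integrator and does not use the MASSpy or libRoadRunner APIs; the function name, rate constants and initial concentrations are hypothetical.

```python
def simulate_reversible(a0, b0, kf, kr, dt=0.001, t_end=10.0):
    """Integrate A <-> B under mass action kinetics with explicit Euler steps."""
    a, b = a0, b0
    steps = int(round(t_end / dt))
    for _ in range(steps):
        net_flux = kf * a - kr * b  # forward minus reverse mass action rate
        a -= net_flux * dt
        b += net_flux * dt
    return a, b

# Hypothetical parameters: at equilibrium [B]/[A] should approach kf/kr = 2.
a_final, b_final = simulate_reversible(a0=1.0, b0=0.0, kf=2.0, kr=1.0)
print(a_final, b_final)
```

With these values the concentrations settle near [A] = 1/3 and [B] = 2/3, and their sum stays at 1 because each step moves the same amount out of A as into B. Production simulators use adaptive stiff solvers rather than fixed-step Euler.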
Affiliation(s)
- Zachary B. Haiman
- Department of Bioengineering, University of California San Diego, La Jolla, California, United States of America
| | - Daniel C. Zielinski
- Department of Bioengineering, University of California San Diego, La Jolla, California, United States of America
| | - Yuko Koike
- Department of Bioengineering, University of California San Diego, La Jolla, California, United States of America
- Institute for Systems Biology, Seattle, Washington, United States of America
| | - James T. Yurkovich
- Department of Bioengineering, University of California San Diego, La Jolla, California, United States of America
- Institute for Systems Biology, Seattle, Washington, United States of America
| | - Bernhard O. Palsson
- Department of Bioengineering, University of California San Diego, La Jolla, California, United States of America
- Novo Nordisk Foundation Center for Biosustainability, Technical University of Denmark, Kongens Lyngby, Denmark
| |
|
23
|
Du L, Liu Q, Fan Z, Tang J, Zhang X, Price M, Yue B, Zhao K. Pyfastx: a robust Python package for fast random access to sequences from plain and gzipped FASTA/Q files. Brief Bioinform 2020; 22:6042388. [PMID: 33341884 DOI: 10.1093/bib/bbaa368] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 03/01/2020] [Revised: 10/30/2020] [Accepted: 11/17/2020] [Indexed: 11/14/2022] Open
Abstract
FASTA and FASTQ are the most widely used biological data formats that have become the de facto standard to exchange sequence data between bioinformatics tools. With the avalanche of next-generation sequencing data, the amount of sequence data being deposited and accessed in FASTA/Q formats is increasing dramatically. However, the existing tools have very low efficiency at random retrieval of subsequences due to the requirement of loading the entire index into memory. In addition, most existing tools cannot build an index for large FASTA/Q files because of limited memory. Furthermore, the tools do not support random access to sequences in FASTA/Q files compressed by gzip, which is extensively adopted by most public databases to compress data for saving storage. In this study, we developed pyfastx as a versatile Python package with commonly used command-line tools to overcome the above limitations. Compared to other tools, pyfastx yielded the highest performance in terms of index building and random access to sequences, particularly when dealing with large FASTA/Q files with hundreds of millions of sequences. A key advantage of pyfastx over other tools is that it offers an efficient way to randomly extract subsequences directly from gzip compressed FASTA/Q files without needing to uncompress beforehand. Pyfastx can easily be installed from PyPI (https://pypi.org/project/pyfastx) and the source code is freely available at https://github.com/lmdu/pyfastx.
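Index-based random access, the general technique the abstract refers to, can be sketched with the standard library alone. This is not pyfastx's implementation or API: it handles only plain (uncompressed) FASTA with uniform line lengths, keeps the index in a dictionary, and uses hypothetical function names and a made-up demo file.

```python
import os
import tempfile

def build_index(path):
    """Map each name to [length, offset, bases_per_line, bytes_per_line]."""
    index = {}
    name = None
    with open(path, "rb") as handle:
        while True:
            line = handle.readline()
            if not line:
                break
            if line.startswith(b">"):
                name = line[1:].split()[0].decode()
                # The first base sits right after the header line.
                index[name] = [0, handle.tell(), 0, 0]
            elif name is not None:
                bases = len(line.rstrip(b"\r\n"))
                entry = index[name]
                if entry[2] == 0:  # first sequence line fixes the line geometry
                    entry[2], entry[3] = bases, len(line)
                entry[0] += bases
    return index

def fetch(path, index, name, start, end):
    """Return bases [start, end) (0-based) by seeking into the file."""
    length, offset, bases_per_line, bytes_per_line = index[name]
    end = min(end, length)
    if start >= end:
        return ""
    first = offset + (start // bases_per_line) * bytes_per_line + start % bases_per_line
    last = offset + (end // bases_per_line) * bytes_per_line + end % bases_per_line
    with open(path, "rb") as handle:
        handle.seek(first)
        chunk = handle.read(last - first)
    # Drop the line breaks that fall inside the requested byte range.
    return chunk.replace(b"\n", b"").replace(b"\r", b"").decode()

# Made-up demo file written to a temporary directory.
with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "demo.fa")
    with open(path, "wb") as handle:
        handle.write(b">seq1 demo\nACGTACGTAC\nGGGTTTAAAC\nCC\n>seq2\nTTTT\n")
    index = build_index(path)
    subsequence = fetch(path, index, "seq1", 8, 13)
    tail = fetch(path, index, "seq2", 1, 4)
print(subsequence, tail)
```

Only the bytes covering the requested range are read, so the cost of a lookup does not grow with file size. Random access into gzip-compressed files needs an additional index over the compressed stream, which this sketch does not attempt.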
Affiliation(s)
- Lianming Du
- Institute for Advanced Study, Chengdu University, Chengdu, China
| | - Qin Liu
- College of Life Sciences and Food Engineering, Yibin University, Yibin, China
| | - Zhenxin Fan
- Key Laboratory of Bio-Resources and Eco-Environment, Ministry of Education, College of Life Science, Sichuan University, Chengdu, China
| | - Jie Tang
- Institute for Advanced Study, Chengdu University, Chengdu, China
| | - Xiuyue Zhang
- Key Laboratory of Bio-Resources and Eco-Environment, Ministry of Education, College of Life Science, Sichuan University, Chengdu, China
| | - Megan Price
- Key Laboratory of Bio-Resources and Eco-Environment, Ministry of Education, College of Life Science, Sichuan University, Chengdu, China
| | - Bisong Yue
- Key Laboratory of Bio-Resources and Eco-Environment, Ministry of Education, College of Life Science, Sichuan University, Chengdu, China
| | - Kelei Zhao
- Institute for Advanced Study, Chengdu University, Chengdu, China
| |
|
24
|
Mura C, Chalupa M, Newbury AM, Chalupa J, Bourne PE. Ten simple rules for starting research in your late teens. PLoS Comput Biol 2020; 16:e1008403. [PMID: 33211694 PMCID: PMC7676678 DOI: 10.1371/journal.pcbi.1008403] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022] Open
Affiliation(s)
- Cameron Mura
- Department of Biomedical Engineering, University of Virginia, Charlottesville, Virginia, United States of America
- School of Data Science, University of Virginia, Charlottesville, Virginia, United States of America
- * E-mail: (CM); (PEB)
| | - Mike Chalupa
- City Neighbors Foundation, Baltimore, Maryland, United States of America
| | - Abigail M. Newbury
- Department of Biomedical Engineering, University of Virginia, Charlottesville, Virginia, United States of America
| | - Jack Chalupa
- City Neighbors Foundation, Baltimore, Maryland, United States of America
| | - Philip E. Bourne
- Department of Biomedical Engineering, University of Virginia, Charlottesville, Virginia, United States of America
- School of Data Science, University of Virginia, Charlottesville, Virginia, United States of America
- * E-mail: (CM); (PEB)
| |
|
25
|
Covalent Versus Non-covalent Enzyme Inhibition: Which Route Should We Take? A Justification of the Good and Bad from Molecular Modelling Perspective. Protein J 2020; 39:97-105. [PMID: 32072438 DOI: 10.1007/s10930-020-09884-2] [Citation(s) in RCA: 28] [Impact Index Per Article: 5.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/15/2022]
Abstract
The pace and efficiency of drug target strategies have stirred debate among researchers in the field of drug development. Covalent inhibitors possess significant advantages over non-covalent inhibitors, in that covalent warheads can target rare residues of a particular target protein, thus leading to the development of highly selective inhibitors. However, toxicity can be a real challenge related to this class of therapeutics. From the challenges of irreversible drug toxicity to the declining reactivity of reversible drugs, herein we provide justifications from the computational point of view. It was evident that both classes had their merits; however, with the increase in drug resistance, covalent inhibition seemed more suitable. There also seems to be enhanced selectivity of the covalent systems, supporting their use as a therapeutic regimen worldwide. We believe that this study will assist researchers in making informed decisions on which drug class to choose as lead compounds in the drug discovery pipeline.
|
26
|
Application of Systems Engineering Principles and Techniques in Biological Big Data Analytics: A Review. Processes (Basel) 2020. [DOI: 10.3390/pr8080951] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/18/2022] Open
Abstract
In the past few decades, we have witnessed tremendous advancements in biology, life sciences and healthcare. These advancements are due in no small part to the big data made available by various high-throughput technologies, ever-advancing computing power, and algorithmic advancements in machine learning. Specifically, big data analytics, such as statistical and machine learning, has become an essential tool in these rapidly developing fields. As a result, the subject has drawn increased attention and many review papers have been published on it in just the past few years. Unlike existing reviews, this work focuses on the application of systems engineering principles and techniques in addressing some of the common challenges in big data analytics for biological, biomedical and healthcare applications. Specifically, this review focuses on the following three key areas in biological big data analytics where systems engineering principles and techniques have been playing important roles: the principle of parsimony in addressing overfitting, the dynamic analysis of biological data, and the role of domain knowledge in biological data analytics.
|
27
|
Workflow for Data Analysis in Experimental and Computational Systems Biology: Using Python as ‘Glue’. Processes (Basel) 2019. [DOI: 10.3390/pr7070460] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/12/2022] Open
Abstract
Bottom-up systems biology entails the construction of kinetic models of cellular pathways by collecting kinetic information on the pathway components (e.g., enzymes) and collating this into a kinetic model, based for example on ordinary differential equations. This requires integration and data transfer between a variety of tools, ranging from data acquisition in kinetics experiments, to fitting and parameter estimation, to model construction, evaluation and validation. Here, we present a workflow that uses the Python programming language, specifically the modules from the SciPy stack, to facilitate this task. Starting from raw kinetics data, acquired either from spectrophotometric assays with microtitre plates or from Nuclear Magnetic Resonance (NMR) spectroscopy time-courses, we demonstrate the fitting and construction of a kinetic model using scientific Python tools. The analysis takes place in a Jupyter notebook, which keeps all information related to a particular experiment together in one place and thus serves as an e-labbook, enhancing reproducibility and traceability. The Python programming language serves as an ideal foundation for this framework because it is powerful yet relatively easy to learn for the non-programmer, has a large library of scientific routines and active user community, is open-source and extensible, and many computational systems biology software tools are written in Python or have a Python Application Programming Interface (API). Our workflow thus enables investigators to focus on the scientific problem at hand rather than worrying about data integration between disparate platforms.
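The fit-then-simulate step outlined in this abstract can be sketched in a few lines. This is a dependency-free illustration, not the authors' workflow: it assumes Michaelis-Menten kinetics, swaps the nonlinear fitting one would normally do with the SciPy stack for a linearised Lineweaver-Burk (double-reciprocal) least-squares fit, and uses forward Euler steps for the ODE. The function names and the synthetic data are hypothetical.

```python
def fit_michaelis_menten(substrate, rate):
    """Estimate (vmax, km) from a straight-line fit of 1/v against 1/[S]."""
    xs = [1.0 / s for s in substrate]
    ys = [1.0 / v for v in rate]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx                     # equals Km / Vmax
    intercept = mean_y - slope * mean_x   # equals 1 / Vmax
    vmax = 1.0 / intercept
    km = slope * vmax
    return vmax, km


def simulate_substrate(s0, vmax, km, t_end, dt=0.01):
    """Integrate dS/dt = -vmax * S / (km + S) with forward Euler steps."""
    steps = round(t_end / dt)
    s = s0
    trace = [(0.0, s)]
    for i in range(1, steps + 1):
        s = max(s - dt * vmax * s / (km + s), 0.0)
        trace.append((i * dt, s))
    return trace


# Synthetic, noise-free initial-rate data generated from assumed parameters.
true_vmax, true_km = 2.0, 0.5
conc = [0.1, 0.25, 0.5, 1.0, 2.0, 5.0]
rates = [true_vmax * s / (true_km + s) for s in conc]

vmax, km = fit_michaelis_menten(conc, rates)
trace = simulate_substrate(1.0, vmax, km, t_end=1.0)
print(round(vmax, 6), round(km, 6), len(trace))
```

With real, noisy data the double-reciprocal transform distorts the error structure, so a nonlinear least-squares routine and an adaptive ODE solver would usually be preferable; the sketch only shows how fitted parameters flow into a kinetic model.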
|
28
|
Rakov AV, Mastriani E, Liu SL, Schifferli DM. Association of Salmonella virulence factor alleles with intestinal and invasive serovars. BMC Genomics 2019; 20:429. [PMID: 31138114 PMCID: PMC6540521 DOI: 10.1186/s12864-019-5809-8] [Citation(s) in RCA: 24] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/12/2018] [Accepted: 05/20/2019] [Indexed: 12/12/2022] Open
Abstract
BACKGROUND The role of Salmonella virulence factor (VF) allelic variation in modulating pathogenesis or host specificity has only been demonstrated in a few cases, mostly through serendipitous findings. Virulence factor (VF) alleles from Salmonella enterica subsp. enterica genomes were compared to identify potential associations with the host-adapted invasive serovars Typhi, Dublin, Choleraesuis, and Gallinarum, and with the broad host-range intestinal serovars Typhimurium, Enteritidis, and Newport. RESULTS Through a bioinformatics analysis of 500 Salmonella genomes, we have identified allelic variants of 70 VFs, many of which are associated with either one of the four host-adapted invasive Salmonella serovars or one of the three broad host-range intestinal serovars. In addition, associations between specific VF alleles and intra-serovar clusters, sequence types (STs) and/or host-adapted FimH adhesins were identified. Moreover, new allelic VF associations with non-typhoidal S. Enteritidis and S. Typhimurium (NTS) or invasive NTS (iNTS) were detected. CONCLUSIONS By analogy to the previously shown association of specific FimH adhesin alleles with optimal binding by host adapted Salmonella serovars, lineages or strains, we predict that some of the identified association of other VF alleles with host-adapted serovars, lineages or strains will reflect specific contributions to host adaptation and/or pathogenesis. The identification of these allelic associations will support investigations of the biological impact of VF alleles and better characterize the role of allelic variation in Salmonella pathogenesis. Most relevant functional experiments will test the potential causal contribution of the detected FimH-associated VF variants in host adapted virulence.
Affiliation(s)
- Alexey V. Rakov
- Department of Pathobiology, School of Veterinary Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
- Present Address: Somov Institute of Epidemiology and Microbiology, Vladivostok, Russia
| | - Emilio Mastriani
- Systemomics Center, College of Pharmacy, Genomics Research Center, State-Province Key Laboratories of Biomedicine-Pharmaceutics of China, Harbin Medical University, Harbin, China
- HMU-UCCSM Centre for Infection and Genomics, Harbin Medical University, Harbin, China
| | - Shu-Lin Liu
- Systemomics Center, College of Pharmacy, Genomics Research Center, State-Province Key Laboratories of Biomedicine-Pharmaceutics of China, Harbin Medical University, Harbin, China
- HMU-UCCSM Centre for Infection and Genomics, Harbin Medical University, Harbin, China
- Department of Microbiology, Immunology and Infectious Diseases, University of Calgary, Calgary, Canada
| | - Dieter M. Schifferli
- Department of Pathobiology, School of Veterinary Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
| |
|
29
|
Affiliation(s)
| | - Cameron Mura
- Dept of Biomedical Engineering, University of Virginia, Charlottesville, Virginia, United States of America
| |
|
30
|
Mariano D, Martins P, Helene Santos L, de Melo-Minardi RC. Introducing Programming Skills for Life Science Students. BIOCHEMISTRY AND MOLECULAR BIOLOGY EDUCATION : A BIMONTHLY PUBLICATION OF THE INTERNATIONAL UNION OF BIOCHEMISTRY AND MOLECULAR BIOLOGY 2019; 47:288-295. [PMID: 30860646 DOI: 10.1002/bmb.21230] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 10/29/2018] [Revised: 01/18/2019] [Accepted: 02/18/2019] [Indexed: 05/04/2023]
Abstract
The advent of high-throughput next-generation sequencing has produced a large amount of biological data. Knowledge discovery from the huge amount of available biological data requires researchers to develop solid skills in biology and computer science. As the majority of bioinformatics professionals are either computer science or life sciences graduates, teaching biology skills to computer science students and computational skills to life science students has become common. In this article, we report the experience of teaching programming to life science students. Our strategy consists of explaining basic concepts of algorithms, abstraction of biological problems, and script programming using the Python language. Based on the students' answers to an assessment questionnaire, we conclude that the course achieved positive results. They reported an improvement in their skills in programming and bioinformatics. Furthermore, the students approved of the didactic approach adopted in the classes and of the evaluation methods (programming exercises and final presentation). This article is useful for other professors who want to implement initial bioinformatics training for undergraduate or graduate students in life sciences. We believe that the strategies demonstrated here could be reproduced, which could help in the formation of a new generation of bioinformaticians with hybrid abilities in computation and biology. © 2019 International Union of Biochemistry and Molecular Biology, 47(3):288-295, 2019.
Affiliation(s)
- Diego Mariano
- Laboratory of Bioinformatics and Systems (LBS), Department of Computer Science, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil
| | - Pedro Martins
- Laboratory of Bioinformatics and Systems (LBS), Department of Computer Science, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil
| | - Lucianna Helene Santos
- Laboratory of Bioinformatics and Systems (LBS), Department of Computer Science, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil
| | - Raquel Cardoso de Melo-Minardi
- Laboratory of Bioinformatics and Systems (LBS), Department of Computer Science, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil
| |
|
31
|
Mathema VB, Dondorp AM, Imwong M. OSTRFPD: Multifunctional Tool for Genome-Wide Short Tandem Repeat Analysis for DNA, Transcripts, and Amino Acid Sequences with Integrated Primer Designer. Evol Bioinform Online 2019; 15:1176934319843130. [PMID: 31040636 PMCID: PMC6482647 DOI: 10.1177/1176934319843130] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/26/2019] [Accepted: 03/15/2019] [Indexed: 01/18/2023] Open
Abstract
Microsatellite mining is a common outcome of the in silico approach to genomic studies. The resulting short tandemly repeated DNA could be used as molecular markers for studying polymorphism, genotyping and forensics. The omni short tandem repeat finder and primer designer (OSTRFPD) is among the few versatile, platform-independent open-source tools written in Python that enables researchers to identify and analyse genome-wide short tandem repeats in both nucleic acids and protein sequences. OSTRFPD is designed to run either in a user-friendly fully featured graphical interface or in a command line interface mode for advanced users. OSTRFPD can detect both perfect and imperfect repeats of low complexity with customisable scores. Moreover, the software has built-in architecture to simultaneously filter selection of flanking regions in DNA and generate microsatellite-targeted primers implementing the Primer3 platform. The software has built-in motif-sequence generator engines and an additional option to use the dictionary mode for custom motif searches. The software generates search results including general statistics containing motif categorisation, repeat frequencies, densities, coverage, guanine–cytosine (GC) content, and simple text-based imperfect alignment visualisation. Thus, OSTRFPD presents users with a quick single-step solution package to assist development of microsatellite markers and categorise tandemly repeated amino acids in proteome databases. Practical implementation of OSTRFPD was demonstrated using publicly available whole-genome sequences of selected Plasmodium species. OSTRFPD is freely available and open-sourced for improvement and user-specific adaptation.
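Detection of perfect short tandem repeats, one of the tasks this abstract describes, can be sketched with a regular expression. This is an illustrative example only and not OSTRFPD's algorithm: the function name `find_perfect_strs`, the default thresholds, and the restriction to the DNA alphabet are assumptions, and imperfect repeats, scoring, and primer design are not covered.

```python
import re


def find_perfect_strs(sequence, min_unit=1, max_unit=6, min_repeats=3):
    """Return (start, end, motif, copies) for each perfect tandem repeat found."""
    # A lazily matched motif of min_unit..max_unit bases, followed by at
    # least (min_repeats - 1) further copies of that same motif.
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_repeats - 1)
    )
    hits = []
    for match in pattern.finditer(sequence.upper()):
        motif = match.group(1)
        copies = len(match.group(0)) // len(motif)
        hits.append((match.start(), match.end(), motif, copies))
    return hits


def gc_content(sequence):
    """Fraction of G and C bases in a sequence (0.0 for an empty string)."""
    seq = sequence.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


demo = "TTG" + "CA" * 5 + "GC" + "AAT" * 3 + "C"
for start, end, motif, copies in find_perfect_strs(demo):
    print(start, end, motif, copies)
```

Because the motif group is matched lazily, the shortest repeating unit is reported (for example `A` rather than `AA` in a homopolymer run). Coordinates are 0-based with an exclusive end, as in Python slicing.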
Affiliation(s)
- Vivek Bhakta Mathema
- Department of Molecular Tropical Medicine and Genetics, Faculty of Tropical Medicine, Mahidol University, Bangkok, Thailand
| | - Arjen M Dondorp
- Mahidol-Oxford Tropical Medicine Research unit, Faculty of Tropical Medicine, Mahidol University, Bangkok, Thailand
- Centre for Tropical Medicine, Churchill Hospital, Oxford, UK
| | - Mallika Imwong
- Department of Molecular Tropical Medicine and Genetics, Faculty of Tropical Medicine, Mahidol University, Bangkok, Thailand
- Mallika Imwong, Department of Molecular Tropical Medicine and Genetics, Faculty of Tropical Medicine, Mahidol University, Bangkok 10400, Thailand.
| |
|
32
|
Wang G, Peng B. Script of Scripts: A pragmatic workflow system for daily computational research. PLoS Comput Biol 2019; 15:e1006843. [PMID: 30811390 PMCID: PMC6411228 DOI: 10.1371/journal.pcbi.1006843] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/06/2018] [Revised: 03/11/2019] [Accepted: 01/29/2019] [Indexed: 01/22/2023] Open
Abstract
Computationally intensive disciplines such as computational biology often require use of a variety of tools implemented in different scripting languages and analysis of large data sets using high-performance computing systems. Although scientific workflow systems can powerfully organize and execute large-scale data-analysis processes, creating and maintaining such workflows usually comes with nontrivial learning curves and engineering overhead, making them cumbersome to use for everyday data exploration and prototyping. To bridge the gap between interactive analysis and workflow systems, we developed Script of Scripts (SoS), an interactive data-analysis platform and workflow system with a strong emphasis on readability, practicality, and reproducibility in daily computational research. For exploratory analysis, SoS has a multilanguage scripting format that centralizes otherwise-scattered scripts and creates dynamic reports for publication and sharing. As a workflow engine, SoS provides an intuitive syntax for creating workflows in process-oriented, outcome-oriented, and mixed styles, as well as a unified interface for executing and managing tasks on a variety of computing platforms with automatic synchronization of files among isolated file systems. As illustrated herein by real-world examples, SoS is both an interactive analysis tool and pipeline platform suitable for different stages of method development and data-analysis projects. In particular, SoS can be easily adopted in existing data analysis routines to substantially improve organization, readability, and cross-platform computation management of research projects.
Affiliation(s)
- Gao Wang
- Department of Human Genetics, The University of Chicago, Chicago, IL, United States of America
| | - Bo Peng
- Department of Bioinformatics and Computational Biology, The University of Texas MD Anderson Cancer Center, Houston, TX, United States of America
- * E-mail:
| |
|
33
|
Diaz-del-Pino S, Rodriguez-Brazzarola P, Perez-Wohlfeil E, Trelles O. Combining Strengths for Multi-genome Visual Analytics Comparison. Bioinform Biol Insights 2019; 13:1177932218825127. [PMID: 30783378 PMCID: PMC6365554 DOI: 10.1177/1177932218825127] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/17/2018] [Accepted: 12/22/2018] [Indexed: 11/25/2022] Open
Abstract
The emergence of data acquisition technologies has shifted the bottleneck in molecular biology research from data acquisition to data analysis. Such is the case in Comparative Genomics, where sequence analysis has transitioned from genes to genomes several orders of magnitude larger. This fact has revealed the need to adapt software to work with huge experiments efficiently and to incorporate new data-analysis strategies to manage results from such studies. In previous work, we presented GECKO, software for comparing large sequences; now we address the representation, browsing, data exploration, and post-processing of the massive amount of information derived from such comparisons. GECKO-MGV is a web-based application organized as a client-server architecture. It is aimed at visual analysis of the results from both pairwise and multiple sequence comparison studies, combining a set of common commands for image exploration with improved state-of-the-art solutions. In addition, GECKO-MGV integrates different visualization analysis tools while exploiting the concept of layers to display multiple genome comparison datasets. Moreover, the software is endowed with capabilities for contacting external proprietary and third-party services for further data post-processing and also presents a method to display a timeline of large-scale evolutionary events. As proof-of-concept, we present 2 exercises using bacterial and mammalian genomes which depict the capabilities of GECKO-MGV to perform in-depth, customizable analyses on the fly using web technologies. The first exercise is mainly descriptive and is carried out over bacterial genomes, whereas the second one aims to show the ability to deal with large sequence comparisons. In this case, we display results from the comparison of the first Homo sapiens chromosome against the first 5 chromosomes of Mus musculus.
Affiliation(s)
- Sergio Diaz-del-Pino
- Department of Computer Architecture, University of Málaga and Instituto de Investigación Biomédica de Málaga (IBIMA), Málaga, Spain
| | - Pablo Rodriguez-Brazzarola
- Department of Computer Architecture, University of Málaga and Instituto de Investigación Biomédica de Málaga (IBIMA), Málaga, Spain
| | - Esteban Perez-Wohlfeil
- Department of Computer Architecture, University of Málaga and Instituto de Investigación Biomédica de Málaga (IBIMA), Málaga, Spain
| | - Oswaldo Trelles
- Department of Computer Architecture, University of Málaga and Instituto de Investigación Biomédica de Málaga (IBIMA), Málaga, Spain
| |
|
34
|
Erickson RA, Fienen MN, McCalla SG, Weiser EL, Bower ML, Knudson JM, Thain G. Wrangling distributed computing for high-throughput environmental science: An introduction to HTCondor. PLoS Comput Biol 2018; 14:e1006468. [PMID: 30281592 PMCID: PMC6169842 DOI: 10.1371/journal.pcbi.1006468] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/22/2022] Open
Abstract
Biologists and environmental scientists now routinely solve computational problems that were unimaginable a generation ago. Examples include processing geospatial data, analyzing -omics data, and running large-scale simulations. Conventional desktop computing cannot handle these tasks when they are large, and high-performance computing is not always available nor the most appropriate solution for all computationally intense problems. High-throughput computing (HTC) is one method for handling computationally intense research. In contrast to high-performance computing, which uses a single "supercomputer," HTC can distribute tasks over many computers (e.g., idle desktop computers, dedicated servers, or cloud-based resources). HTC facilities exist at many academic and government institutes and are relatively easy to create from commodity hardware. Additionally, consortia such as Open Science Grid facilitate HTC, and commercial entities sell cloud-based solutions for researchers who lack HTC at their institution. We provide an introduction to HTC for biologists and environmental scientists. Our examples from biology and the environmental sciences use HTCondor, an open source HTC system. Computational biology often requires processing large amounts of data, running many simulations, or other computationally intensive tasks. In this hybrid primer/tutorial, we describe how high-throughput computing (HTC) can be used to solve these problems. First, we present an overview of high-throughput computing. Second, we describe how to break jobs down so that they can run with HTC. Third, we describe how to use HTCondor software as a method for HTC. Fourth, we describe how HTCondor may be applied to other situations and a series of online tutorials.
Affiliation(s)
- Richard A. Erickson
- Upper Midwest Environmental Sciences Center, United States Geological Survey, La Crosse, Wisconsin, United States of America
- * E-mail:
| | - Michael N. Fienen
- Wisconsin Water Science Center, United States Geological Survey, Middleton, Wisconsin, United States of America
| | - S. Grace McCalla
- Upper Midwest Environmental Sciences Center, United States Geological Survey, La Crosse, Wisconsin, United States of America
| | - Emily L. Weiser
- Upper Midwest Environmental Sciences Center, United States Geological Survey, La Crosse, Wisconsin, United States of America
| | - Melvin L. Bower
- Upper Midwest Environmental Sciences Center, United States Geological Survey, La Crosse, Wisconsin, United States of America
| | - Jonathan M. Knudson
- Upper Midwest Environmental Sciences Center, United States Geological Survey, La Crosse, Wisconsin, United States of America
| | - Greg Thain
- Department of Computer Science, University of Wisconsin–Madison, Madison, Wisconsin, United States of America
| |
|
35
|
Garcia-Milian R, Hersey D, Vukmirovic M, Duprilot F. Data challenges of biomedical researchers in the age of omics. PeerJ 2018; 6:e5553. [PMID: 30221093 PMCID: PMC6138043 DOI: 10.7717/peerj.5553] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.9] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/07/2018] [Accepted: 08/10/2018] [Indexed: 12/17/2022] Open
Abstract
BACKGROUND High-throughput technologies are rapidly generating large amounts of diverse omics data. Although this offers a great opportunity, it also poses great challenges as data analysis becomes more complex. The purpose of this study was to identify the main challenges researchers face in analyzing data, and how academic libraries can support them in this endeavor. METHODS A multimodal needs assessment analysis combined an online survey sent to 860 Yale-affiliated researchers (176 responded) and 15 in-depth one-on-one semi-structured interviews. Interviews were recorded, transcribed, and analyzed using NVivo 10 software according to the thematic analysis approach. RESULTS The survey response rate was 20%. Most respondents (78%) identified lack of adequate data analysis training (e.g., R, Python) as a main challenge, in addition to not having the proper database or software (54%) to expedite analysis. Two main themes emerged from the interviews: personnel and training needs. Researchers feel they could improve data analyses practices by having better access to the appropriate bioinformatics expertise, and/or training in data analyses tools. They also reported lack of time to acquire expertise in using bioinformatics tools and poor understanding of the resources available to facilitate analysis. CONCLUSIONS The main challenges identified by our study are: lack of adequate training for data analysis (including need to learn scripting language), need for more personnel at the University to provide data analysis and training, and inadequate communication between bioinformaticians and researchers. The authors identified the positive impact of medical and/or science libraries by establishing bioinformatics support to researchers.
Affiliation(s)
- Rolando Garcia-Milian
- Bioinformatics Support Program, Research and Education Services, Cushing/Whitney Medical Library, Yale University, New Haven, CT, United States of America
| | - Denise Hersey
- Science Libraries, Lewis Science Library, Princeton University, Princeton, NJ, United States of America
| | - Milica Vukmirovic
- Pulmonary Critical Care & Sleep Medicine, Yale School of Medicine, Yale University, New Haven, CT, United States of America
| | - Fanny Duprilot
- Service commun de la documentation, Université Denis Diderot (Paris VII), Paris, France
| |
|
36
|
Gauthier J, Vincent AT, Charette SJ, Derome N. A brief history of bioinformatics. Brief Bioinform 2018; 20:1981-1996. [DOI: 10.1093/bib/bby063] [Citation(s) in RCA: 59] [Impact Index Per Article: 8.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/26/2018] [Revised: 06/22/2018] [Indexed: 02/06/2023] Open
Abstract
It is easy for today’s students and researchers to believe that modern bioinformatics emerged recently to assist next-generation sequencing data analysis. However, the very beginnings of bioinformatics occurred more than 50 years ago, when desktop computers were still a hypothesis and DNA could not yet be sequenced. The foundations of bioinformatics were laid in the early 1960s with the application of computational methods to protein sequence analysis (notably, de novo sequence assembly, biological sequence databases and substitution models). Later on, DNA analysis also emerged due to parallel advances in (i) molecular biology methods, which allowed easier manipulation of DNA, as well as its sequencing, and (ii) computer science, which saw the rise of increasingly miniaturized and more powerful computers, as well as novel software better suited to handle bioinformatics tasks. In the 1990s through the 2000s, major improvements in sequencing technology, along with reduced costs, gave rise to an exponential increase of data. The arrival of ‘Big Data’ has laid out new challenges in terms of data mining and management, calling for more expertise from computer science into the field. Coupled with an ever-increasing amount of bioinformatics tools, biological Big Data had (and continues to have) profound implications on the predictive power and reproducibility of bioinformatics results. To overcome this issue, universities are now fully integrating this discipline into the curriculum of biology students. Recent subdisciplines such as synthetic biology, systems biology and whole-cell modeling have emerged from the ever-increasing complementarity between computer science and biology.
Affiliation(s)
- Jeff Gauthier
- Institut de Biologie Intégrative et des Systèmes (IBIS), Département de Biologie, Université Laval, 1030, av. de la Médecine, Québec, Canada
| | - Antony T Vincent
- INRS-Institut Armand-Frappier, Bacterial Symbionts Evolution, 531 boul. des Prairies, Laval, QC, Canada
| | - Steve J Charette
- Centre de Recherche de l'Institut Universitaire de Cardiologie et de Pneumologie de Québec (CRIUCPQ), 2725 Chemin Sainte-Foy, Québec, QC, Canada
- Département de Biochimie, de Microbiologie et de Bio-informatique, Université Laval, Québec, Canada
| | - Nicolas Derome
- Institut de Biologie Intégrative et des Systèmes (IBIS), Département de Biologie, Université Laval, 1030, av. de la Médecine, Québec, Canada
| |
|
37
|
Smith JK, Jiang S, Pfaendtner J. Redefining the Protein-Protein Interface: Coarse Graining and Combinatorics for an Improved Understanding of Amino Acid Contributions to the Protein-Protein Binding Affinity. LANGMUIR : THE ACS JOURNAL OF SURFACES AND COLLOIDS 2017; 33:11511-11517. [PMID: 28850233 DOI: 10.1021/acs.langmuir.7b02438] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/07/2023]
Abstract
The ability to intervene in biological pathways has for decades been limited by the lack of a quantitative description of protein-protein interactions (PPIs). Herein we generate and compare millions of simple PPI models for insight into the mechanisms of specific recognition and binding. We use a coarse-grained approach whereby amino acids are counted in the interface, and these counts are used as binding affinity predictors. We perform lasso regression, a modern regression technique aimed at interpretability, with every possible amino acid combination (over 10^6 unique feature sets) to select only those amino acid predictors that provide more information than noise. This approach circumvents arbitrary binning and assumptions about the binding environment that obscure other binding affinity models. Aggregated analysis of these models trained at various interfacial cutoff distances informs the roles of specific amino acids in different binding contexts. We find that a simple amino acid count model outperforms detailed intermolecular contact and binned residue type models. We identify the prevalence of serine, glycine, and tryptophan in the interface as particularly important for predicting binding affinity across a range of distance cutoffs. Although current sample size limitations prevent a robust consensus model for binding affinity prediction, our approach underscores the relevance of a residue-based description of the protein-protein interface to increase our understanding of specific interactions.
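The coarse-grained featurisation described in this abstract (count the amino acids in the interface, then consider combinations of those counts as predictors) can be sketched as follows. The details are assumptions made for illustration and are not taken from the paper: each residue is reduced to a single coordinate, the interface is defined by a plain distance cutoff, and a "feature set" is taken to be any non-empty subset of the 20 standard amino acids. The lasso regression step itself is omitted.

```python
import math
from collections import Counter
from itertools import combinations

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard one-letter codes


def interface_residues(chain_a, chain_b, cutoff):
    """Residues from either chain lying within `cutoff` of any residue in the other.

    Each chain is a list of (one_letter_code, (x, y, z)) tuples; using one
    point per residue is a simplifying assumption of this sketch.
    """
    picked = []
    for this, other in ((chain_a, chain_b), (chain_b, chain_a)):
        for code, xyz in this:
            if any(math.dist(xyz, other_xyz) <= cutoff for _, other_xyz in other):
                picked.append(code)
    return picked


def count_features(residues):
    """Vector of per-amino-acid counts, ordered as in AMINO_ACIDS."""
    counts = Counter(residues)
    return [counts.get(aa, 0) for aa in AMINO_ACIDS]


def feature_sets(alphabet):
    """Yield every non-empty subset of `alphabet` as a tuple."""
    for size in range(1, len(alphabet) + 1):
        yield from combinations(alphabet, size)


def n_feature_sets(n_predictors):
    """Number of non-empty subsets of n_predictors candidate predictors."""
    return 2 ** n_predictors - 1


chain_a = [("S", (0.0, 0.0, 0.0)), ("G", (10.0, 0.0, 0.0))]
chain_b = [("W", (3.0, 0.0, 0.0)), ("A", (30.0, 0.0, 0.0))]
print(interface_residues(chain_a, chain_b, cutoff=5.0))
print(n_feature_sets(len(AMINO_ACIDS)))
```

Under the subset assumption, 20 candidate predictors give 2^20 - 1 = 1,048,575 feature sets, which is consistent with the "over 10^6" figure in the abstract, although the abstract does not state how that number was obtained.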
Affiliation(s)
- Josh K Smith
- Department of Chemical Engineering, University of Washington, Seattle, Washington 98195, United States
- Shaoyi Jiang
- Department of Chemical Engineering, University of Washington, Seattle, Washington 98195, United States
- Jim Pfaendtner
- Department of Chemical Engineering, University of Washington, Seattle, Washington 98195, United States
|
38
|
Picard V, Mulner-Lorillon O, Bourdon J, Morales J, Cormier P, Siegel A, Bellé R. Model of the delayed translation of cyclin B maternal mRNA after sea urchin fertilization. Mol Reprod Dev 2016; 83:1070-1082. [PMID: 27699901 DOI: 10.1002/mrd.22746] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 07/20/2016] [Accepted: 10/01/2016] [Indexed: 01/24/2023]
Abstract
Sea urchin eggs exhibit a cap-dependent increase in protein synthesis within minutes after fertilization. This rise in protein synthesis occurs at a constant rate for a great number of proteins translated from the different available mRNAs. Surprisingly, we found that cyclin B, a major cell-cycle regulator, follows a synthesis pattern that is distinct from the global protein population, so we developed a mathematical model to analyze this dissimilarity in biosynthesis kinetic patterns. The model includes two pathways for cyclin B mRNA entry into the translational machinery: one from immediately available mRNA (mRNAcyclinB) and one from mRNA activated solely after fertilization (XXmRNAcyclinB). Two coefficients, α and β, were added to fit the measured scales of global protein and cyclin B synthesis, respectively. The model was simplified to identify the synthesis parameters and to allow its simulation. The calculated parameters for activation of the specific cyclin B synthesis pathway after fertilization included a kinetic constant (k_a) of 0.024 sec^-1, for the activation of XXmRNAcyclinB, and a critical time interval (t_2) of 42 min. The proportion of XXmRNAcyclinB form was also calculated to be largely dominant over the mRNAcyclinB form. Regulation of cyclin B biosynthesis is an example of a select protein whose translation is controlled by pathways that are distinct from housekeeping proteins, even though both involve the same cap-dependent initiation pathway. Therefore, this model should help provide insight to the signaling utilized for the biosynthesis of cyclin B and other select proteins. Mol. Reprod. Dev. 83: 1070-1082, 2016. © 2016 Wiley Periodicals, Inc.
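To give a feel for the scale of the reported activation constant, the short Python sketch below converts it into a half-time. The abstract does not state the rate law, so simple first-order kinetics is assumed here purely for illustration; the function names are hypothetical and this is not the authors' model.

```python
import math

K_A = 0.024  # s^-1, activation constant reported in the abstract


def fraction_activated(t_seconds, k=K_A):
    """Fraction converted after t seconds, ASSUMING first-order kinetics."""
    return 1.0 - math.exp(-k * t_seconds)


def half_time(k=K_A):
    """Time to half conversion under the same assumption: ln(2) / k."""
    return math.log(2) / k


print(round(half_time(), 1))  # 28.9 (seconds)
print(fraction_activated(60))
```

Under this assumption the activation step would be half complete in roughly half a minute, which is short compared with the 42 min critical interval quoted in the abstract.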
Affiliation(s)
- Vincent Picard
- CNRS UMR 6241, Laboratoire LINA, Université de Nantes, Nantes, France; CNRS, IRISA-UMR 6074, Campus de Beaulieu, Rennes, France; INRIA, Centre Rennes-Bretagne Atlantique, Symbiose, Campus de Beaulieu, Rennes, France
- Odile Mulner-Lorillon
- Sorbonne Universités, UPMC Univ Paris 06, UMR 8227, Integrative Biology of Marine Models, Translation Cell Cycle and Development, Station Biologique de Roscoff, Roscoff Cedex, France; CNRS, UMR 8227, Integrative Biology of Marine Models, Translation Cell Cycle and Development, Station Biologique de Roscoff, Roscoff Cedex, France
- Jérémie Bourdon
- CNRS UMR 6241, Laboratoire LINA, Université de Nantes, Nantes, France
- Julia Morales
- Sorbonne Universités, UPMC Univ Paris 06, UMR 8227, Integrative Biology of Marine Models, Translation Cell Cycle and Development, Station Biologique de Roscoff, Roscoff Cedex, France; CNRS, UMR 8227, Integrative Biology of Marine Models, Translation Cell Cycle and Development, Station Biologique de Roscoff, Roscoff Cedex, France
- Patrick Cormier
- Sorbonne Universités, UPMC Univ Paris 06, UMR 8227, Integrative Biology of Marine Models, Translation Cell Cycle and Development, Station Biologique de Roscoff, Roscoff Cedex, France; CNRS, UMR 8227, Integrative Biology of Marine Models, Translation Cell Cycle and Development, Station Biologique de Roscoff, Roscoff Cedex, France
- Anne Siegel
- CNRS, IRISA-UMR 6074, Campus de Beaulieu, Rennes, France; INRIA, Centre Rennes-Bretagne Atlantique, Symbiose, Campus de Beaulieu, Rennes, France
- Robert Bellé
- Sorbonne Universités, UPMC Univ Paris 06, UMR 8227, Integrative Biology of Marine Models, Translation Cell Cycle and Development, Station Biologique de Roscoff, Roscoff Cedex, France; CNRS, UMR 8227, Integrative Biology of Marine Models, Translation Cell Cycle and Development, Station Biologique de Roscoff, Roscoff Cedex, France
|