1
Scapolatiello A, Boscari E, Schiavon L, Vitulo N, Congiu L. Intronomics-MIP: a snakemake pipeline for analyzing multilocus intron polymorphisms in species identification and population genomics. BMC Res Notes 2025; 18:203. [PMID: 40325430 PMCID: PMC12054228 DOI: 10.1186/s13104-025-07264-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/28/2024] [Accepted: 04/22/2025] [Indexed: 05/07/2025] Open
Abstract
In this Research Note, we introduce Intronomics-MIP, a snakemake-based pipeline for the automated analysis of multi-locus intron polymorphisms (MIPs) using intron-targeted amplicon sequencing. Building on established methodologies, our pipeline integrates tools such as Cutadapt, FLASH, and SeekDeep to efficiently process and analyze highly variable intron regions. These MIPs serve as powerful multiple-allelic markers, primarily useful for distinguishing species, identifying cryptic species, disentangling species complexes and detecting hybridization, but can also be informative for assessing population structure without prior species knowledge. Our pipeline enhances reproducibility and scalability, making it adaptable to a wide range of taxa, with a specific demonstration on teleost species. We provide a comprehensive overview of the pipeline's design, along with performance assessments using representative datasets.
Affiliation(s)
- A Scapolatiello
  - Department of Biology, University of Padova, Via Ugo Bassi 58B, 35121, Padua, Italy
- E Boscari
  - Department of Biology, University of Padova, Via Ugo Bassi 58B, 35121, Padua, Italy
- L Schiavon
  - Department of Biology, University of Padova, Via Ugo Bassi 58B, 35121, Padua, Italy
- N Vitulo
  - Department of Biotechnology, University of Verona, Strada le Grazie, 15, 37134, Verona, Italy
- L Congiu
  - Department of Biology, University of Padova, Via Ugo Bassi 58B, 35121, Padua, Italy
  - Consorzio Nazionale Interuniversitario Per le Scienze del Mare (CoNISMa), Piazzale Flaminio 9, 00196, Rome, Italy
  - National Biodiversity Future Center, Palermo, Italy
2
Saparov A, Zech M. Big data and transformative bioinformatics in genomic diagnostics and beyond. Parkinsonism Relat Disord 2025; 134:107311. [PMID: 39924354 DOI: 10.1016/j.parkreldis.2025.107311] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/11/2024] [Revised: 01/23/2025] [Accepted: 01/25/2025] [Indexed: 02/11/2025]
Abstract
The current era of high-throughput analysis-driven research offers invaluable insights into disease etiologies, accurate diagnostics, pathogenesis, and personalized therapy. In the field of movement disorders, investigators are facing an increasing growth in the volume of produced patient-derived datasets, providing substantial opportunities for precision medicine approaches based on extensive information accessibility and advanced annotation practices. Integrating data from multiple sources, including phenomics, genomics, and multi-omics, is crucial for comprehensively understanding different types of movement disorders. Here, we explore formats and analytics of big data generated for patients with movement disorders, including strategies to meaningfully share the data for optimized patient benefit. We review computational methods that are essential to accelerate the process of evaluating the increasing amounts of specialized data collected. Based on concrete examples, we highlight how bioinformatic approaches facilitate the translation of multidimensional biological information into clinically relevant knowledge. Moreover, we outline the feasibility of computer-aided therapeutic target evaluation, and we discuss the importance of expanding the focus of big data research to understudied phenotypes such as dystonia.
Affiliation(s)
- Alice Saparov
  - Institute of Human Genetics, Technical University of Munich, School of Medicine and Health, Munich, Germany; Institute of Neurogenomics, Helmholtz Munich, Neuherberg, Germany; Institute for Advanced Study, Technical University of Munich, Garching, Germany
- Michael Zech
  - Institute of Human Genetics, Technical University of Munich, School of Medicine and Health, Munich, Germany; Institute of Neurogenomics, Helmholtz Munich, Neuherberg, Germany; Institute for Advanced Study, Technical University of Munich, Garching, Germany
3
Goclowski CL, Jakiela J, Collins T, Hiltemann S, Howells M, Loach M, Manning J, Moreno P, Ostrovsky A, Rasche H, Tekman M, Tyson G, Videm P, Bacon W. Galaxy as a gateway to bioinformatics: Multi-Interface Galaxy Hands-on Training Suite (MIGHTS) for scRNA-seq. Gigascience 2025; 14:giae107. [PMID: 39775842 PMCID: PMC11707610 DOI: 10.1093/gigascience/giae107] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/13/2024] [Revised: 10/28/2024] [Accepted: 11/26/2024] [Indexed: 01/11/2025] Open
Abstract
BACKGROUND Bioinformatics is fundamental to biomedical sciences, but its mastery presents a steep learning curve for bench biologists and clinicians. Learning to code while analyzing data is difficult. The curve may be flattened by separating these two aspects and providing intermediate steps for budding bioinformaticians. Single-cell analysis is in great demand from biologists and biomedical scientists, as evidenced by the proliferation of training events, materials, and collaborative global efforts like the Human Cell Atlas. However, iterative analyses lacking reinstantiation, coupled with unstandardized pipelines, have made effective single-cell training a moving target. FINDINGS To address these challenges, we present a Multi-Interface Galaxy Hands-on Training Suite (MIGHTS) for single-cell RNA sequencing (scRNA-seq) analysis, which offers parallel analytical methods using a graphical interface (buttons) or code. With clear, interoperable materials, MIGHTS facilitates smooth transitions between environments. Bridging the biologist-programmer gap, MIGHTS emphasizes interdisciplinary communication for effective learning at all levels. Real-world data analysis in MIGHTS promotes critical thinking and best practices, while FAIR data principles ensure validation of results. MIGHTS is freely available, hosted on the Galaxy Training Network, and leverages Galaxy interfaces for analyses in both settings. Given the ongoing popularity of Python-based (Scanpy) and R-based (Seurat & Monocle) scRNA-seq analyses, MIGHTS enables analyses using both. CONCLUSIONS MIGHTS consists of 11 tutorials, including recordings, slide decks, and interactive visualizations, and a demonstrated track record of sustainability via regular updates and community collaborations. Parallel pathways in MIGHTS enable concurrent training of scientists at any programming level, addressing the heterogeneous needs of novice bioinformaticians.
Affiliation(s)
- Camila L Goclowski
  - Eccles Institute of Human Genetics, University of Utah, Salt Lake City, UT, 84112, USA
- Julia Jakiela
  - School of Chemistry, University of Edinburgh, Edinburgh, EH9 3FJ, UK
- Tyler Collins
  - Department of Computer Science, Johns Hopkins Medical Institution, Baltimore, MD, 21224, USA
- Saskia Hiltemann
  - Erasmus Medical Center, Rotterdam, Zuid-Holland, 3015 GD, Netherlands
- Morgan Howells
  - School of Computing & Communications, The Open University, Milton Keynes, Buckinghamshire, MK7 6AA, UK
- Marisa Loach
  - School of Life, Health & Chemical Sciences, The Open University, Milton Keynes, Buckinghamshire, MK7 6AA, UK
- Jonathan Manning
  - European Bioinformatics Institute, European Molecular Biology Laboratory, Hinxton, CB10 1SD, UK
- Pablo Moreno
  - Early Computational Oncology, AstraZeneca, Cambridge, CB2 0AA, UK
- Alex Ostrovsky
  - Department of Computer Science, Johns Hopkins Medical Institution, Baltimore, MD, 21224, USA
- Helena Rasche
  - Erasmus Medical Center, Rotterdam, Zuid-Holland, 3015 GD, Netherlands
- Mehmet Tekman
  - Division of Pharmacology and Toxicology, University of Freiburg, Freiburg im Breisgau, Baden-Württemberg, 79098, Germany
- Graeme Tyson
  - School of Life, Health & Chemical Sciences, The Open University, Milton Keynes, Buckinghamshire, MK7 6AA, UK
- Pavankumar Videm
  - Department of Computer Science, University of Freiburg, Freiburg im Breisgau, Baden-Württemberg, 79098, Germany
- Wendi Bacon
  - School of Life, Health & Chemical Sciences, The Open University, Milton Keynes, Buckinghamshire, MK7 6AA, UK
4
Wang Y, O'Connor K, Flores I, Berdahl CT, Urbanowicz RJ, Stevens R, Bauermeister JA, Gonzalez-Hernandez G. Mpox Discourse on Twitter by Sexual Minority Men and Gender-Diverse Individuals: Infodemiological Study Using BERTopic. JMIR Public Health Surveill 2024; 10:e59193. [PMID: 39137013 PMCID: PMC11350314 DOI: 10.2196/59193] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/05/2024] [Revised: 06/08/2024] [Accepted: 07/17/2024] [Indexed: 08/15/2024] Open
Abstract
BACKGROUND The mpox outbreak resulted in 32,063 cases and 58 deaths in the United States and 95,912 cases worldwide from May 2022 to March 2024 according to the US Centers for Disease Control and Prevention (CDC). Like other disease outbreaks (eg, HIV) with perceived community associations, mpox can create the risk of stigma, exacerbate homophobia, and potentially hinder health care access and social equity. However, the existing literature on mpox has limited representation of the perspective of sexual minority men and gender-diverse (SMMGD) individuals. OBJECTIVE To fill this gap, this study aimed to synthesize themes of discussions among SMMGD individuals and listen to SMMGD voices for identifying problems in current public health communication surrounding mpox to improve inclusivity, equity, and justice. METHODS We analyzed mpox-related posts (N=8688) posted between October 2020 and September 2022 by 2326 users who self-identified on Twitter/X as SMMGD and were geolocated in the United States. We applied BERTopic (a topic-modeling technique) on the tweets, validated the machine-generated topics through human labeling and annotations, and conducted content analysis of the tweets in each topic. Geographic analysis was performed on the size of the most prominent topic across US states in relation to the University of California, Los Angeles (UCLA) lesbian, gay, and bisexual (LGB) social climate index. RESULTS BERTopic identified 11 topics, which annotators labeled as mpox health activism (n=2590, 29.81%), mpox vaccination (n=2242, 25.81%), and adverse events (n=85, 0.98%); sarcasm, jokes, and emotional expressions (n=1220, 14.04%); COVID-19 and mpox (n=636, 7.32%); government or public health response (n=532, 6.12%); mpox symptoms (n=238, 2.74%); case reports (n=192, 2.21%); puns on the naming of the virus (ie, mpox; n=75, 0.86%); media publicity (n=59, 0.68%); and mpox in children (n=58, 0.67%). 
Spearman rank correlation indicated significant negative correlation (ρ=-0.322, P=.03) between the topic size of health activism and the UCLA LGB social climate index at the US state level. CONCLUSIONS Discussions among SMMGD individuals on mpox encompass both utilitarian (eg, vaccine access, case reports, and mpox symptoms) and emotionally charged (ie, promoting awareness, advocating against homophobia, misinformation/disinformation, and health stigma) themes. Mpox health activism is more prevalent in US states with lower LGB social acceptance, suggesting a resilient communicative pattern among SMMGD individuals in the face of public health oppression. Our method for social listening could facilitate future public health efforts, providing a cost-effective way to capture the perspective of impacted populations. This study illuminates SMMGD engagement with the mpox discourse, underscoring the need for more inclusive public health programming. Findings also highlight the social impact of mpox: health stigma. Our findings could inform interventions to optimize the delivery of informational and tangible health resources leveraging computational mixed-method analyses (eg, BERTopic) and big data.
Affiliation(s)
- Yunwen Wang
  - Department of Computational Biomedicine, Cedars-Sinai Medical Center, West Hollywood, CA, United States
  - William Allen White School of Journalism and Mass Communications, University of Kansas, Lawrence, KS, United States
- Karen O'Connor
  - Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, United States
- Ivan Flores
  - Department of Computational Biomedicine, Cedars-Sinai Medical Center, West Hollywood, CA, United States
- Carl T Berdahl
  - Departments of Medicine and Emergency Medicine, Cedars-Sinai Medical Center, West Hollywood, CA, United States
- Ryan J Urbanowicz
  - Department of Computational Biomedicine, Cedars-Sinai Medical Center, West Hollywood, CA, United States
- Robin Stevens
  - Annenberg School for Communication and Journalism, University of Southern California, Los Angeles, CA, United States
- José A Bauermeister
  - Department of Family and Community Health, School of Nursing, University of Pennsylvania, Philadelphia, PA, United States
5
Danzi F, Pacchiana R, Mafficini A, Scupoli MT, Scarpa A, Donadelli M, Fiore A. To metabolomics and beyond: a technological portfolio to investigate cancer metabolism. Signal Transduct Target Ther 2023; 8:137. [PMID: 36949046 PMCID: PMC10033890 DOI: 10.1038/s41392-023-01380-0] [Citation(s) in RCA: 79] [Impact Index Per Article: 39.5] [Received: 11/15/2022] [Revised: 02/08/2023] [Accepted: 02/15/2023] [Indexed: 03/24/2023] Open
Abstract
Tumour cells have exquisite flexibility in reprogramming their metabolism in order to support tumour initiation, progression, metastasis and resistance to therapies. These reprogrammed activities include a complete rewiring of the bioenergetic, biosynthetic and redox status to sustain the increased energetic demand of the cells. Over the last decades, the cancer metabolism field has seen an explosion of new biochemical technologies giving more tools than ever before to navigate this complexity. Within a cell or a tissue, the metabolites constitute the direct signature of the molecular phenotype and thus their profiling has concrete clinical applications in oncology. Metabolomics and fluxomics are key technological approaches that have revolutionized the field, enabling researchers to have both a qualitative and mechanistic model of the biochemical activities in cancer. Furthermore, the upgrade from bulk to single-cell analysis technologies provided an unprecedented opportunity to investigate cancer biology at cellular resolution, allowing an in-depth quantitative analysis of complex and heterogeneous diseases. More recently, the advent of functional genomic screening allowed the identification of molecular pathways, cellular processes, biomarkers and novel therapeutic targets that, in concert with other technologies, allow patient stratification and identification of new treatment regimens. This review is intended to be a guide to cancer metabolism for researchers, highlighting current and emerging technologies, emphasizing advantages, disadvantages and applications with the potential of leading the development of innovative anti-cancer therapies.
Affiliation(s)
- Federica Danzi
  - Department of Neurosciences, Biomedicine and Movement Sciences, Section of Biochemistry, University of Verona, Verona, Italy
- Raffaella Pacchiana
  - Department of Neurosciences, Biomedicine and Movement Sciences, Section of Biochemistry, University of Verona, Verona, Italy
- Andrea Mafficini
  - Department of Diagnostics and Public Health, University of Verona, Verona, Italy
- Maria T Scupoli
  - Department of Neurosciences, Biomedicine and Movement Sciences, Biology and Genetics Section, University of Verona, Verona, Italy
- Aldo Scarpa
  - Department of Diagnostics and Public Health, University of Verona, Verona, Italy
  - ARC-NET Research Centre, University and Hospital Trust of Verona, Verona, Italy
- Massimo Donadelli
  - Department of Neurosciences, Biomedicine and Movement Sciences, Section of Biochemistry, University of Verona, Verona, Italy
- Alessandra Fiore
  - Department of Neurosciences, Biomedicine and Movement Sciences, Section of Biochemistry, University of Verona, Verona, Italy
6
Niu YN, Roberts EG, Denisko D, Hoffman MM. Assessing and assuring interoperability of a genomics file format. Bioinformatics 2022; 38:3327-3336. [PMID: 35575355 PMCID: PMC9237710 DOI: 10.1093/bioinformatics/btac327] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 01/07/2022] [Revised: 03/30/2022] [Accepted: 05/11/2022] [Indexed: 12/01/2022] Open
Abstract
Motivation Bioinformatics software tools operate largely through the use of specialized genomics file formats. Often these formats lack formal specification, making it difficult or impossible for the creators of these tools to robustly test them for correct handling of input and output. This causes problems in interoperability between different tools that, at best, wastes time and frustrates users. At worst, interoperability issues could lead to undetected errors in scientific results. Results We developed a new verification system, Acidbio, which tests for correct behavior in bioinformatics software packages. We crafted tests to unify correct behavior when tools encounter various edge cases—potentially unexpected inputs that exemplify the limits of the format. To analyze the performance of existing software, we tested the input validation of 80 Bioconda packages that parsed the Browser Extensible Data (BED) format. We also used a fuzzing approach to automatically perform additional testing. Of 80 software packages examined, 75 achieved less than 70% correctness on our test suite. We categorized multiple root causes for the poor performance of different types of software. Fuzzing detected other errors that the manually designed test suite could not. We also created a badge system that developers can use to indicate more precisely which BED variants their software accepts and to advertise the software’s performance on the test suite. Availability and implementation Acidbio is available at https://github.com/hoffmangroup/acidbio. Supplementary information Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Yi Nian Niu
  - Princess Margaret Cancer Centre University Health Network, Toronto, ON, M5G 2C1, Canada
- Eric G Roberts
  - Princess Margaret Cancer Centre University Health Network, Toronto, ON, M5G 2C1, Canada
- Danielle Denisko
  - Princess Margaret Cancer Centre University Health Network, Toronto, ON, M5G 2C1, Canada
  - Department of Medical Biophysics, University of Toronto, Toronto, ON, M5G 1L7, Canada
- Michael M Hoffman
  - Princess Margaret Cancer Centre University Health Network, Toronto, ON, M5G 2C1, Canada
  - Department of Medical Biophysics, University of Toronto, Toronto, ON, M5G 1L7, Canada
  - Department of Computer Science, University of Toronto, Toronto, ON, M5S 2E4, Canada
  - Vector Institute, Toronto, ON, M5G 1M1, Canada
7
Noor A. Improving bioinformatics software quality through incorporation of software engineering practices. PeerJ Comput Sci 2022; 8:e839. [PMID: 35111923 PMCID: PMC8771759 DOI: 10.7717/peerj-cs.839] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 06/01/2021] [Accepted: 12/13/2021] [Indexed: 06/14/2023]
Abstract
BACKGROUND Bioinformatics software is developed for collecting, analyzing, integrating, and interpreting life science datasets that are often enormous. Bioinformatics engineers often lack the software engineering skills necessary for developing robust, maintainable, reusable software. This study presents review and discussion of the findings and efforts made to improve the quality of bioinformatics software. METHODOLOGY A systematic review was conducted of related literature that identifies core software engineering concepts for improving bioinformatics software development: requirements gathering, documentation, testing, and integration. The findings are presented with the aim of illuminating trends within the research that could lead to viable solutions to the struggles faced by bioinformatics engineers when developing scientific software. RESULTS The findings suggest that bioinformatics engineers could significantly benefit from the incorporation of software engineering principles into their development efforts. This leads to suggestion of both cultural changes within bioinformatics research communities as well as adoption of software engineering disciplines into the formal education of bioinformatics engineers. Open management of scientific bioinformatics development projects can result in improved software quality through collaboration amongst both bioinformatics engineers and software engineers. CONCLUSIONS While strides have been made both in identification and solution of issues of particular import to bioinformatics software development, there is still room for improvement in terms of shifts in both the formal education of bioinformatics engineers as well as the culture and approaches of managing scientific bioinformatics research and development efforts.
8
Veltri P. Guest Editorial Innovative Data Analysis Methods for Biomedicine. IEEE J Biomed Health Inform 2021. [DOI: 10.1109/jbhi.2021.3116336] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/07/2022]
9
Wratten L, Wilm A, Göke J. Reproducible, scalable, and shareable analysis pipelines with bioinformatics workflow managers. Nat Methods 2021; 18:1161-1168. [PMID: 34556866 DOI: 10.1038/s41592-021-01254-9] [Citation(s) in RCA: 70] [Impact Index Per Article: 17.5] [Received: 10/20/2020] [Accepted: 07/29/2021] [Indexed: 02/08/2023]
Abstract
The rapid growth of high-throughput technologies has transformed biomedical research. With the increasing amount and complexity of data, scalability and reproducibility have become essential not just for experiments, but also for computational analysis. However, transforming data into information involves running a large number of tools, optimizing parameters, and integrating dynamically changing reference data. Workflow managers were developed in response to such challenges. They simplify pipeline development, optimize resource usage, handle software installation and versions, and run on different compute platforms, enabling workflow portability and sharing. In this Perspective, we highlight key features of workflow managers, compare commonly used approaches for bioinformatics workflows, and provide a guide for computational and noncomputational users. We outline community-curated pipeline initiatives that enable novice and experienced users to perform complex, best-practice analyses without having to manually assemble workflows. In sum, we illustrate how workflow managers contribute to making computational analysis in biomedical research shareable, scalable, and reproducible.
Affiliation(s)
- Jonathan Göke
  - Genome Institute of Singapore, Singapore, Singapore
10
Moreira Souza A, Weigert RDAS, Machado de Sousa EP, Tassoni Andrietta L, Ventura RV. Practical implications of using non-relational databases to store large genomic data files and novel phenotypes. J Anim Breed Genet 2021; 139:100-112. [PMID: 34459042 DOI: 10.1111/jbg.12644] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/24/2021] [Revised: 07/14/2021] [Accepted: 08/08/2021] [Indexed: 11/30/2022]
Abstract
The objective of our study was to provide practical directions on the storage of genomic information and novel phenotypes (treated here as unstructured data) using a non-relational database. The MongoDB technology was assessed for this purpose, enabling frequent data transactions involving numerous individuals under genetic evaluation. Our study investigated different genomic (Illumina Final Report, PLINK, 0125, FASTQ, and VCF formats) and phenotypic (including media files) information, using both real and simulated datasets. Advantages of our centralized database concept include the sublinear running time for queries after increasing the number of samples/markers exponentially, in addition to the comprehensive management of distinct data formats while searching for specific genomic regions. A comparison of our non-relational and generic solution, with an existing relational approach (developed for tabular data types using 2 bits to store genotypes), showed reduced importing time to handle 50M SNPs (PLINK format) achieved by the relational schema. Our experimental results also reinforce that data conversion is a costly step required to manage genomic data into both relational and non-relational database systems, and therefore, must be carefully treated for large applications.
Affiliation(s)
- André Moreira Souza
  - Institute of Mathematics and Computer Sciences, University of Sao Paulo, Sao Carlos, Sao Paulo, Brazil
- Lucas Tassoni Andrietta
  - Department of Animal Nutrition and Production, School of Veterinary Medicine and Animal Science, University of Sao Paulo, Pirassununga, Sao Paulo, Brazil
- Ricardo Vieira Ventura
  - Department of Animal Nutrition and Production, School of Veterinary Medicine and Animal Science, University of Sao Paulo, Pirassununga, Sao Paulo, Brazil
11
Hanussek M, Bartusch F, Krüger J. Performance and scaling behavior of bioinformatic applications in virtualization environments to create awareness for the efficient use of compute resources. PLoS Comput Biol 2021; 17:e1009244. [PMID: 34283824 PMCID: PMC8323933 DOI: 10.1371/journal.pcbi.1009244] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/04/2020] [Revised: 07/30/2021] [Accepted: 07/02/2021] [Indexed: 11/19/2022] Open
Abstract
The large amount of biological data currently available makes it necessary to use tools and applications based on sophisticated and efficient algorithms developed in the area of bioinformatics. Further, access to high-performance computing resources is necessary to achieve results in reasonable time. To speed up applications and utilize available compute resources as efficiently as possible, software developers make use of parallelization mechanisms such as multithreading. Many of the available tools in bioinformatics offer multithreading capabilities, but more compute power is not always helpful. In this study we investigated the behavior of well-known applications in bioinformatics with our benchmarking tool suite BOOTABLE, examining their performance in terms of scaling, different virtual environments and different datasets. The tool suite includes the tools BBMap, Bowtie2, BWA, Velvet, IDBA, SPAdes, Clustal Omega, MAFFT, SINA and GROMACS. In addition, we added an application using the machine learning framework TensorFlow. Machine learning is not directly part of bioinformatics but is applied to many biological problems, especially in the context of medical images (X-ray photographs). The mentioned tools were analyzed in two different virtual environments: a virtual machine environment based on the OpenStack cloud software and a Docker environment. The measured performance values were compared to a bare-metal setup and to each other. The study reveals that the virtual environments used produce an overhead in the range of seven to twenty-five percent compared to the bare-metal environment. The scaling measurements showed that some of the analyzed tools do not benefit from using larger amounts of computing resources, whereas others showed an almost linear scaling behavior. The findings of this study have been generalized as far as possible and should help users to find the best amount of resources for their analysis. Further, the results provide valuable information for resource providers to handle their resources as efficiently as possible and raise the user community's awareness of the efficient usage of computing resources.
Affiliation(s)
- Maximilian Hanussek
  - Group of Applied Bioinformatics, University of Tübingen, Tübingen, Germany
  - High Performance and Cloud Computing Group ZDV, University of Tübingen, Tübingen, Germany
- Felix Bartusch
  - Group of Applied Bioinformatics, University of Tübingen, Tübingen, Germany
  - High Performance and Cloud Computing Group ZDV, University of Tübingen, Tübingen, Germany
- Jens Krüger
  - Group of Applied Bioinformatics, University of Tübingen, Tübingen, Germany
12
Biomedical Image Classification in a Big Data Architecture Using Machine Learning Algorithms. Journal of Healthcare Engineering 2021; 2021:9998819. [PMID: 34122785 PMCID: PMC8191587 DOI: 10.1155/2021/9998819] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Received: 03/06/2021] [Revised: 05/09/2021] [Accepted: 05/25/2021] [Indexed: 12/13/2022]
Abstract
In modern-day medicine, medical imaging has undergone immense advancements and can capture several biomedical images from patients. In the wake of this, these images can be used to train an intelligent system that assists medical specialists in determining the different diseases that can be identified from analyzing them. Classification plays an important role in this regard; it enhances the grouping of these images into categories of diseases and optimizes the next step of a computer-aided diagnosis system. The concept of classification in machine learning deals with the problem of identifying to which of a set of categories a new observation belongs, on the basis of a training set of observations whose category membership is known. The goal of this paper is to perform a survey of classification algorithms for biomedical images. The paper then describes how these algorithms can be applied to a big data architecture by using the Spark framework. This paper further proposes the classification workflow based on the observed optimal algorithms, Support Vector Machine and Deep Learning, as drawn from the literature. The algorithm for the feature extraction step during the classification process is presented and can be customized in all other steps of the proposed classification workflow.
|
13
|
Herrgårdh T, Madai VI, Kelleher JD, Magnusson R, Gustafsson M, Milani L, Gennemark P, Cedersund G. Hybrid modelling for stroke care: Review and suggestions of new approaches for risk assessment and simulation of scenarios. Neuroimage Clin 2021; 31:102694. [PMID: 34000646 PMCID: PMC8141769 DOI: 10.1016/j.nicl.2021.102694] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 12/04/2020] [Revised: 04/27/2021] [Accepted: 05/04/2021] [Indexed: 11/28/2022]
Abstract
Stroke is an example of a complex and multi-factorial disease involving multiple organs, timescales, and disease mechanisms. To deal with this complexity, and to realize Precision Medicine of stroke, mathematical models are needed. Such approaches include: 1) machine learning, 2) bioinformatic network models, and 3) mechanistic models. Since these three approaches have complementary strengths and weaknesses, a hybrid modelling approach combining them would be the most beneficial. However, no concrete approach ready to be implemented for a specific disease has been presented to date. In this paper, we both review the strengths and weaknesses of the three approaches, and propose a roadmap for hybrid modelling in the case of stroke care. We focus on two main tasks needed for the clinical setting: a) For stroke risk calculation, we propose a new two-step approach, where non-linear mixed effects models and bioinformatic network models yield biomarkers which are used as input to a machine learning model and b) For simulation of care scenarios, we propose a new four-step approach, which revolves around iterations between simulations of the mechanistic models and imputations of non-modelled or non-measured variables. We illustrate and discuss the different approaches in the context of Precision Medicine for stroke.
Affiliation(s)
- Tilda Herrgårdh
- Integrative Systems Biology, Department of Biomedical Engineering, Linköping University, 58185 Linköping, Sweden
| | - Vince I Madai
- Charité Lab for Artificial Intelligence in Medicine - CLAIM, Charité University Medicine Berlin, Germany; School of Computing and Digital Technology, Faculty of Computing, Engineering and the Built Environment, Birmingham City University, Birmingham, UK
| | - John D Kelleher
- ADAPT Research Centre, Technological University Dublin, Ireland
| | - Rasmus Magnusson
- Bioinformatics, Department of Physics, Chemistry and Biology, Linköping University, Sweden
| | - Mika Gustafsson
- Bioinformatics, Department of Physics, Chemistry and Biology, Linköping University, Sweden
| | - Lili Milani
- Estonian Genome Center, Institute of Genomics, University of Tartu, Tartu, Estonia
| | - Peter Gennemark
- Integrative Systems Biology, Department of Biomedical Engineering, Linköping University, 58185 Linköping, Sweden; Drug Metabolism and Pharmacokinetics, Early Cardiovascular, Renal and Metabolism, BioPharmaceuticals R&D, AstraZeneca, Gothenburg, Sweden
| | - Gunnar Cedersund
- Integrative Systems Biology, Department of Biomedical Engineering, Linköping University, 58185 Linköping, Sweden.
|
14
|
Mora-Márquez F, Vázquez-Poletti JL, López de Heredia U. NGScloud2: optimized bioinformatic analysis using Amazon Web Services. PeerJ 2021; 9:e11237. [PMID: 33959420 PMCID: PMC8054753 DOI: 10.7717/peerj.11237] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 12/08/2020] [Accepted: 03/17/2021] [Indexed: 12/13/2022] Open
Abstract
Background: NGScloud was a bioinformatic system developed to perform de novo RNAseq analysis of non-model species by exploiting the cloud computing capabilities of Amazon Web Services. The rapid changes in the way this cloud computing service operates, along with the continuous release of novel bioinformatic applications to analyze next generation sequencing data, have made the software obsolete. NGScloud2 is an enhanced and expanded version of NGScloud that provides access to ad hoc cloud computing infrastructure, scaled according to the complexity of each experiment. Methods: NGScloud2 presents major technical improvements, such as the possibility of running spot instances and the most up-to-date AWS instance types, which can lead to significant cost savings. Compared to its initial implementation, this improved version updates and includes common applications for de novo RNAseq analysis, and incorporates tools to operate workflows for bioinformatic analysis of reference-based RNAseq, RADseq and functional annotation. NGScloud2 optimizes access to Amazon’s large computing infrastructures to easily run popular bioinformatic software applications that are otherwise inaccessible to non-specialized users lacking suitable hardware infrastructures. Results: The correct performance of the pipelines for de novo RNAseq, reference-based RNAseq, RADseq and functional annotation was tested with real experimental data, providing workflow performance estimates and tips to make optimal use of NGScloud2. Further, we provide a qualitative comparison of NGScloud2 vs. the Galaxy framework. NGScloud2 code and instructions for software installation and use are available at https://github.com/GGFHF/NGScloud2. NGScloud2 includes a companion package, NGShelper, which contains Python utilities to post-process the output of the pipelines for downstream analysis, at https://github.com/GGFHF/NGShelper.
Affiliation(s)
- Fernando Mora-Márquez
- GI Sistemas Naturales e Historia Forestal, Dpto. Sistemas y Recursos Naturales, ETSI Montes, Forestal y del Medio Natural, Universidad Politécnica de Madrid, Madrid, Spain
| | - José Luis Vázquez-Poletti
- GI Arquitectura de Sistemas Distribuidos, Dpto. de Arquitectura de Ordenadores y Automática, Facultad de Informática, Universidad Complutense de Madrid, Madrid, Spain
| | - Unai López de Heredia
- GI Sistemas Naturales e Historia Forestal, Dpto. Sistemas y Recursos Naturales, ETSI Montes, Forestal y del Medio Natural, Universidad Politécnica de Madrid, Madrid, Spain
|
15
|
Chung SS, Ng JCF, Laddach A, Thomas NSB, Fraternali F. Short loop functional commonality identified in leukaemia proteome highlights crucial protein sub-networks. NAR Genom Bioinform 2021; 3:lqab010. [PMID: 33709075 PMCID: PMC7936661 DOI: 10.1093/nargab/lqab010] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/13/2020] [Revised: 12/19/2020] [Accepted: 01/26/2021] [Indexed: 11/13/2022] Open
Abstract
Direct drug targeting of mutated proteins in cancer is not always possible and efficacy can be nullified by compensating protein-protein interactions (PPIs). Here, we establish an in silico pipeline to identify specific PPI sub-networks containing mutated proteins as potential targets, which we apply to mutation data of four different leukaemias. Our method is based on extracting cyclic interactions of a small number of proteins topologically and functionally linked in the Protein-Protein Interaction Network (PPIN), which we call short loop network motifs (SLM). We uncover a new property of PPINs named 'short loop commonality' to measure indirect PPIs occurring via common SLM interactions. This detects 'modules' of PPI networks enriched with annotated biological functions of proteins containing mutation hotspots, exemplified by FLT3 and other receptor tyrosine kinase proteins. We further identify functional dependency or mutual exclusivity of short loop commonality pairs in large-scale cellular CRISPR-Cas9 knockout screening data. Our pipeline provides a new strategy for identifying new therapeutic targets for drug discovery.
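The idea of short cyclic interactions described in this abstract can be illustrated with a minimal sketch, restricted to the simplest case of three-protein loops in an undirected interaction graph. Everything here is an assumption made for illustration: the edge list, the placeholder protein names P1 to P4, and the function names are hypothetical, and the per-pair shared-loop count is only a simplified stand-in, not the published short loop commonality measure or the authors' pipeline.

```python
from collections import Counter
from itertools import combinations


def build_adjacency(edges):
    """Build an undirected adjacency map from (protein, protein) pairs."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def three_protein_loops(edges):
    """Return every 3-node cycle as a frozenset of protein names."""
    adjacency = build_adjacency(edges)
    loops = set()
    for a, b in edges:
        if a == b:
            continue  # ignore self-interactions
        # Any common neighbour of a and b closes a triangle a-b-c.
        for c in adjacency[a] & adjacency[b]:
            if c != a and c != b:
                loops.add(frozenset((a, b, c)))
    return loops


def shared_loop_counts(loops):
    """Count, for each protein pair, how many loops contain both proteins."""
    counts = Counter()
    for loop in loops:
        for pair in combinations(sorted(loop), 2):
            counts[pair] += 1
    return counts


# Hypothetical toy network; only FLT3 is a name taken from the abstract.
toy_edges = [
    ("FLT3", "P1"), ("P1", "P2"), ("P2", "FLT3"),
    ("P2", "P3"), ("P3", "P1"), ("P3", "P4"),
]
toy_loops = three_protein_loops(toy_edges)
toy_counts = shared_loop_counts(toy_loops)
```

In this toy network the pair (P1, P2) sits in both loops, which is the kind of indirect relationship a loop-based measure is meant to surface; longer loops would need a bounded depth-first search rather than the neighbour intersection used here.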
Affiliation(s)
- Sun Sook Chung
- Department of Haematological Medicine, King's College London, London, SE5 9NU, UK
| | - Joseph C F Ng
- Randall Centre for Cell and Molecular Biophysics, King's College London, London, SE1 1UL, UK
| | - Anna Laddach
- Randall Centre for Cell and Molecular Biophysics, King's College London, London, SE1 1UL, UK
| | - N Shaun B Thomas
- Department of Haematological Medicine, King's College London, London, SE5 9NU, UK
| | - Franca Fraternali
- Randall Centre for Cell and Molecular Biophysics, King's College London, London, SE1 1UL, UK
|
16
|
Timonina D, Sharapova Y, Švedas V, Suplatov D. Bioinformatic analysis of subfamily-specific regions in 3D-structures of homologs to study functional diversity and conformational plasticity in protein superfamilies. Comput Struct Biotechnol J 2021; 19:1302-1311. [PMID: 33738079 PMCID: PMC7933735 DOI: 10.1016/j.csbj.2021.02.005] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 11/08/2020] [Revised: 02/08/2021] [Accepted: 02/09/2021] [Indexed: 02/07/2023] Open
Abstract
Local 3D-structural differences in homologous proteins contribute to functional diversity observed in a superfamily, but so far received little attention as bioinformatic analysis was usually carried out at the level of amino acid sequences. We have developed Zebra3D - the first-of-its-kind bioinformatic software for systematic analysis of 3D-alignments of protein families using machine learning. The new tool identifies subfamily-specific regions (SSRs) - patterns of local 3D-structure (i.e. single residues, loops, or secondary structure fragments) that are spatially equivalent within families/subfamilies, but are different among them, and thus can be associated with functional diversity and function-related conformational plasticity. Bioinformatic analysis of protein superfamilies by Zebra3D can be used to study 3D-determinants of catalytic activity and specific accommodation of ligands, help to prepare focused libraries for directed evolution or assist development of chimeric enzymes with novel properties by exchange of equivalent regions between homologs, and to characterize plasticity in binding sites. A companion Mustguseal web-server is available to automatically construct a 3D-alignment of functionally diverse proteins, thus reducing the minimal input required to operate Zebra3D to a single PDB code. The Zebra3D + Mustguseal combined approach provides the opportunity to systematically explore the value of SSRs in superfamilies and to use this information for protein design and drug discovery. The software is available open-access at https://biokinet.belozersky.msu.ru/Zebra3D.
Affiliation(s)
- Daria Timonina
- Lomonosov Moscow State University, Faculty of Bioengineering and Bioinformatics, Lenin Hills 1-73, Moscow 119234, Russia
| | - Yana Sharapova
- Lomonosov Moscow State University, Faculty of Bioengineering and Bioinformatics, Lenin Hills 1-73, Moscow 119234, Russia
- Lomonosov Moscow State University, Belozersky Institute of Physicochemical Biology, Lenin Hills 1-73, Moscow 119234, Russia
| | - Vytas Švedas
- Lomonosov Moscow State University, Faculty of Bioengineering and Bioinformatics, Lenin Hills 1-73, Moscow 119234, Russia
- Lomonosov Moscow State University, Belozersky Institute of Physicochemical Biology, Lenin Hills 1-73, Moscow 119234, Russia
| | - Dmitry Suplatov
- Lomonosov Moscow State University, Belozersky Institute of Physicochemical Biology, Lenin Hills 1-73, Moscow 119234, Russia
- Corresponding author.
|
17
|
Mora-Márquez F, Vázquez-Poletti JL, Chano V, Collada C, Soto Á, de Heredia UL. Hardware Performance Evaluation of De novo Transcriptome Assembly Software in Amazon Elastic Compute Cloud. Curr Bioinform 2020. [DOI: 10.2174/1574893615666191219095817] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Indexed: 11/22/2022]
Abstract
Background: Bioinformatics software for RNA-seq analysis has high computational requirements in terms of the number of CPUs, RAM size, and processor characteristics. Specifically, de novo transcriptome assembly demands large computational infrastructure due to the massive data size and the complexity of the algorithms employed. Comparative studies on the quality of the transcriptomes yielded by de novo assemblers have been published previously; they lack, however, a hardware efficiency-oriented approach to help select the assembly hardware platform in a cost-efficient way. Objective: We tested the performance of two popular de novo transcriptome assemblers, Trinity and SOAPdenovo-Trans (SDNT), in terms of cost-efficiency and quality to assess limitations, and provided troubleshooting and guidelines to run transcriptome assemblies efficiently. Methods: We built virtual machines with different hardware characteristics (CPU number, RAM size) in the Amazon Elastic Compute Cloud of Amazon Web Services. Using simulated and real data sets, we measured the elapsed time, cost, CPU percentage and output size of small and large data set assemblies. Results: For small data sets, SDNT outperformed Trinity by an order of magnitude, significantly reducing the duration and cost of the assembly. For large data sets, Trinity performed better than SDNT. Both assemblers provide good-quality transcriptomes. Conclusion: The selection of the optimal transcriptome assembler and the provision of computational resources depend on the combined effect of the size and complexity of RNA-seq experiments.
Affiliation(s)
- Fernando Mora-Márquez
- GI Sistemas Naturales e Historia Forestal, Dpto. Sistemas y Recursos Naturales, ETSI Montes, Forestal y del Medio Natural, Universidad Politecnica de Madrid, Ciudad Universitaria, 28040 Madrid, Spain
| | - José Luis Vázquez-Poletti
- GI Arquitectura de Sistemas Distribuidos, Dpto. Arquitectura de Computadores y Automatica, Facultad de Informatica, Universidad Complutense de Madrid, Ciudad Universitaria, 28040 Madrid, Spain
| | - Víctor Chano
- GI Sistemas Naturales e Historia Forestal, Dpto. Sistemas y Recursos Naturales, ETSI Montes, Forestal y del Medio Natural, Universidad Politecnica de Madrid, Ciudad Universitaria, 28040 Madrid, Spain
| | - Carmen Collada
- GI Sistemas Naturales e Historia Forestal, Dpto. Sistemas y Recursos Naturales, ETSI Montes, Forestal y del Medio Natural, Universidad Politecnica de Madrid, Ciudad Universitaria, 28040 Madrid, Spain
| | - Álvaro Soto
- GI Sistemas Naturales e Historia Forestal, Dpto. Sistemas y Recursos Naturales, ETSI Montes, Forestal y del Medio Natural, Universidad Politecnica de Madrid, Ciudad Universitaria, 28040 Madrid, Spain
| | - Unai López de Heredia
- GI Sistemas Naturales e Historia Forestal, Dpto. Sistemas y Recursos Naturales, ETSI Montes, Forestal y del Medio Natural, Universidad Politecnica de Madrid, Ciudad Universitaria, 28040 Madrid, Spain
|
18
|
Maués JHDS, Moreira-Nunes CDFA, Burbano RMR. Computational Identification and Characterization of New microRNAs in Human Platelets Stored in a Blood Bank. Biomolecules 2020; 10:biom10081173. [PMID: 32806499 PMCID: PMC7464399 DOI: 10.3390/biom10081173] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 05/15/2020] [Revised: 08/05/2020] [Accepted: 08/06/2020] [Indexed: 12/15/2022] Open
Abstract
Platelet concentrate (PC) transfusions are widely used to save the lives of patients who experience acute blood loss. MicroRNAs (miRNAs) comprise a class of molecules with a biological role which is relevant to the understanding of storage lesions in blood banks. We used a new approach to identify miRNAs in normal human platelet sRNA-Seq data from the GSE61856 repository. We identified a comprehensive miRNA expression profile, where we detected 20 of these transcripts potentially expressed in PCs stored for seven days, which had their expression levels analyzed with simulations of computational biology. Our results identified a new collection of miRNAs (miR-486-5p, miR-92a-3p, miR-103a-3p, miR-151a-3p, miR-181a-5p, and miR-221-3p) that showed a sensitivity expression pattern due to biological platelet changes during storage, confirmed by additional quantitative real-time polymerase chain reaction (qPCR) validation on 100 PC units from 500 healthy donors. We also identified that these miRNAs could transfer regulatory information on platelets, such as members of the let-7 family, by regulating the YOD1 gene, which is a deubiquitinating enzyme highly expressed in platelet hyperactivity. Our results also showed that the target genes of these miRNAs play important roles in signaling pathways, cell cycle, stress response, platelet activation and cancer. In summary, the miRNAs described in this study, have a promising application in transfusion medicine as potential biomarkers to also measure the quality and viability of the PC during storage in blood banks.
Affiliation(s)
- Jersey Heitor da Silva Maués
- Laboratory of Human Cytogenetics, Institute of Biological Sciences, Federal University of Pará, Belém, PA 66075-110, Brazil;
- Laboratory of Molecular Biology, Ophir Loyola Hospital, Belém, PA 66063-240, Brazil
- Correspondence: (J.H.d.S.M.); (C.d.F.A.M.-N.)
| | - Caroline de Fátima Aquino Moreira-Nunes
- Laboratory of Pharmacogenetics, Drug Research and Development Center (NPDM), Federal University of Ceará, Fortaleza, CE 60430-275, Brazil
- Correspondence: (J.H.d.S.M.); (C.d.F.A.M.-N.)
| | - Rommel Mário Rodriguez Burbano
- Laboratory of Human Cytogenetics, Institute of Biological Sciences, Federal University of Pará, Belém, PA 66075-110, Brazil;
- Laboratory of Molecular Biology, Ophir Loyola Hospital, Belém, PA 66063-240, Brazil
|
19
|
McLean C, Kujawinski EB. AutoTuner: High Fidelity and Robust Parameter Selection for Metabolomics Data Processing. Anal Chem 2020; 92:5724-5732. [PMID: 32212641 PMCID: PMC7310949 DOI: 10.1021/acs.analchem.9b04804] [Citation(s) in RCA: 32] [Impact Index Per Article: 6.4] [Indexed: 12/15/2022]
Abstract
Untargeted metabolomics experiments provide a snapshot of cellular metabolism but remain challenging to interpret due to the computational complexity involved in data processing and analysis. Prior to any interpretation, raw data must be processed to remove noise and to align mass-spectral peaks across samples. This step requires selection of dataset-specific parameters, as erroneous parameters can result in noise inflation. While several algorithms exist to automate parameter selection, each depends on gradient descent optimization functions. In contrast, our new parameter optimization algorithm, AutoTuner, obtains parameter estimates from raw data in a single step as opposed to many iterations. Here, we tested the accuracy and the run-time of AutoTuner in comparison to isotopologue parameter optimization (IPO), the most commonly used parameter selection tool, and compared the resulting parameters’ influence on the properties of feature tables after processing. We performed a Monte Carlo experiment to test the robustness of AutoTuner parameter selection and found that AutoTuner generated similar parameter estimates from random subsets of samples. We conclude that AutoTuner is a desirable alternative to existing tools, because it is scalable, highly robust, and very fast (∼100–1000× speed improvement from other algorithms going from days to minutes). AutoTuner is freely available as an R package through BioConductor.
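The Monte Carlo robustness check mentioned in this abstract can be pictured with a small sketch: draw many random subsets of samples, compute an estimate on each, and inspect how much the estimates vary. This is not AutoTuner (an R package) and does not call it; the estimator (a median), the data values, the subset size, and all names below are hypothetical stand-ins chosen only to show the resampling pattern.

```python
import random
import statistics


def subset_estimates(values, subset_size, n_trials, seed=0):
    """Estimate a parameter on many random subsets (without replacement)."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return [
        statistics.median(rng.sample(values, subset_size))
        for _ in range(n_trials)
    ]


# Hypothetical per-sample values standing in for a data-derived parameter.
per_sample_values = [9.8, 10.1, 10.4, 9.9, 10.0, 10.3, 9.7, 10.2, 10.1, 9.9]

estimates = subset_estimates(per_sample_values, subset_size=5, n_trials=200)
spread = max(estimates) - min(estimates)
print(f"estimates range over {spread:.2f} units across {len(estimates)} subsets")
```

A small spread relative to the full range of the data would suggest that the estimate does not depend strongly on which samples happened to be chosen.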
Affiliation(s)
- Craig McLean
- Department of Marine Chemistry and Geochemistry, Woods Hole Oceanographic Institution, Woods Hole, Massachusetts 02543, United States; MIT/WHOI Joint Program in Oceanography/Applied Ocean Science and Engineering, Department of Marine Chemistry and Geochemistry, Woods Hole Oceanographic Institution, Woods Hole, Massachusetts 02543, United States
| | - Elizabeth B Kujawinski
- Department of Marine Chemistry and Geochemistry, Woods Hole Oceanographic Institution, Woods Hole, Massachusetts 02543, United States
|
20
|
Azad RK, Shulaev V. Metabolomics technology and bioinformatics for precision medicine. Brief Bioinform 2019; 20:1957-1971. [PMID: 29304189 PMCID: PMC6954408 DOI: 10.1093/bib/bbx170] [Citation(s) in RCA: 102] [Impact Index Per Article: 17.0] [Received: 08/24/2017] [Revised: 11/29/2017] [Indexed: 12/14/2022] Open
Abstract
Precision medicine is rapidly emerging as a strategy to tailor medical treatment to a small group or even individual patients based on their genetics, environment and lifestyle. Precision medicine relies heavily on developments in systems biology and omics disciplines, including metabolomics. Combination of metabolomics with sophisticated bioinformatics analysis and mathematical modeling has an extreme power to provide a metabolic snapshot of the patient over the course of disease and treatment or classifying patients into subpopulations and subgroups requiring individual medical treatment. Although a powerful approach, metabolomics have certain limitations in technology and bioinformatics. We will review various aspects of metabolomics technology and bioinformatics, from data generation, bioinformatics analysis, data fusion and mathematical modeling to data management, in the context of precision medicine.
Affiliation(s)
| | - Vladimir Shulaev
- Corresponding author: Vladimir Shulaev, Department of Biological Sciences, BioDiscovery Institute, University of North Texas, Denton, TX 76210, USA. Tel.: 940-369-5368; Fax: 940-565-3821; E-mail:
|
21
|
Montemayor C, Brunker PAR, Keller MA. Banking with precision: transfusion medicine as a potential universal application in clinical genomics. Curr Opin Hematol 2019; 26:480-487. [PMID: 31490317 PMCID: PMC7302862 DOI: 10.1097/moh.0000000000000536] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.3] [Indexed: 12/19/2022]
Abstract
PURPOSE OF REVIEW To summarize the most recent scientific progress in transfusion medicine genomics and discuss its role within the broad genomic precision medicine model, with a focus on the unique computational and bioinformatic aspects of this emergent field. RECENT FINDINGS Recent publications continue to validate the feasibility of using next-generation sequencing (NGS) for blood group prediction with three distinct approaches: exome sequencing, whole genome sequencing, and PCR-based targeted NGS methods. The reported correlation of NGS with serologic and alternative genotyping methods ranges from 92 to 99%. NGS has demonstrated improved detection of weak antigens, structural changes, copy number variations, novel genomic variants, and microchimerism. Addition of a transfusion medicine interpretation to any clinically sequenced genome is proposed as a strategy to enhance the cost-effectiveness of precision genomic medicine. Interpretation of NGS in the blood group antigen context requires not only advanced immunohematology knowledge, but also specialized software and hardware resources, and a bioinformatics-trained workforce. SUMMARY Blood transfusions are a common inpatient procedure, making blood group genomics a promising facet of precision medicine research. Further efforts are needed to embrace transfusion bioinformatic challenges and evaluate its clinical utility.
Affiliation(s)
- Celina Montemayor
- Department of Transfusion Medicine, National Institutes of Health Clinical Center, Bethesda, MD
| | - Patricia A. R. Brunker
- Division of Transfusion Medicine, Department of Pathology, The Johns Hopkins Hospital, Baltimore, MD
- American Red Cross, Greater Chesapeake and Potomac Region, Baltimore, MD
|
22
|
Savosina PI, Stolbov LA, Druzhilovskiy DS, Filimonov DA, Nicklaus MC, Poroikov VV. [Discovering new antiretroviral compounds in "Big Data" chemical space of the SAVI library]. BIOMEDIT︠S︡INSKAI︠A︡ KHIMII︠A︡ 2019; 65:73-79. [PMID: 30950810 DOI: 10.18097/pbmc20196502073] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 11/23/2022]
Abstract
Despite significant advances in the application of highly active antiretroviral therapy, the development of new drugs for the treatment of HIV infection remains an important task because the existing drugs do not provide a complete cure, cause serious side effects and lead to the emergence of resistance. In 2015, a consortium of American and European scientists and specialists launched a project to create the SAVI (Synthetically Accessible Virtual Inventory) library. Its 2016 version, comprising over 283 million structures of new, easily synthesizable organic molecules, each annotated with a proposed synthetic route, was generated in silico for the purpose of searching for safer and more potent pharmacological substances. We have developed an algorithm for comparing large chemical databases (DB) based on the representation of structural formulas in SMILES codes, and evaluated the possibility of detecting new antiretroviral compounds in the SAVI database. After analyzing the intersection of SAVI with 97 million structures of the PubChem database, we found that only a small part of SAVI (~0.015%) is represented in PubChem, which indicates a significant novelty of this virtual library. However, among those structures, 632 compounds tested for anti-HIV activity were detected, 41 of which had the desired activity. Thus, our studies demonstrated for the first time that SAVI is a promising source for the search for new anti-HIV compounds.
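A database comparison of the kind described in this abstract can be sketched as a set intersection over structure strings. The sketch below is an assumption-laden illustration, not the authors' algorithm: it presumes every SMILES string has already been canonicalized by the same cheminformatics toolkit (otherwise identical molecules could be written differently and missed), and the tiny example lists and the function name `overlap` are hypothetical. The final lines only redo the abstract's arithmetic, treating "over 283 million" as exactly 283 million.

```python
def overlap(library_a, library_b):
    """Return structures present in both libraries and their share of library_a.

    Assumes both inputs hold canonical SMILES strings produced by the same
    canonicalization procedure, so string equality implies the same structure.
    """
    set_a = set(library_a)
    set_b = set(library_b)
    shared = set_a & set_b
    fraction = len(shared) / len(set_a) if set_a else 0.0
    return shared, fraction


# Hypothetical miniature libraries standing in for the two databases.
virtual_library = ["CCO", "c1ccccc1", "CC(=O)O", "CCN"]
reference_library = ["CCO", "CC(=O)O", "O"]

shared, fraction = overlap(virtual_library, reference_library)
print(f"{len(shared)} shared structures, {fraction:.1%} of the virtual library")

# Rough scale implied by the abstract: about 0.015% of 283 million structures.
# Integer arithmetic: 0.015% equals 15 parts per 100,000.
approx_shared = 283_000_000 * 15 // 100_000
print(f"approximately {approx_shared:,} structures in common")
```

At the scale quoted in the abstract, holding hundreds of millions of strings in one in-memory set may not be practical; sorting both files and merging them, or hashing into partitions, are common alternatives.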
Affiliation(s)
- P I Savosina
- Institute of Biomedical Chemistry, Moscow, Russia
| | - L A Stolbov
- Institute of Biomedical Chemistry, Moscow, Russia
| | | | | | - M C Nicklaus
- Chemical Biology Laboratory, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Frederick, Maryland, United States
| | - V V Poroikov
- Institute of Biomedical Chemistry, Moscow, Russia
|
23
|
García del Valle EP, Lagunes García G, Prieto Santamaría L, Zanin M, Menasalvas Ruiz E, Rodríguez-González A. Disease networks and their contribution to disease understanding: A review of their evolution, techniques and data sources. J Biomed Inform 2019; 94:103206. [DOI: 10.1016/j.jbi.2019.103206] [Citation(s) in RCA: 20] [Impact Index Per Article: 3.3] [Received: 01/08/2019] [Revised: 04/14/2019] [Accepted: 05/06/2019] [Indexed: 12/14/2022]
|
24
|
Russell PH, Johnson RL, Ananthan S, Harnke B, Carlson NE. A large-scale analysis of bioinformatics code on GitHub. PLoS One 2018; 13:e0205898. [PMID: 30379882 PMCID: PMC6209220 DOI: 10.1371/journal.pone.0205898] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.6] [Received: 06/01/2018] [Accepted: 10/03/2018] [Indexed: 11/19/2022] Open
Abstract
In recent years, the explosion of genomic data and bioinformatic tools has been accompanied by a growing conversation around reproducibility of results and usability of software. However, the actual state of the body of bioinformatics software remains largely unknown. The purpose of this paper is to investigate the state of source code in the bioinformatics community, specifically looking at relationships between code properties, development activity, developer communities, and software impact. To investigate these issues, we curated a list of 1,720 bioinformatics repositories on GitHub through their mention in peer-reviewed bioinformatics articles. Additionally, we included 23 high-profile repositories identified by their popularity in an online bioinformatics forum. We analyzed repository metadata, source code, development activity, and team dynamics using data made available publicly through the GitHub API, as well as article metadata. We found key relationships within our dataset, including: certain scientific topics are associated with more active code development and higher community interest in the repository; most of the code in the main dataset is written in dynamically typed languages, while most of the code in the high-profile set is statically typed; developer team size is associated with community engagement and high-profile repositories have larger teams; the proportion of female contributors decreases for high-profile repositories and with seniority level in author lists; and, multiple measures of project impact are associated with the simple variable of whether the code was modified at all after paper publication. In addition to providing the first large-scale analysis of bioinformatics code to our knowledge, our work will enable future analysis through publicly available data, code, and methods. Code to generate the dataset and reproduce the analysis is provided under the MIT license at https://github.com/pamelarussell/github-bioinformatics. Data are available at https://doi.org/10.17605/OSF.IO/UWHX8.
Affiliation(s)
- Pamela H. Russell
- Department of Biostatistics and Informatics, Colorado School of Public Health, Aurora, CO, United States of America
| | - Rachel L. Johnson
- Department of Biostatistics and Informatics, Colorado School of Public Health, Aurora, CO, United States of America
| | - Shreyas Ananthan
- High-Performance Algorithms and Complex Fluids, National Renewable Energy Laboratory, Golden, CO, United States of America
| | - Benjamin Harnke
- Health Sciences Library, University of Colorado Anschutz Medical Campus, Aurora, CO, United States of America
| | - Nichole E. Carlson
- Department of Biostatistics and Informatics, Colorado School of Public Health, Aurora, CO, United States of America
|
25
|
Epidemiology in wonderland: Big Data and precision medicine. Eur J Epidemiol 2018; 33:245-257. [PMID: 29623670 DOI: 10.1007/s10654-018-0385-9] [Citation(s) in RCA: 24] [Impact Index Per Article: 3.4] [Received: 03/20/2018] [Accepted: 03/30/2018] [Indexed: 10/17/2022]
Abstract
Big Data and precision medicine, two major contemporary challenges for epidemiology, are critically examined from two different angles. In Part 1, Big Data collected for research purposes (Big research Data) and Big Data used for research although collected for other primary purposes (Big secondary Data) are discussed in the light of the fundamental common requirement of data validity, prevailing over "bigness". Precision medicine is treated by developing the key point that high relative risks are, as a rule, required to make a variable or combination of variables suitable for prediction of disease occurrence, outcome or response to treatment; the commercial proliferation of allegedly predictive tests of unknown or poor validity is commented on. Part 2 proposes a "wise epidemiology" approach to: (a) choosing, in a context imprinted by Big Data and precision medicine, epidemiological research projects actually relevant to population health, (b) training epidemiologists,
|
26
|
D'Argenio V. The High-Throughput Analyses Era: Are We Ready for the Data Struggle? High Throughput 2018; 7:E8. [PMID: 29498666 PMCID: PMC5876534 DOI: 10.3390/ht7010008] [Citation(s) in RCA: 38] [Impact Index Per Article: 5.4] [Received: 12/28/2017] [Revised: 02/16/2018] [Accepted: 02/27/2018] [Indexed: 12/23/2022] Open
Abstract
Recent and rapid technological advances in molecular sciences have dramatically increased the ability to carry out high-throughput studies characterized by big data production. This, in turn, has had the negative effect of exposing a gap between data yield and data analysis. Indeed, big data management is becoming an increasingly important aspect of many fields of molecular research, including the study of human diseases. Now, the challenge is to identify, within the huge amount of data obtained, that which is of clinical relevance. In this context, issues related to data interpretation, sharing and storage need to be assessed and standardized. Once this is achieved, the integration of data from different -omic approaches will improve the diagnosis, monitoring and therapy of diseases by allowing the identification of novel, potentially actionable biomarkers in view of personalized medicine.
Affiliation(s)
- Valeria D'Argenio
- CEINGE-Biotecnologie Avanzate, via G. Salvatore 486, 80145 Naples, Italy.
- Department of Molecular Medicine and Medical Biotechnologies, University of Naples Federico II, via Pansini 5, 80131 Naples, Italy.
|