1
Salles MMA, Domingos FMCB. Towards the next generation of species delimitation methods: an overview of machine learning applications. Mol Phylogenet Evol 2025; 210:108368. [PMID: 40348350 DOI: 10.1016/j.ympev.2025.108368] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/21/2023] [Revised: 02/25/2025] [Accepted: 05/04/2025] [Indexed: 05/14/2025]
Abstract
Species delimitation is the process of distinguishing between populations of the same species and distinct species of a particular group of organisms. Various methods exist for inferring species limits, whether based on morphological, molecular, or other types of data. In the case of methods based on DNA sequences, most of them are rooted in the coalescent theory. However, coalescence-based models have limitations, for instance regarding complex evolutionary scenarios and large datasets. In this context, machine learning (ML) can be considered as a promising analytical tool, and provides an effective way to explore dataset structures when species-level divergences are hypothesized. In this review, we examine the use of ML in species delimitation and provide an overview and critical appraisal of existing workflows. We also provide simple explanations on how the main types of ML approaches operate, which should help uninitiated researchers and students interested in the field. Our review suggests that while current ML methods designed to infer species limits are analytically powerful, they also present specific limitations and should not be considered as definitive alternatives to coalescent methods for species delimitation. Future ML enterprises to delimit species should consider the constraints related to the use of simulated data, as in other model-based methods relying on simulations. Conversely, the flexibility of ML algorithms offers a significant advantage by enabling the analysis of diverse data types (e.g., genetic and phenotypic) and handling large datasets effectively. We also propose best practices for the use of ML methods in species delimitation, offering insights into potential future applications. We expect that the proposed guidelines will be useful for enhancing the accessibility, effectiveness, and objectivity of ML in species delimitation.
Affiliation(s)
- Matheus M A Salles
- Departamento de Zoologia, Universidade Federal do Paraná, Curitiba 81531-980, Brazil.
2
Llera-Oyola J, Pérez-Moraga R, Parras M, Rosón B. How to view the female reproductive tract through single-cell looking glasses. Am J Obstet Gynecol 2025; 232:S21-S43. [PMID: 40253081 DOI: 10.1016/j.ajog.2024.08.040] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/29/2023] [Revised: 07/04/2024] [Accepted: 08/24/2024] [Indexed: 04/21/2025]
Abstract
Single-cell technologies have emerged as an unprecedented tool for biologists and clinicians, allowing them to assess organs and tissues at the level of individual cells. In the field of women's reproductive biology, single-cell studies have provided insights into the cellular and molecular processes that regulate reproductive and obstetrical functions in health and disease. The knowledge that these studies generate is helping clinicians to improve the understanding and diagnosis of infertility-related issues or pregnancy complications and to find new avenues for their treatment. However, navigating the expansive landscape of this type of transcriptomic data analysis represents a pivotal challenge in current research. Single-cell RNA sequencing involves isolating cells into droplets and reverse transcribing their RNA into complementary DNA, with the contents of each droplet uniquely labeled by a barcode. Upon sequencing the complementary DNAs, the barcodes enable the reassignment of sequencing reads to individual droplets, facilitating the reconstruction of the cellular landscape of the sample obtained from a tissue or organ and beyond. Researchers, equipped with the metaphorical 'single-cell glasses,' must adequately choose from a plethora of strategies to dissect and interpret cellular information. The sophistication of the algorithms and of the decision-making process is often underestimated, resulting in artefactual or cumbersome interpretations. Computational biologists apply and innovate computational tools designed to process, model, and interpret expansive datasets. The ramifications of their work extend far beyond the realm of data processing; they give shape to the outcome of analyses, playing a pivotal role in drawing meaningful conclusions from the wealth of information garnered.
In this review, we describe the wide variety of approaches and analytical steps available with enough detail to gain a concise picture of what a complete examination of a single-cell dataset would be. We commence with a discussion on key points in experimental design, highlighting crucial questions one should consider. Following this, we delve into the various preprocessing and quality control steps essential for any single-cell dataset. The subsequent section offers a detailed guide on constructing a single-cell atlas, exploring nuances such as differential characteristics in visualization and clustering techniques, as well as strategies for assigning identity to cell populations through gene marker annotations. Moving beyond the creation of an atlas, we explore methods for investigating pathological conditions. This involves conducting cell population comparison tests between conditions and analyzing specific cell-to-cell communications and cellular differentiation trajectories in both health and disease scenarios. This work aims to furnish a newcomer researcher and/or clinician with essential guidelines to embark on a single-cell adventure without succumbing to common pitfalls. By bridging the gap between theory and practice, it facilitates the translation of single-cell technologies into clinically relevant applications. Throughout the manuscript, practical examples of its usage in women's reproductive health studies are provided. Various sections delve into specific clinical scenarios, demonstrating how these guidelines can be instrumental in unraveling the molecular landscapes of diseases and physiological processes related to women's reproduction.
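The barcode step described in this abstract (reads reassigned to their droplet of origin by a barcode) can be illustrated with a minimal sketch. It assumes, purely for illustration, that each read is a string whose first four characters are the droplet barcode; the function name, the barcode length, and the toy reads are hypothetical and not taken from the review.

```python
from collections import defaultdict

# Hypothetical barcode length; real droplet barcodes are longer.
BARCODE_LEN = 4


def group_reads_by_barcode(reads, barcode_len=BARCODE_LEN):
    """Reassign each read to its droplet using the leading barcode."""
    groups = defaultdict(list)
    for read in reads:
        barcode, cdna = read[:barcode_len], read[barcode_len:]
        groups[barcode].append(cdna)
    return dict(groups)


# Toy reads: barcode (first 4 bases) followed by a cDNA fragment.
reads = ["AAACGGTCA", "TTTGCCATG", "AAACTTAGC"]
by_droplet = group_reads_by_barcode(reads)
print(by_droplet)
```

Real pipelines also correct barcode sequencing errors and collapse duplicate molecules; those steps are omitted here.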
Affiliation(s)
- Jaime Llera-Oyola
- Carlos Simon Foundation, INCLIVA Health Research Institute, Valencia, Spain
- Raúl Pérez-Moraga
- Carlos Simon Foundation, INCLIVA Health Research Institute, Valencia, Spain; R&D Department, Igenomix, Valencia, Spain
- Marcos Parras
- Carlos Simon Foundation, INCLIVA Health Research Institute, Valencia, Spain
- Beatriz Rosón
- Carlos Simon Foundation, INCLIVA Health Research Institute, Valencia, Spain.
3
Alser M, Lawlor B, Abdill RJ, Waymost S, Ayyala R, Rajkumar N, LaPierre N, Brito J, Ribeiro-Dos-Santos AM, Almadhoun N, Sarwal V, Firtina C, Osinski T, Eskin E, Hu Q, Strong D, Kim BDBD, Abedalthagafi MS, Mutlu O, Mangul S. Packaging and containerization of computational methods. Nat Protoc 2024; 19:2529-2539. [PMID: 38565959 DOI: 10.1038/s41596-024-00986-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/29/2022] [Accepted: 02/12/2024] [Indexed: 04/04/2024]
Abstract
Methods for analyzing the full complement of a biomolecule type, e.g., proteomics or metabolomics, generate large amounts of complex data. The software tools used to analyze omics data have reshaped the landscape of modern biology and become an essential component of biomedical research. These tools are themselves quite complex and often require the installation of other supporting software, libraries and/or databases. A researcher may also be using multiple different tools that require different versions of the same supporting materials. The increasing dependence of biomedical scientists on these powerful tools creates a need for easier installation and greater usability. Packaging and containerization are different approaches to satisfy this need by delivering omics tools already wrapped in additional software that makes the tools easier to install and use. In this systematic review, we describe and compare the features of prominent packaging and containerization platforms. We outline the challenges, advantages and limitations of each approach and some of the most widely used platforms from the perspectives of users, software developers and system administrators. We also propose principles to make the distribution of omics software more sustainable and robust to increase the reproducibility of biomedical and life science research.
Affiliation(s)
- Mohammed Alser
- Department of Information Technology and Electrical Engineering, ETH Zürich, Zurich, Switzerland
- Brendan Lawlor
- Department of Computer Science, Munster Technological University, Cork, Ireland
- Department of Biological Sciences, Munster Technological University, Cork, Ireland
- Richard J Abdill
- Section of Genetic Medicine, Department of Medicine, University of Chicago, Chicago, IL, USA
- Sharon Waymost
- Department of Computer Science, University of California, Los Angeles, Los Angeles, CA, USA
- Ram Ayyala
- Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA
- Titus Family Department of Clinical Pharmacy, USC Alfred E. Mann School of Pharmacy and Pharmaceutical Sciences, University of Southern California, Los Angeles, CA, USA
- Neha Rajkumar
- Department of Bioengineering, University of California, Los Angeles, Los Angeles, CA, USA
- Nathan LaPierre
- Department of Computer Science, University of California, Los Angeles, Los Angeles, CA, USA
- Department of Human Genetics, University of Chicago, Chicago, IL, USA
- Jaqueline Brito
- Titus Family Department of Clinical Pharmacy, USC Alfred E. Mann School of Pharmacy and Pharmaceutical Sciences, University of Southern California, Los Angeles, CA, USA
- Nour Almadhoun
- Department of Information Technology and Electrical Engineering, ETH Zürich, Zurich, Switzerland
- Varuni Sarwal
- Department of Computer Science, University of California, Los Angeles, Los Angeles, CA, USA
- Can Firtina
- Department of Information Technology and Electrical Engineering, ETH Zürich, Zurich, Switzerland
- Tomasz Osinski
- Center for Advanced Research Computing, University of Southern California, Los Angeles, CA, USA
- Eleazar Eskin
- Department of Computer Science, University of California, Los Angeles, Los Angeles, CA, USA
- Department of Computational Medicine, University of California, Los Angeles, Los Angeles, CA, USA
- Department of Human Genetics, University of California, Los Angeles, CA, USA
- Qiyang Hu
- Office of Advanced Research Computing, University of California, Los Angeles, CA, USA
- Derek Strong
- Center for Advanced Research Computing, University of Southern California, Los Angeles, CA, USA
- Byoung-Do B D Kim
- Center for Advanced Research Computing, University of Southern California, Los Angeles, CA, USA
- Malak S Abedalthagafi
- Department of Pathology & Laboratory Medicine, Emory University Hospital, Atlanta, GA, USA
- King Salman Center for Disability Research, Riyadh, Saudi Arabia
- Onur Mutlu
- Department of Information Technology and Electrical Engineering, ETH Zürich, Zurich, Switzerland
- Serghei Mangul
- Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA.
- Titus Family Department of Clinical Pharmacy, USC Alfred E. Mann School of Pharmacy and Pharmaceutical Sciences, University of Southern California, Los Angeles, CA, USA.
4
Baykal PI, Łabaj PP, Markowetz F, Schriml LM, Stekhoven DJ, Mangul S, Beerenwinkel N. Genomic reproducibility in the bioinformatics era. Genome Biol 2024; 25:213. [PMID: 39123217 PMCID: PMC11312195 DOI: 10.1186/s13059-024-03343-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/14/2023] [Accepted: 07/23/2024] [Indexed: 08/12/2024] Open
Abstract
In biomedical research, validating a scientific discovery hinges on the reproducibility of its experimental results. However, in genomics, the definition and implementation of reproducibility remain imprecise. We argue that genomic reproducibility, defined as the ability of bioinformatics tools to maintain consistent results across technical replicates, is essential for advancing scientific knowledge and medical applications. Initially, we examine different interpretations of reproducibility in genomics to clarify terms. Subsequently, we discuss the impact of bioinformatics tools on genomic reproducibility and explore methods for evaluating these tools regarding their effectiveness in ensuring genomic reproducibility. Finally, we recommend best practices to improve genomic reproducibility.
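One way to make "consistent results across technical replicates" concrete is to compare the variant calls a tool produces on two replicates of the same sample. The sketch below uses the Jaccard index as the concordance measure; that choice, the function name, and the toy variant tuples are assumptions made for illustration and are not the evaluation method proposed by the authors.

```python
def jaccard(calls_a, calls_b):
    """Jaccard index of two call sets: |A & B| / |A | B|."""
    a, b = set(calls_a), set(calls_b)
    if not a and not b:
        return 1.0  # two empty call sets agree trivially
    return len(a & b) / len(a | b)


# Hypothetical variant calls (chromosome, position, ref, alt)
# from the same tool run on two technical replicates.
replicate_1 = {("chr1", 100, "A", "G"), ("chr1", 250, "C", "T"), ("chr2", 40, "G", "A")}
replicate_2 = {("chr1", 100, "A", "G"), ("chr2", 40, "G", "A"), ("chr3", 7, "T", "C")}

concordance = jaccard(replicate_1, replicate_2)
print(f"replicate concordance: {concordance:.2f}")
```

A value of 1.0 means the tool returned identical calls on both replicates; lower values indicate calls that appear in only one replicate.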
Affiliation(s)
- Pelin Icer Baykal
- Department of Biosystems Science and Engineering, ETH Zurich, 4058, Basel, Switzerland
- SIB Swiss Institute of Bioinformatics, 4058, Basel, Switzerland
- Paweł Piotr Łabaj
- Małopolska Centre of Biotechnology, Jagiellonian University, 30-387, Gronostajowa 7A, Krakow, Poland
- Department of Biotechnology, Boku University Vienna, Muthgasse 18, 1190, Vienna, Austria
- Florian Markowetz
- Cancer Research UK Cambridge Research Institute, Cambridge, CB2 0RE, UK
- Department of Oncology, University of Cambridge, Cambridge, CB2 2XZ, UK
- Lynn M Schriml
- Institute for Genome Sciences, University of Maryland School of Medicine, HSFIII, 670 W. Baltimore St, Baltimore, MD, 21201, USA
- Daniel J Stekhoven
- SIB Swiss Institute of Bioinformatics, 4058, Basel, Switzerland
- NEXUS Personalized Health Technologies, ETH Zurich, 8952, Zurich, Switzerland
- Serghei Mangul
- Titus Family Department of Clinical Pharmacy, USC Alfred E. Mann School of Pharmacy and Pharmaceutical Sciences, University of Southern California, 1540 Alcazar Street, Los Angeles, CA, 90033, USA.
- Department of Quantitative and Computational Biology, University of Southern California Dornsife College of Letters, Arts, and Sciences, Los Angeles, CA, 90089, USA.
- Niko Beerenwinkel
- Department of Biosystems Science and Engineering, ETH Zurich, 4058, Basel, Switzerland.
- SIB Swiss Institute of Bioinformatics, 4058, Basel, Switzerland.
5
Carraro C, Montgomery JV, Klimmt J, Paquet D, Schultze JL, Beyer MD. Tackling neurodegeneration in vitro with omics: a path towards new targets and drugs. Front Mol Neurosci 2024; 17:1414886. [PMID: 38952421 PMCID: PMC11215216 DOI: 10.3389/fnmol.2024.1414886] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/09/2024] [Accepted: 06/04/2024] [Indexed: 07/03/2024] Open
Abstract
Drug discovery is a generally inefficient and capital-intensive process. For neurodegenerative diseases (NDDs), the development of novel therapeutics is particularly urgent considering the long list of late-stage drug candidate failures. Although our knowledge on the pathogenic mechanisms driving neurodegeneration is growing, additional efforts are required to achieve a better and ultimately complete understanding of the pathophysiological underpinnings of NDDs. Beyond the etiology of NDDs being heterogeneous and multifactorial, this process is further complicated by the fact that current experimental models only partially recapitulate the major phenotypes observed in humans. In such a scenario, multi-omic approaches have the potential to accelerate the identification of new or repurposed drugs against a multitude of the underlying mechanisms driving NDDs. One major advantage for the implementation of multi-omic approaches in the drug discovery process is that these overarching tools are able to disentangle disease states and model perturbations through the comprehensive characterization of distinct molecular layers (i.e., genome, transcriptome, proteome) up to a single-cell resolution. Because of recent advances increasing their affordability and scalability, the use of omics technologies to drive drug discovery is nascent, but rapidly expanding in the neuroscience field. Combined with increasingly advanced in vitro models, which particularly benefited from the introduction of human iPSCs, multi-omics are shaping a new paradigm in drug discovery for NDDs, from disease characterization to therapeutics prediction and experimental screening. In this review, we discuss examples, main advantages and open challenges in the use of multi-omic approaches for the in vitro discovery of targets and therapies against NDDs.
Affiliation(s)
- Caterina Carraro
- Systems Medicine, Deutsches Zentrum für Neurodegenerative Erkrankungen e.V. (DZNE), Bonn, Germany
- Genomics and Immunoregulation, Life & Medical Sciences (LIMES) Institute, University of Bonn, Bonn, Germany
- Jessica V. Montgomery
- Systems Medicine, Deutsches Zentrum für Neurodegenerative Erkrankungen e.V. (DZNE), Bonn, Germany
- Julien Klimmt
- Institute for Stroke and Dementia Research (ISD), University Hospital, LMU Munich, Munich, Germany
- Dominik Paquet
- Institute for Stroke and Dementia Research (ISD), University Hospital, LMU Munich, Munich, Germany
- Munich Cluster for Systems Neurology (SyNergy), Munich, Germany
- Joachim L. Schultze
- Systems Medicine, Deutsches Zentrum für Neurodegenerative Erkrankungen e.V. (DZNE), Bonn, Germany
- Genomics and Immunoregulation, Life & Medical Sciences (LIMES) Institute, University of Bonn, Bonn, Germany
- PRECISE, Platform for Single Cell Genomics and Epigenomics at the German Center for Neurodegenerative Diseases and the University of Bonn and West German Genome Center, Bonn, Germany
- Marc D. Beyer
- Systems Medicine, Deutsches Zentrum für Neurodegenerative Erkrankungen e.V. (DZNE), Bonn, Germany
- PRECISE, Platform for Single Cell Genomics and Epigenomics at the German Center for Neurodegenerative Diseases and the University of Bonn and West German Genome Center, Bonn, Germany
- Immunogenomics & Neurodegeneration, Deutsches Zentrum für Neurodegenerative Erkrankungen e.V. (DZNE), Bonn, Germany
6
Brooks TG, Lahens NF, Mrčela A, Grant GR. Challenges and best practices in omics benchmarking. Nat Rev Genet 2024; 25:326-339. [PMID: 38216661 DOI: 10.1038/s41576-023-00679-6] [Citation(s) in RCA: 9] [Impact Index Per Article: 9.0] [Accepted: 11/14/2023] [Indexed: 01/14/2024]
Abstract
Technological advances enabling massively parallel measurement of biological features - such as microarrays, high-throughput sequencing and mass spectrometry - have ushered in the omics era, now in its third decade. The resulting complex landscape of analytical methods has naturally fostered the growth of an omics benchmarking industry. Benchmarking refers to the process of objectively comparing and evaluating the performance of different computational or analytical techniques when processing and analysing large-scale biological data sets, such as transcriptomics, proteomics and metabolomics. With thousands of omics benchmarking studies published over the past 25 years, the field has matured to the point where the foundations of benchmarking have been established and well described. However, generating meaningful benchmarking data and properly evaluating performance in this complex domain remains challenging. In this Review, we highlight some common oversights and pitfalls in omics benchmarking. We also establish a methodology to bring the issues that can be addressed into focus and to be transparent about those that cannot: this takes the form of a spreadsheet template of guidelines for comprehensive reporting, intended to accompany publications. In addition, a survey of recent developments in benchmarking is provided as well as specific guidance for commonly encountered difficulties.
Affiliation(s)
- Thomas G Brooks
- Institute for Translational Medicine and Therapeutics, University of Pennsylvania, Philadelphia, PA, USA
- Nicholas F Lahens
- Institute for Translational Medicine and Therapeutics, University of Pennsylvania, Philadelphia, PA, USA
- Antonijo Mrčela
- Institute for Translational Medicine and Therapeutics, University of Pennsylvania, Philadelphia, PA, USA
- Gregory R Grant
- Institute for Translational Medicine and Therapeutics, University of Pennsylvania, Philadelphia, PA, USA.
- Department of Genetics, University of Pennsylvania, Philadelphia, PA, USA.
7
Monterrubio-Gómez K, Constantine-Cooke N, Vallejos CA. A review on statistical and machine learning competing risks methods. Biom J 2024; 66:e2300060. [PMID: 38351217 DOI: 10.1002/bimj.202300060] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/23/2023] [Revised: 08/31/2023] [Accepted: 10/15/2023] [Indexed: 02/16/2024]
Abstract
When modeling competing risks (CR) survival data, several techniques have been proposed in both the statistical and machine learning literature. State-of-the-art methods have extended classical approaches with more flexible assumptions that can improve predictive performance, allow high-dimensional data and missing values, among others. Despite this, modern approaches have not been widely employed in applied settings. This article aims to aid the uptake of such methods by providing a condensed compendium of CR survival methods with a unified notation and interpretation across approaches. We highlight available software and, when possible, demonstrate their usage via reproducible R vignettes. Moreover, we discuss two major concerns that can affect benchmark studies in this context: the choice of performance metrics and reproducibility.
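For readers new to competing risks, the basic quantity most of these methods target is the cumulative incidence function of one cause in the presence of the others. The sketch below is a plain nonparametric estimator of the Aalen-Johansen type written in standard-library Python (the review's own vignettes are in R); the function name, the coding of censoring as cause 0, and the four-subject data set are illustrative assumptions, not material from the article.

```python
from itertools import groupby


def cumulative_incidence(times, causes, cause):
    """Nonparametric cumulative incidence for one cause.

    causes uses 0 for censoring and positive integers for event types.
    Returns a list of (time, cumulative incidence) pairs.
    """
    data = sorted(zip(times, causes))
    at_risk = len(data)
    surv = 1.0  # overall (all-cause) survival just before the current time
    cif = 0.0
    curve = []
    for t, group in groupby(data, key=lambda rec: rec[0]):
        labels = [c for _, c in group]
        d_cause = sum(1 for c in labels if c == cause)
        d_any = sum(1 for c in labels if c != 0)
        cif += surv * d_cause / at_risk
        surv *= 1 - d_any / at_risk
        at_risk -= len(labels)
        curve.append((t, cif))
    return curve


# Toy data: four subjects, two competing causes, one censored at t = 3.
times = [1, 2, 3, 4]
causes = [1, 2, 0, 1]
cif_cause_1 = cumulative_incidence(times, causes, cause=1)
cif_cause_2 = cumulative_incidence(times, causes, cause=2)
print(cif_cause_1[-1], cif_cause_2[-1])
```

Unlike one minus a Kaplan-Meier estimate computed per cause, the cause-specific curves obtained this way never sum to more than one.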
Affiliation(s)
- Nathan Constantine-Cooke
- MRC Human Genetics Unit, University of Edinburgh, Edinburgh, UK
- Centre for Genomic and Experimental Medicine, Institute of Genetics and Cancer, University of Edinburgh, Edinburgh, UK
- Catalina A Vallejos
- MRC Human Genetics Unit, University of Edinburgh, Edinburgh, UK
- The Alan Turing Institute, London, UK
8
Steinberg PL, Liu LY, Neiman-Golden A, Patel Y, Boutros PC. Quantifying the seed sensitivity of cancer subclonal reconstruction algorithms. bioRxiv 2024:2024.02.05.579021. [PMID: 38370678 PMCID: PMC10871259 DOI: 10.1101/2024.02.05.579021] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/20/2024]
Abstract
Background: Intra-tumoural heterogeneity complicates cancer prognosis and impairs treatment success. One of the ways subclonal reconstruction (SRC) quantifies intra-tumoural heterogeneity is by estimating the number of subclones present in bulk DNA sequencing data. SRC algorithms are probabilistic and need to be initialized by a random seed. However, the seeds used in bioinformatics algorithms are rarely reported in the literature. Thus, the impact of the initializing seed on SRC solutions has not been studied. To address this gap, we generated a set of ten random seeds to systematically benchmark the seed sensitivity of three probabilistic SRC algorithms: PyClone-VI, DPClust, and PhyloWGS. Results: We characterized the seed sensitivity of three algorithms across fourteen whole-genome sequences of head and neck squamous cell carcinoma and nine SRC pipelines, each composed of a single nucleotide variant caller, a copy number aberration caller and an SRC algorithm. This led to a total of 1470 subclonal reconstructions, including 1260 single-region and 210 multi-region reconstructions. The number of subclones estimated per patient varies across SRC pipelines, but all three SRC algorithms show substantial seed sensitivity: subclone estimates vary across different seeds for the same input using the same SRC algorithm. No seed consistently estimated the mode number of subclones across all patients for any SRC algorithm. Conclusions: These findings highlight the variability in quantifying intra-tumoural heterogeneity introduced by the seed sensitivity of probabilistic SRC algorithms. We recommend that authors, reviewers and editors adopt guidelines to both report and randomize seed choices. It may also be valuable to consider seed sensitivity in the benchmarking of newly developed SRC algorithms.
These findings may be of interest in other areas of bioinformatics where seeded probabilistic algorithms are used and suggest consideration of formal seed reporting standards to enhance reproducibility.
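The benchmarking idea in this abstract (run the same seeded algorithm on the same input under several seeds, then compare each estimate with the mode) can be sketched as follows. The estimator here is a deliberately simple stand-in, an order-dependent greedy clustering of toy variant allele frequencies, and is not PyClone-VI, DPClust, or PhyloWGS; the function names, the tolerance, the seeds 1 to 10, and the data are all hypothetical.

```python
import random
from collections import Counter


def toy_estimate_subclones(vafs, seed, tol=0.1):
    """Stand-in seeded estimator: greedy clustering in a seed-dependent order."""
    rng = random.Random(seed)
    order = list(vafs)
    rng.shuffle(order)
    leaders = []
    for v in order:
        # Open a new cluster only if no existing leader is within tol.
        if not any(abs(v - c) <= tol for c in leaders):
            leaders.append(v)
    return len(leaders)


def seed_sensitivity(vafs, seeds):
    """Return per-seed estimates, their mode, and the share of seeds at the mode."""
    estimates = {s: toy_estimate_subclones(vafs, s) for s in seeds}
    mode, mode_count = Counter(estimates.values()).most_common(1)[0]
    return estimates, mode, mode_count / len(seeds)


# Toy variant allele frequencies and ten illustrative seeds.
vafs = [0.10, 0.18, 0.26, 0.34, 0.42, 0.50]
seeds = list(range(1, 11))
estimates, mode, agreement = seed_sensitivity(vafs, seeds)
print(estimates, mode, agreement)
```

Reporting the seed list alongside the per-seed estimates, as done here, is what makes a run of this kind repeatable.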
Affiliation(s)
- Philippa L. Steinberg
- Department of Human Genetics, University of California, Los Angeles, Los Angeles, CA, 90095, USA
- Jonsson Comprehensive Cancer Centre, University of California, Los Angeles, Los Angeles, CA, 90024, USA
- Institute for Precision Health, University of California, Los Angeles, Los Angeles, CA, 90095, USA
- Lydia Y. Liu
- Department of Human Genetics, University of California, Los Angeles, Los Angeles, CA, 90095, USA
- Jonsson Comprehensive Cancer Centre, University of California, Los Angeles, Los Angeles, CA, 90024, USA
- Institute for Precision Health, University of California, Los Angeles, Los Angeles, CA, 90095, USA
- Department of Medical Biophysics, University of Toronto, Toronto, ON, M5G 1L7, Canada
- Princess Margaret Cancer Centre, University Health Network, Toronto, ON, M5G 2C1, Canada
- Anna Neiman-Golden
- Department of Human Genetics, University of California, Los Angeles, Los Angeles, CA, 90095, USA
- Jonsson Comprehensive Cancer Centre, University of California, Los Angeles, Los Angeles, CA, 90024, USA
- Institute for Precision Health, University of California, Los Angeles, Los Angeles, CA, 90095, USA
- Yash Patel
- Department of Human Genetics, University of California, Los Angeles, Los Angeles, CA, 90095, USA
- Jonsson Comprehensive Cancer Centre, University of California, Los Angeles, Los Angeles, CA, 90024, USA
- Institute for Precision Health, University of California, Los Angeles, Los Angeles, CA, 90095, USA
- Paul C. Boutros
- Department of Human Genetics, University of California, Los Angeles, Los Angeles, CA, 90095, USA
- Jonsson Comprehensive Cancer Centre, University of California, Los Angeles, Los Angeles, CA, 90024, USA
- Institute for Precision Health, University of California, Los Angeles, Los Angeles, CA, 90095, USA
- Department of Urology, University of California, Los Angeles, Los Angeles, CA, 90095, USA
9
Sami A, El-Metwally S, Rashad MZ. MAC-ErrorReads: machine learning-assisted classifier for filtering erroneous NGS reads. BMC Bioinformatics 2024; 25:61. [PMID: 38321434 PMCID: PMC10848413 DOI: 10.1186/s12859-024-05681-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/12/2023] [Accepted: 01/29/2024] [Indexed: 02/08/2024] Open
Abstract
BACKGROUND: The rapid advancement of next-generation sequencing (NGS) machines in terms of speed and affordability has led to the generation of a massive amount of biological data at the expense of data quality as errors become more prevalent. This introduces the need to utilize different approaches to detect and filter errors, and data quality assurance is moved from the hardware space to the software preprocessing stages. RESULTS: We introduce MAC-ErrorReads, a novel Machine learning-Assisted Classifier designed for filtering Erroneous NGS Reads. MAC-ErrorReads transforms the erroneous NGS read filtration process into a robust binary classification task, employing five supervised machine learning algorithms. These models are trained on features extracted through the computation of Term Frequency-Inverse Document Frequency (TF-IDF) values from various datasets such as E. coli, GAGE S. aureus, H. Chr14, Arabidopsis thaliana Chr1 and Metriaclima zebra. Notably, Naive Bayes demonstrated robust performance across various datasets, displaying high accuracy, precision, recall, F1-score, MCC, and ROC values. The MAC-ErrorReads NB model accurately classified S. aureus reads, surpassing most error correction tools with a 38.69% alignment rate. For H. Chr14, tools like Lighter, Karect, CARE, Pollux, and MAC-ErrorReads showed rates above 99%. BFC and RECKONER exceeded 98%, while Fiona had 95.78%. For the Arabidopsis thaliana Chr1, Pollux, Karect, RECKONER, and MAC-ErrorReads demonstrated good alignment rates of 92.62%, 91.80%, 91.78%, and 90.87%, respectively. For the Metriaclima zebra, Pollux achieved a high alignment rate of 91.23%, despite having the lowest number of mapped reads. MAC-ErrorReads, Karect, and RECKONER demonstrated good alignment rates of 83.76%, 83.71%, and 83.67%, respectively, while also producing reasonable numbers of mapped reads to the reference genome.
CONCLUSIONS: This study demonstrates that machine learning approaches for filtering NGS reads effectively identify and retain the most accurate reads, significantly enhancing assembly quality and genomic coverage. The integration of genomics and artificial intelligence through machine learning algorithms holds promise for enhancing NGS data quality, advancing downstream data analysis accuracy, and opening new opportunities in genetics, genomics, and personalized medicine research.
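The feature-extraction step named in this abstract, TF-IDF values computed from reads, can be illustrated by treating each read as a "document" of overlapping k-mers. The sketch below uses the textbook weighting tf × log(N / df); the k-mer tokenization, k = 3, the function names, and the toy reads are assumptions, and the exact TF-IDF variant and tokenization used by MAC-ErrorReads may differ. The classification step is omitted.

```python
import math
from collections import Counter


def kmers(read, k=3):
    """Overlapping k-mers of a read, used here as the 'terms' of a document."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def tfidf(reads, k=3):
    """Textbook TF-IDF per read: (count / total k-mers) * log(N / document frequency)."""
    docs = [Counter(kmers(r, k)) for r in reads]
    n_docs = len(docs)
    # Iterating a Counter yields each distinct k-mer once per read.
    doc_freq = Counter(kmer for d in docs for kmer in d)
    vectors = []
    for d in docs:
        total = sum(d.values())
        vectors.append(
            {kmer: (count / total) * math.log(n_docs / doc_freq[kmer])
             for kmer, count in d.items()}
        )
    return vectors


# Toy reads; "ACG" occurs in every read, so its weight is zero.
reads = ["ACGTAC", "ACGTTT", "ACGGGG"]
vectors = tfidf(reads)
print(vectors[0])
```

K-mers shared by every read carry no weight under this scheme, while k-mers confined to a single read receive the largest weights, which is the signal a downstream classifier would learn from.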
Affiliation(s)
- Amira Sami
- Department of Computer Science, Faculty of Computers and Information, Mansoura University, P.O. Box: 35516, Mansoura, Egypt
- Sara El-Metwally
- Department of Computer Science, Faculty of Computers and Information, Mansoura University, P.O. Box: 35516, Mansoura, Egypt.
- Biomedical Informatics Department, Faculty of Computer Science and Engineering, New Mansoura University, Gamasa, 35712, Egypt.
- M Z Rashad
- Department of Computer Science, Faculty of Computers and Information, Mansoura University, P.O. Box: 35516, Mansoura, Egypt
10
Mukherjee A, Kar I, Patra AK. Understanding anthelmintic resistance in livestock using "omics" approaches. Environ Sci Pollut Res Int 2023; 30:125439-125463. [PMID: 38015400 DOI: 10.1007/s11356-023-31045-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/29/2023] [Accepted: 11/08/2023] [Indexed: 11/29/2023]
Abstract
Widespread and improper use of various anthelmintics, together with genetic and epidemiological factors, has resulted in anthelmintic-resistant (AR) helminth populations in livestock. Resistance is now common worldwide in gastrointestinal nematode (GIN) infections of livestock animals, including sheep, goats, and cattle. Therefore, the mechanisms underlying AR in parasitic worm species have been the subject of ample research to tackle this challenge. Current and emerging technologies in the disciplines of genomics, transcriptomics, metabolomics, and proteomics in livestock species have advanced the understanding of the intricate molecular AR mechanisms in many major parasites. The technologies have improved the identification of possible biomarkers of resistant parasites and the ability to find the actual causative genes, regulatory networks, and pathways of parasites governing AR development, including the dynamics of helminth infection and host-parasite infections. In this review, various "omics"-driven technologies including genome scan, candidate gene, quantitative trait loci, transcriptomic, proteomic, and metabolomic approaches are described to understand AR of parasites of veterinary importance. Challenges and future prospects of these "omics" approaches are also discussed.
Affiliation(s)
- Ayan Mukherjee
- Department of Animal Biotechnology, West Bengal University of Animal and Fishery Sciences, Nadia, Mohanpur, West Bengal, India
| | - Indrajit Kar
- Department of Avian Sciences, West Bengal University of Animal and Fishery Sciences, Nadia, Mohanpur, West Bengal, India
| | - Amlan Kumar Patra
- American Institute for Goat Research, Langston University, Oklahoma, 73050, USA.
| |
|
11
|
Fernandez ME, Martinez-Romero J, Aon MA, Bernier M, Price NL, de Cabo R. How is Big Data reshaping preclinical aging research? Lab Anim (NY) 2023; 52:289-314. [PMID: 38017182 DOI: 10.1038/s41684-023-01286-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/24/2023] [Accepted: 10/10/2023] [Indexed: 11/30/2023]
Abstract
The exponential scientific and technological progress during the past 30 years has favored the comprehensive characterization of aging processes with their multivariate nature, leading to the advent of Big Data in preclinical aging research. Spanning from molecular omics to organism-level deep phenotyping, Big Data demands large computational resources for storage and analysis, as well as new analytical tools and conceptual frameworks to gain novel insights leading to discovery. Systems biology has emerged as a paradigm that utilizes Big Data to gain insightful information enabling a better understanding of living organisms, visualized as multilayered networks of interacting molecules, cells, tissues and organs at different spatiotemporal scales. In this framework, where aging, health and disease represent emergent states from an evolving dynamic complex system, context given by, for example, strain, sex and feeding times, becomes paramount for defining the biological trajectory of an organism. Using bioinformatics and artificial intelligence, the systems biology approach is leading to remarkable advances in our understanding of the underlying mechanism of aging biology and assisting in creative experimental study designs in animal models. Future in-depth knowledge acquisition will depend on the ability to fully integrate information from different spatiotemporal scales in organisms, which will probably require the adoption of theories and methods from the field of complex systems. Here we review state-of-the-art approaches in preclinical research, with a focus on rodent models, that are leading to conceptual and/or technical advances in leveraging Big Data to understand basic aging biology and its full translational potential.
Affiliation(s)
- Maria Emilia Fernandez
- Experimental Gerontology Section, Translational Gerontology Branch, National Institute on Aging, National Institutes of Health, Baltimore, MD, USA
| | - Jorge Martinez-Romero
- Experimental Gerontology Section, Translational Gerontology Branch, National Institute on Aging, National Institutes of Health, Baltimore, MD, USA
- Laboratory of Epidemiology and Population Science, National Institute on Aging, National Institutes of Health, Baltimore, MD, USA
| | - Miguel A Aon
- Experimental Gerontology Section, Translational Gerontology Branch, National Institute on Aging, National Institutes of Health, Baltimore, MD, USA
- Laboratory of Cardiovascular Science, National Institute on Aging, National Institutes of Health, Baltimore, MD, USA
| | - Michel Bernier
- Experimental Gerontology Section, Translational Gerontology Branch, National Institute on Aging, National Institutes of Health, Baltimore, MD, USA
| | - Nathan L Price
- Experimental Gerontology Section, Translational Gerontology Branch, National Institute on Aging, National Institutes of Health, Baltimore, MD, USA
| | - Rafael de Cabo
- Experimental Gerontology Section, Translational Gerontology Branch, National Institute on Aging, National Institutes of Health, Baltimore, MD, USA.
| |
|
12
|
Rubinov M. Circular and unified analysis in network neuroscience. eLife 2023; 12:e79559. [PMID: 38014843 PMCID: PMC10684154 DOI: 10.7554/elife.79559] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/18/2022] [Accepted: 10/18/2023] [Indexed: 11/29/2023] Open
Abstract
Genuinely new discovery transcends existing knowledge. Despite this, many analyses in systems neuroscience neglect to test new speculative hypotheses against benchmark empirical facts. Some of these analyses inadvertently use circular reasoning to present existing knowledge as new discovery. Here, I discuss how this problem can confound key results and estimate that it has affected more than three thousand studies in network neuroscience over the last decade. I suggest that future studies can reduce this problem by limiting the use of speculative evidence, integrating existing knowledge into benchmark models, and rigorously testing proposed discoveries against these models. I conclude with a summary of practical challenges and recommendations.
Affiliation(s)
- Mika Rubinov
- Departments of Biomedical Engineering, Computer Science, and Psychology, Vanderbilt University, Nashville, United States
- Janelia Research Campus, Howard Hughes Medical Institute, Ashburn, United States
| |
|
13
|
Yang J, Liu Y, Shang J, Chen Q, Chen Q, Ren L, Zhang N, Yu Y, Li Z, Song Y, Yang S, Scherer A, Tong W, Hong H, Xiao W, Shi L, Zheng Y. The Quartet Data Portal: integration of community-wide resources for multiomics quality control. Genome Biol 2023; 24:245. [PMID: 37884999 PMCID: PMC10601216 DOI: 10.1186/s13059-023-03091-9] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/08/2022] [Accepted: 10/17/2023] [Indexed: 10/28/2023] Open
Abstract
The Quartet Data Portal facilitates community access to well-characterized reference materials, reference datasets, and related resources established based on a family of four individuals with identical twins from the Quartet Project. Users can request DNA, RNA, protein, and metabolite reference materials, as well as datasets generated across omics, platforms, labs, protocols, and batches. Reproducible analysis tools allow for objective performance assessment of user-submitted data, while interactive visualization tools support rapid exploration of reference datasets. A closed-loop "distribution-collection-evaluation-integration" workflow enables updates and integration of community-contributed multiomics data. Ultimately, this portal helps promote the advancement of reference datasets and multiomics quality control.
Affiliation(s)
- Jingcheng Yang
- State Key Laboratory of Genetic Engineering, School of Life Sciences, Human Phenome Institute and Shanghai Cancer Center, Fudan University, Shanghai, China
- Greater Bay Area Institute of Precision Medicine, Guangzhou, Guangdong, China
| | - Yaqing Liu
- State Key Laboratory of Genetic Engineering, School of Life Sciences, Human Phenome Institute and Shanghai Cancer Center, Fudan University, Shanghai, China
| | - Jun Shang
- State Key Laboratory of Genetic Engineering, School of Life Sciences, Human Phenome Institute and Shanghai Cancer Center, Fudan University, Shanghai, China
| | - Qiaochu Chen
- State Key Laboratory of Genetic Engineering, School of Life Sciences, Human Phenome Institute and Shanghai Cancer Center, Fudan University, Shanghai, China
| | - Qingwang Chen
- State Key Laboratory of Genetic Engineering, School of Life Sciences, Human Phenome Institute and Shanghai Cancer Center, Fudan University, Shanghai, China
| | - Luyao Ren
- State Key Laboratory of Genetic Engineering, School of Life Sciences, Human Phenome Institute and Shanghai Cancer Center, Fudan University, Shanghai, China
| | - Naixin Zhang
- State Key Laboratory of Genetic Engineering, School of Life Sciences, Human Phenome Institute and Shanghai Cancer Center, Fudan University, Shanghai, China
| | - Ying Yu
- State Key Laboratory of Genetic Engineering, School of Life Sciences, Human Phenome Institute and Shanghai Cancer Center, Fudan University, Shanghai, China
| | - Zhihui Li
- State Key Laboratory of Genetic Engineering, School of Life Sciences, Human Phenome Institute and Shanghai Cancer Center, Fudan University, Shanghai, China
| | - Yueqiang Song
- State Key Laboratory of Genetic Engineering, School of Life Sciences, Human Phenome Institute and Shanghai Cancer Center, Fudan University, Shanghai, China
| | - Shengpeng Yang
- Intelligent Storage, Alibaba Cloud, Alibaba Group, Hangzhou, Zhejiang, China
| | - Andreas Scherer
- Institute for Molecular Medicine Finland (FIMM), University of Helsinki, Helsinki, Finland
- EATRIS ERIC-European Infrastructure for Translational Medicine, Amsterdam, the Netherlands
| | - Weida Tong
- Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, US Food and Drug Administration, Jefferson, AR, USA
| | - Huixiao Hong
- Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, US Food and Drug Administration, Jefferson, AR, USA
| | - Wenming Xiao
- Office of Oncological Diseases, Office of New Drugs, Center for Drug Evaluation and Research, US Food and Drug Administration, Silver Spring, MD, USA
| | - Leming Shi
- State Key Laboratory of Genetic Engineering, School of Life Sciences, Human Phenome Institute and Shanghai Cancer Center, Fudan University, Shanghai, China.
- International Human Phenome Institutes (Shanghai), Shanghai, China.
| | - Yuanting Zheng
- State Key Laboratory of Genetic Engineering, School of Life Sciences, Human Phenome Institute and Shanghai Cancer Center, Fudan University, Shanghai, China.
| |
|
14
|
Wei L, Niraula D, Gates EDH, Fu J, Luo Y, Nyflot MJ, Bowen SR, El Naqa IM, Cui S. Artificial intelligence (AI) and machine learning (ML) in precision oncology: a review on enhancing discoverability through multiomics integration. Br J Radiol 2023; 96:20230211. [PMID: 37660402 PMCID: PMC10546458 DOI: 10.1259/bjr.20230211] [Citation(s) in RCA: 28] [Impact Index Per Article: 14.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/28/2023] [Revised: 06/15/2023] [Accepted: 06/27/2023] [Indexed: 09/05/2023] Open
Abstract
Multiomics data including imaging radiomics and various types of molecular biomarkers have been increasingly investigated for better diagnosis and therapy in the era of precision oncology. Artificial intelligence (AI) including machine learning (ML) and deep learning (DL) techniques combined with the exponential growth of multiomics data may have great potential to revolutionize cancer subtyping, risk stratification, prognostication, prediction and clinical decision-making. In this article, we first present different categories of multiomics data and their roles in diagnosis and therapy. Second, AI-based data fusion methods and modeling methods as well as different validation schemes are illustrated. Third, the applications and examples of multiomics research in oncology are demonstrated. Finally, the challenges regarding data set heterogeneity, availability of omics data, and validation of the research are discussed. The transition of multiomics research to real clinics still requires consistent efforts in standardizing omics data collection and analysis, building computational infrastructure for data sharing and storing, developing advanced methods to improve data fusion and interpretability, and ultimately, conducting large-scale prospective clinical trials to fill the gap between study findings and clinical benefits.
Affiliation(s)
- Lise Wei
- Department of Radiation Oncology, University of Michigan, Michigan, United States
| | - Dipesh Niraula
- Department of Radiation Oncology, Moffitt Cancer Center, Tampa, United States
| | - Evan D. H. Gates
- Department of Radiation Oncology, University of Washington, Washington, United States
| | - Jie Fu
- Department of Radiation Oncology, Stanford University, Stanford, California, United States
| | - Yi Luo
- Department of Radiation Oncology, Moffitt Cancer Center, Tampa, United States
| | - Matthew J. Nyflot
- Department of Radiation Oncology, University of Washington, Washington, United States
| | - Stephen R. Bowen
- Department of Radiation Oncology, University of Washington, Washington, United States
| | - Issam M El Naqa
- Department of Radiation Oncology, Moffitt Cancer Center, Tampa, United States
| | - Sunan Cui
- Department of Radiation Oncology, University of Washington, Washington, United States
| |
|
15
|
Sonrel A, Luetge A, Soneson C, Mallona I, Germain PL, Knyazev S, Gilis J, Gerber R, Seurinck R, Paul D, Sonder E, Crowell HL, Fanaswala I, Al-Ajami A, Heidari E, Schmeing S, Milosavljevic S, Saeys Y, Mangul S, Robinson MD. Meta-analysis of (single-cell method) benchmarks reveals the need for extensibility and interoperability. Genome Biol 2023; 24:119. [PMID: 37198712 PMCID: PMC10189979 DOI: 10.1186/s13059-023-02962-5] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/27/2022] [Accepted: 05/06/2023] [Indexed: 05/19/2023] Open
Abstract
Computational methods represent the lifeblood of modern molecular biology. Benchmarking is important for all methods, but with a focus here on computational methods, benchmarking is critical to dissect important steps of analysis pipelines, formally assess performance across common situations as well as edge cases, and ultimately guide users on what tools to use. Benchmarking can also be important for community building and advancing methods in a principled way. We conducted a meta-analysis of recent single-cell benchmarks to summarize the scope, extensibility, and neutrality, as well as technical features and whether best practices in open data and reproducible research were followed. The results highlight that while benchmarks often make code available and are in principle reproducible, they remain difficult to extend, for example, as new methods and new ways to assess methods emerge. In addition, embracing containerization and workflow systems would enhance reusability of intermediate benchmarking results, thus also driving wider adoption.
Affiliation(s)
- Anthony Sonrel
- Department of Molecular Life Sciences, University of Zurich, Zurich, Switzerland
- SIB Swiss Institute of Bioinformatics, Zurich, Switzerland
| | - Almut Luetge
- Department of Molecular Life Sciences, University of Zurich, Zurich, Switzerland
- SIB Swiss Institute of Bioinformatics, Zurich, Switzerland
| | - Charlotte Soneson
- SIB Swiss Institute of Bioinformatics, Zurich, Switzerland
- Friedrich Miescher Institute for Biomedical Research, Basel, Switzerland
| | - Izaskun Mallona
- Department of Molecular Life Sciences, University of Zurich, Zurich, Switzerland
- SIB Swiss Institute of Bioinformatics, Zurich, Switzerland
- Department of Quantitative Biomedicine, University of Zurich, Zurich, Switzerland
| | - Pierre-Luc Germain
- Department of Molecular Life Sciences, University of Zurich, Zurich, Switzerland
- SIB Swiss Institute of Bioinformatics, Zurich, Switzerland
- D-HEST Institute for Neuroscience, ETH Zürich, Zurich, Switzerland
| | - Sergey Knyazev
- Department of Pathology and Laboratory Medicine, David Geffen School of Medicine, University of California, Los Angeles, CA, USA
- Department of Clinical Pharmacy, School of Pharmacy, University of Southern California, Los Angeles, CA, USA
| | - Jeroen Gilis
- Department of Applied Mathematics, Computer Science & Statistics, Ghent University, Ghent, Belgium
- Data Mining and Modeling for Biomedicine, VIB Center for Inflammation Research, Ghent, Belgium
- Bioinformatics Institute Ghent, Ghent University, Ghent, Belgium
| | - Reto Gerber
- Department of Molecular Life Sciences, University of Zurich, Zurich, Switzerland
- SIB Swiss Institute of Bioinformatics, Zurich, Switzerland
| | - Ruth Seurinck
- Department of Applied Mathematics, Computer Science & Statistics, Ghent University, Ghent, Belgium
- Data Mining and Modeling for Biomedicine, VIB Center for Inflammation Research, Ghent, Belgium
| | - Dominique Paul
- Department of Molecular Life Sciences, University of Zurich, Zurich, Switzerland
| | - Emanuel Sonder
- Department of Molecular Life Sciences, University of Zurich, Zurich, Switzerland
- SIB Swiss Institute of Bioinformatics, Zurich, Switzerland
- D-HEST Institute for Neuroscience, ETH Zürich, Zurich, Switzerland
| | - Helena L Crowell
- Department of Molecular Life Sciences, University of Zurich, Zurich, Switzerland
- SIB Swiss Institute of Bioinformatics, Zurich, Switzerland
| | - Imran Fanaswala
- Department of Molecular Life Sciences, University of Zurich, Zurich, Switzerland
- SIB Swiss Institute of Bioinformatics, Zurich, Switzerland
| | - Ahmad Al-Ajami
- Department of Molecular Life Sciences, University of Zurich, Zurich, Switzerland
- SIB Swiss Institute of Bioinformatics, Zurich, Switzerland
| | - Elyas Heidari
- Department of Molecular Life Sciences, University of Zurich, Zurich, Switzerland
- SIB Swiss Institute of Bioinformatics, Zurich, Switzerland
| | - Stephan Schmeing
- Department of Molecular Life Sciences, University of Zurich, Zurich, Switzerland
- SIB Swiss Institute of Bioinformatics, Zurich, Switzerland
| | - Stefan Milosavljevic
- Department of Molecular Life Sciences, University of Zurich, Zurich, Switzerland
- SIB Swiss Institute of Bioinformatics, Zurich, Switzerland
- Department of Evolutionary Biology and Environmental Studies, University of Zurich, Zurich, Switzerland
| | - Yvan Saeys
- Department of Applied Mathematics, Computer Science & Statistics, Ghent University, Ghent, Belgium
- Data Mining and Modeling for Biomedicine, VIB Center for Inflammation Research, Ghent, Belgium
| | - Serghei Mangul
- Department of Clinical Pharmacy, School of Pharmacy, University of Southern California, Los Angeles, CA, USA
| | - Mark D Robinson
- Department of Molecular Life Sciences, University of Zurich, Zurich, Switzerland.
- SIB Swiss Institute of Bioinformatics, Zurich, Switzerland.
| |
|
16
|
Lazzardi S, Valle F, Mazzolini A, Scialdone A, Caselle M, Osella M. Emergent statistical laws in single-cell transcriptomic data. Phys Rev E 2023; 107:044403. [PMID: 37198814 DOI: 10.1103/physreve.107.044403] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/21/2022] [Accepted: 03/24/2023] [Indexed: 05/19/2023]
Abstract
Large-scale data on single-cell gene expression have the potential to unravel the specific transcriptional programs of different cell types. The structure of these expression datasets suggests a similarity with several other complex systems that can be analogously described through the statistics of their basic building blocks. Transcriptomes of single cells are collections of messenger RNA abundances transcribed from a common set of genes just as books are different collections of words from a shared vocabulary, genomes of different species are specific compositions of genes belonging to evolutionary families, and ecological niches can be described by their species abundances. Following this analogy, we identify several emergent statistical laws in single-cell transcriptomic data closely similar to regularities found in linguistics, ecology, or genomics. A simple mathematical framework can be used to analyze the relations between different laws and the possible mechanisms behind their ubiquity. Importantly, treatable statistical models can be useful tools in transcriptomics to disentangle the actual biological variability from general statistical effects present in most component systems and from the consequences of the sampling process inherent to the experimental technique.
Affiliation(s)
- Silvia Lazzardi
- Department of Physics, University of Turin and INFN, via P. Giuria 1, 10125 Turin, Italy
| | - Filippo Valle
- Department of Physics, University of Turin and INFN, via P. Giuria 1, 10125 Turin, Italy
| | - Andrea Mazzolini
- Laboratoire de Physique de l'École Normale Supérieure (PSL University), CNRS, Sorbonne Université and Université de Paris, 75005 Paris, France
| | - Antonio Scialdone
- Institute of Epigenetics and Stem Cells, Helmholtz Zentrum München, Feodor-Lynen-Straße 21, 81377 München, Germany and Institute of Functional Epigenetics and Institute of Computational Biology, Helmholtz Zentrum München, Ingolstädter Landstraße 1, 85764 Neuherberg, Germany
| | - Michele Caselle
- Department of Physics, University of Turin and INFN, via P. Giuria 1, 10125 Turin, Italy
| | - Matteo Osella
- Department of Physics, University of Turin and INFN, via P. Giuria 1, 10125 Turin, Italy
| |
|
17
|
Crowell HL, Morillo Leonardo SX, Soneson C, Robinson MD. The shaky foundations of simulating single-cell RNA sequencing data. Genome Biol 2023; 24:62. [PMID: 36991470 PMCID: PMC10061781 DOI: 10.1186/s13059-023-02904-1] [Citation(s) in RCA: 13] [Impact Index Per Article: 6.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/23/2021] [Accepted: 03/20/2023] [Indexed: 03/31/2023] Open
Abstract
BACKGROUND With the emergence of hundreds of single-cell RNA-sequencing (scRNA-seq) datasets, the number of computational tools to analyze aspects of the generated data has grown rapidly. As a result, there is a recurring need to demonstrate whether newly developed methods are truly performant, both on their own and in comparison to existing tools. Benchmark studies aim to consolidate the space of available methods for a given task and often use simulated data that provide a ground truth for evaluations, thus demanding a high quality standard to make results credible and transferable to real data. RESULTS Here, we evaluated methods for synthetic scRNA-seq data generation in their ability to mimic experimental data. Besides comparing gene- and cell-level quality control summaries in both one- and two-dimensional settings, we further quantified these at the batch- and cluster-level. Secondly, we investigated the effect of simulators on clustering and batch correction method comparisons, and, thirdly, which and to what extent quality control summaries can capture reference-simulation similarity. CONCLUSIONS Our results suggest that most simulators are unable to accommodate complex designs without introducing artificial effects, they yield over-optimistic performance of integration and potentially unreliable ranking of clustering methods, and it is generally unknown which summaries are important to ensure effective simulation-based method comparisons.
Affiliation(s)
- Helena L Crowell
- Department of Molecular Life Sciences, University of Zurich, Zurich, Switzerland
- SIB Swiss Institute of Bioinformatics, University of Zurich, Zurich, Switzerland
| | | | - Charlotte Soneson
- Department of Molecular Life Sciences, University of Zurich, Zurich, Switzerland
- SIB Swiss Institute of Bioinformatics, University of Zurich, Zurich, Switzerland
- Current address: Friedrich Miescher Institute for Biomedical Research and SIB Swiss Institute of Bioinformatics, Basel, Switzerland
| | - Mark D Robinson
- Department of Molecular Life Sciences, University of Zurich, Zurich, Switzerland.
- SIB Swiss Institute of Bioinformatics, University of Zurich, Zurich, Switzerland.
| |
|
18
|
Johns M, Meurers T, Wirth FN, Haber AC, Müller A, Halilovic M, Balzer F, Prasser F. Data Provenance in Biomedical Research: Scoping Review. J Med Internet Res 2023; 25:e42289. [PMID: 36972116 PMCID: PMC10132013 DOI: 10.2196/42289] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/30/2022] [Revised: 12/14/2022] [Accepted: 12/23/2022] [Indexed: 03/29/2023] Open
Abstract
BACKGROUND Data provenance refers to the origin, processing, and movement of data. Reliable and precise knowledge about data provenance has great potential to improve reproducibility as well as quality in biomedical research and, therefore, to foster good scientific practice. However, despite the increasing interest in data provenance technologies in the literature and their implementation in other disciplines, these technologies have not yet been widely adopted in biomedical research. OBJECTIVE The aim of this scoping review was to provide a structured overview of the body of knowledge on provenance methods in biomedical research by systematizing articles covering data provenance technologies developed for or used in this application area; describing and comparing the functionalities as well as the design of the provenance technologies used; and identifying gaps in the literature, which could provide opportunities for future research on technologies that could receive more widespread adoption. METHODS Following a methodological framework for scoping studies and the PRISMA-ScR (Preferred Reporting Items for Systematic Reviews and Meta-Analyses Extension for Scoping Reviews) guidelines, articles were identified by searching the PubMed, IEEE Xplore, and Web of Science databases and subsequently screened for eligibility. We included original articles covering software-based provenance management for scientific research published between 2010 and 2021. A set of data items was defined along the following five axes: publication metadata, application scope, provenance aspects covered, data representation, and functionalities. The data items were extracted from the articles, stored in a charting spreadsheet, and summarized in tables and figures. RESULTS We identified 44 original articles published between 2010 and 2021. We found that the solutions described were heterogeneous along all axes. We also identified relationships among motivations for the use of provenance information, feature sets (capture, storage, retrieval, visualization, and analysis), and implementation details such as the data models and technologies used. The important gap that we identified is that only a few publications address the analysis of provenance data or use established provenance standards, such as PROV. CONCLUSIONS The heterogeneity of provenance methods, models, and implementations found in the literature points to the lack of a unified understanding of provenance concepts for biomedical data. Providing a common framework, a biomedical reference, and benchmarking data sets could foster the development of more comprehensive provenance solutions.
Affiliation(s)
- Marco Johns
- Medical Informatics Group, Berlin Institute of Health at Charité - Universitätsmedizin Berlin, Berlin, Germany
| | - Thierry Meurers
- Medical Informatics Group, Berlin Institute of Health at Charité - Universitätsmedizin Berlin, Berlin, Germany
| | - Felix N Wirth
- Medical Informatics Group, Berlin Institute of Health at Charité - Universitätsmedizin Berlin, Berlin, Germany
| | - Anna C Haber
- Medical Informatics Group, Berlin Institute of Health at Charité - Universitätsmedizin Berlin, Berlin, Germany
| | - Armin Müller
- Medical Informatics Group, Berlin Institute of Health at Charité - Universitätsmedizin Berlin, Berlin, Germany
| | - Mehmed Halilovic
- Medical Informatics Group, Berlin Institute of Health at Charité - Universitätsmedizin Berlin, Berlin, Germany
| | - Felix Balzer
- Institute of Medical Informatics, Charité - Universitätsmedizin Berlin, Berlin, Germany
| | - Fabian Prasser
- Medical Informatics Group, Berlin Institute of Health at Charité - Universitätsmedizin Berlin, Berlin, Germany
| |
|
19
|
Griffin AT, Vlahos LJ, Chiuzan C, Califano A. NaRnEA: An Information Theoretic Framework for Gene Set Analysis. Entropy (Basel) 2023; 25:e25030542. [PMID: 36981431 PMCID: PMC10048242 DOI: 10.3390/e25030542] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 11/08/2022] [Revised: 03/03/2023] [Accepted: 03/13/2023] [Indexed: 05/26/2023]
Abstract
Gene sets are being increasingly leveraged to make high-level biological inferences from transcriptomic data; however, existing gene set analysis methods rely on overly conservative, heuristic approaches for quantifying the statistical significance of gene set enrichment. We created Nonparametric analytical-Rank-based Enrichment Analysis (NaRnEA) to facilitate accurate and robust gene set analysis with an optimal null model derived using the information theoretic Principle of Maximum Entropy. By measuring the differential activity of ~2500 transcriptional regulatory proteins based on the differential expression of each protein's transcriptional targets between primary tumors and normal tissue samples in three cohorts from The Cancer Genome Atlas (TCGA), we demonstrate that NaRnEA critically improves on two widely used gene set analysis methods: Gene Set Enrichment Analysis (GSEA) and analytical-Rank-based Enrichment Analysis (aREA). We show that the NaRnEA-inferred differential protein activity is significantly correlated with differential protein abundance inferred from independent, phenotype-matched mass spectrometry data in the Clinical Proteomic Tumor Analysis Consortium (CPTAC), confirming the statistical and biological accuracy of our approach. Additionally, our analysis crucially demonstrates that the sample-shuffling empirical null models leveraged by GSEA and aREA for gene set analysis are overly conservative, a shortcoming that is avoided by the newly developed Maximum Entropy analytical null model employed by NaRnEA.
Affiliation(s)
- Aaron T. Griffin
- Medical Scientist Training Program, Columbia University Irving Medical Center, New York, NY 10032, USA
- Department of Systems Biology, Columbia University Irving Medical Center, New York, NY 10032, USA
| | - Lukas J. Vlahos
- Department of Systems Biology, Columbia University Irving Medical Center, New York, NY 10032, USA
| | - Codruta Chiuzan
- Department of Biostatistics, Columbia University Irving Medical Center, New York, NY 10032, USA
| | - Andrea Califano
- Department of Systems Biology, Columbia University Irving Medical Center, New York, NY 10032, USA
- Department of Biochemistry and Molecular Biophysics, Columbia University, New York, NY 10032, USA
- Department of Medicine, Vagelos College of Physicians and Surgeons, Columbia University, New York, NY 10032, USA
- JP Sulzberger Columbia Genome Center, Columbia University Irving Medical Center, New York, NY 10032, USA
- Department of Biomedical Informatics, Columbia University, New York, NY 10032, USA
- Herbert Irving Comprehensive Cancer Center, Columbia University Irving Medical Center, New York, NY 10032, USA
| |
|
20
|
Łabaj PP, Dopazo J, Xiao W, Kreil DP. Editorial: Critical assessment of massive data analysis (CAMDA) annual conference 2021. Front Genet 2023; 14:1154398. [PMID: 36873943 PMCID: PMC9978925 DOI: 10.3389/fgene.2023.1154398] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/30/2023] [Accepted: 02/06/2023] [Indexed: 02/18/2023] Open
Affiliation(s)
- Paweł P Łabaj
- Małopolska Centre of Biotechnology, Jagiellonian University, Kraków, Lesser Poland, Poland
| | - Joaquin Dopazo
- Computational Medicine Platform, Andalusian Public Foundation Progress and Health-FPS, Sevilla, Spain.,Institute of Biomedicine of Seville, IBiS, University Hospital Virgen del Rocío/CSIC/University of Sevilla, Sevilla, Spain
| | - Wenzhong Xiao
- Genome Technology Center, School of Medicine, Stanford University, Palo Alto, CA, United States.,Massachusetts General Hospital, Harvard Medical School, Boston, MA, United States
| | - David P Kreil
- Department of Biotechnology, Boku University Vienna, Vienna, Austria
| |
|
21
|
Sarwal V, Brito J, Mangul S, Koslicki D. TAMPA: interpretable analysis and visualization of metagenomics-based taxon abundance profiles. Gigascience 2022; 12:giad008. [PMID: 36852763 PMCID: PMC9972184 DOI: 10.1093/gigascience/giad008] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/09/2022] [Revised: 11/12/2022] [Accepted: 02/02/2023] [Indexed: 03/01/2023] Open
Abstract
BACKGROUND Metagenomic taxonomic profiling aims to predict the identity and relative abundance of taxa in a given whole-genome sequencing metagenomic sample. A recent surge in computational methods that aim to accurately estimate taxonomic profiles, called taxonomic profilers, has motivated community-driven efforts to create standardized benchmarking datasets and platforms, standardized taxonomic profile formats, and a benchmarking platform to assess tool performance. While this standardization is essential, there is currently a lack of tools to visualize the standardized output of the many existing taxonomic profilers. Thus, benchmarking studies rely on single-value metrics to compare the performance of tools and to compare against benchmarking datasets. This is one of the major problems in analyzing metagenomic profiling data, since single metrics, such as the F1 score, fail to capture the biological differences between the datasets. FINDINGS Here we report the development of TAMPA (Taxonomic metagenome profiling evaluation), a robust and easy-to-use method that allows scientists to easily interpret and interact with taxonomic profiles produced by the many different taxonomic profiler methods beyond the standard metrics used by the scientific community. We demonstrate the unique ability of TAMPA to generate a novel biological hypothesis by highlighting the taxonomic differences between samples otherwise missed by commonly utilized metrics. CONCLUSION In this study, we show that TAMPA can help visualize the output of taxonomic profilers, enabling biologists to effectively choose the most appropriate profiling method to use on their metagenomics data. TAMPA is available on GitHub, Bioconda, and Galaxy Toolshed at https://github.com/dkoslicki/TAMPA and is released under the MIT license.
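The abstract's point about single-value metrics can be illustrated with a minimal sketch. It is not taken from TAMPA: the function `f1_score`, the taxon names, and both predicted profiles are hypothetical, and the score is computed on presence/absence only. Two profiles that miss different taxa can still receive the same F1 score.

```python
def f1_score(truth: set, predicted: set) -> float:
    """F1 score on presence/absence of taxa (abundances are ignored)."""
    true_pos = len(truth & predicted)
    false_pos = len(predicted - truth)
    false_neg = len(truth - predicted)
    if true_pos == 0:
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


# Hypothetical ground truth and two hypothetical profiler outputs.
truth = {"Escherichia", "Bacteroides", "Lactobacillus", "Prevotella"}
profile_a = {"Escherichia", "Bacteroides", "Lactobacillus", "Clostridium"}
profile_b = {"Bacteroides", "Lactobacillus", "Prevotella", "Staphylococcus"}

score_a = f1_score(truth, profile_a)
score_b = f1_score(truth, profile_b)

# The scores are identical, yet the taxa that were missed or wrongly
# reported differ between the two profiles.
missed_a = truth - profile_a
missed_b = truth - profile_b
```

Here both profiles have three true positives, one false positive, and one false negative, so the metric cannot distinguish them, while `missed_a` and `missed_b` show that they disagree with the truth in different places.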
Affiliation(s)
- Varuni Sarwal
- Department of Computer Science, University of California–Los Angeles, Los Angeles, CA 90095, USA
| | - Jaqueline Brito
- Titus Family Department of Clinical Pharmacy, USC Alfred E. Mann School of Pharmacy and Pharmaceutical Sciences, University of Southern California, Los Angeles, CA 90089, USA
| | - Serghei Mangul
- Titus Family Department of Clinical Pharmacy, USC Alfred E. Mann School of Pharmacy and Pharmaceutical Sciences, University of Southern California, Los Angeles, CA 90089, USA
- Department of Quantitative and Computational Biology, USC Dornsife College of Letters, Arts and Sciences, University of Southern California, Los Angeles, CA 90089, USA
| | - David Koslicki
- Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA 16802, USA
- Department of Biology, The Pennsylvania State University, University Park, PA 16802, USA
- Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, PA 16802, USA
| |
|
22
|
Tosco-Herrera E, Muñoz-Barrera A, Jáspez D, Rubio-Rodríguez LA, Mendoza-Alvarez A, Rodriguez-Perez H, Jou J, Iñigo-Campos A, Corrales A, Ciuffreda L, Martinez-Bugallo F, Prieto-Morin C, García-Olivares V, González-Montelongo R, Lorenzo-Salazar JM, Marcelino-Rodriguez I, Flores C. Evaluation of a whole-exome sequencing pipeline and benchmarking of causal germline variant prioritizers. Hum Mutat 2022; 43:2010-2020. [PMID: 36054330 DOI: 10.1002/humu.24459] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 04/29/2022] [Revised: 08/20/2022] [Accepted: 08/30/2022] [Indexed: 01/25/2023]
Abstract
Most causal variants of Mendelian diseases are exonic. Whole-exome sequencing (WES) has become the diagnostic gold standard, but causative variant prioritization constitutes a bottleneck. Here we assessed an in-house sample-to-sequence pipeline and benchmarked free prioritization tools for germline causal variants from WES data. WES of 61 unselected patients with a known genetic disease cause was obtained. Variant prioritizations were performed by diverse tools and recorded to obtain a diagnostic yield when the causal variant was present in the first, fifth, and 10th top rankings. A fraction of causal variants was not captured by WES (8.2%) or did not pass the quality control criteria (13.1%). Most of the applications inspected were unavailable or had technical limitations, leaving nine tools for complete evaluation. Exomiser performed best in the top first rankings, while LIRICAL led in the top fifth rankings. Based on the more conservative top 10th rankings, Xrare had the highest diagnostic yield, followed by a three-way tie among Exomiser, LIRICAL, and PhenIX, then followed by AMELIE, TAPES, Phen-Gen, AIVar, and VarNote-PAT. Xrare, Exomiser, LIRICAL, and PhenIX are the most efficient options for variant prioritization in real patient WES data.
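The diagnostic yield described above (the share of cases whose causal variant appears within the top 1, 5, or 10 ranked candidates) can be sketched as follows. This is an illustration only, not the study's code: the function `diagnostic_yield` and the example ranks are hypothetical, and a rank of `None` is assumed to mean the tool did not report the causal variant.

```python
from typing import Optional, Sequence


def diagnostic_yield(causal_ranks: Sequence[Optional[int]], k: int) -> float:
    """Fraction of cases whose causal variant is ranked within the top k.

    A rank of None means the causal variant was not reported at all.
    """
    if not causal_ranks:
        raise ValueError("at least one case is required")
    hits = sum(1 for rank in causal_ranks if rank is not None and rank <= k)
    return hits / len(causal_ranks)


# Hypothetical ranks of the causal variant in six cases (not study data).
example_ranks = [1, 3, 7, None, 2, 12]

# Yield at the three cut-offs named in the abstract.
yields = {k: diagnostic_yield(example_ranks, k) for k in (1, 5, 10)}
```

Because the cut-offs are nested, the yield can only stay the same or grow as k increases, which is why a tool can trail at the top-1 cut-off and still lead at top-10.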
Affiliation(s)
- Eva Tosco-Herrera
- Research Unit, Hospital Universitario Nuestra Señora de Candelaria (HUNSC), Santa Cruz de Tenerife, Spain.,Escuela de Doctorado y Estudios de Posgrado de la Universidad de La Laguna (EDEPULL), Universidad de La Laguna (ULL), San Cristóbal de La Laguna, Spain
| | - Adrián Muñoz-Barrera
- Escuela de Doctorado y Estudios de Posgrado de la Universidad de La Laguna (EDEPULL), Universidad de La Laguna (ULL), San Cristóbal de La Laguna, Spain.,Genomics Division, Instituto Tecnológico y de Energías Renovables (ITER), Granadilla de Abona, Spain
| | - David Jáspez
- Genomics Division, Instituto Tecnológico y de Energías Renovables (ITER), Granadilla de Abona, Spain
| | - Luis A Rubio-Rodríguez
- Escuela de Doctorado y Estudios de Posgrado de la Universidad de La Laguna (EDEPULL), Universidad de La Laguna (ULL), San Cristóbal de La Laguna, Spain.,Genomics Division, Instituto Tecnológico y de Energías Renovables (ITER), Granadilla de Abona, Spain
| | - Alejandro Mendoza-Alvarez
- Research Unit, Hospital Universitario Nuestra Señora de Candelaria (HUNSC), Santa Cruz de Tenerife, Spain.,Escuela de Doctorado y Estudios de Posgrado de la Universidad de La Laguna (EDEPULL), Universidad de La Laguna (ULL), San Cristóbal de La Laguna, Spain
| | - Hector Rodriguez-Perez
- Research Unit, Hospital Universitario Nuestra Señora de Candelaria (HUNSC), Santa Cruz de Tenerife, Spain.,Escuela de Doctorado y Estudios de Posgrado de la Universidad de La Laguna (EDEPULL), Universidad de La Laguna (ULL), San Cristóbal de La Laguna, Spain
| | - Jonathan Jou
- Department of Surgery, University of Illinois College of Medicine, Peoria, Illinois, USA
| | - Antonio Iñigo-Campos
- Genomics Division, Instituto Tecnológico y de Energías Renovables (ITER), Granadilla de Abona, Spain
| | - Almudena Corrales
- Research Unit, Hospital Universitario Nuestra Señora de Candelaria (HUNSC), Santa Cruz de Tenerife, Spain.,CIBER de Enfermedades Respiratorias, Instituto de Salud Carlos III, Madrid, Spain
| | - Laura Ciuffreda
- Research Unit, Hospital Universitario Nuestra Señora de Candelaria (HUNSC), Santa Cruz de Tenerife, Spain
| | - Francisco Martinez-Bugallo
- Clinical Analysis Service, Hospital Universitario Nuestra Señora de Candelaria (HUNSC), Santa Cruz de Tenerife, Spain
| | - Carol Prieto-Morin
- Clinical Analysis Service, Hospital Universitario Nuestra Señora de Candelaria (HUNSC), Santa Cruz de Tenerife, Spain
| | - Víctor García-Olivares
- Genomics Division, Instituto Tecnológico y de Energías Renovables (ITER), Granadilla de Abona, Spain
| | | | - Jose Miguel Lorenzo-Salazar
- Escuela de Doctorado y Estudios de Posgrado de la Universidad de La Laguna (EDEPULL), Universidad de La Laguna (ULL), San Cristóbal de La Laguna, Spain.,Genomics Division, Instituto Tecnológico y de Energías Renovables (ITER), Granadilla de Abona, Spain
| | | | - Carlos Flores
- Research Unit, Hospital Universitario Nuestra Señora de Candelaria (HUNSC), Santa Cruz de Tenerife, Spain.,Genomics Division, Instituto Tecnológico y de Energías Renovables (ITER), Granadilla de Abona, Spain.,CIBER de Enfermedades Respiratorias, Instituto de Salud Carlos III, Madrid, Spain.,Facultad de Ciencias de la Salud, Universidad Fernando Pessoa Canarias, Las Palmas de Gran Canaria, Spain
| |
|
23
|
Welch N, Singh SS, Musich R, Mansuri MS, Bellar A, Mishra S, Chelluboyina AK, Sekar J, Attaway AH, Li L, Willard B, Hornberger TA, Dasarathy S. Shared and unique phosphoproteomics responses in skeletal muscle from exercise models and in hyperammonemic myotubes. iScience 2022; 25:105325. [PMID: 36345342 PMCID: PMC9636548 DOI: 10.1016/j.isci.2022.105325] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/18/2022] [Revised: 08/22/2022] [Accepted: 10/07/2022] [Indexed: 11/06/2022] Open
Abstract
Skeletal muscle generation of ammonia, an endogenous cytotoxin, is increased during exercise. Perturbations in ammonia metabolism consistently occur in chronic diseases, and may blunt beneficial skeletal muscle molecular responses and protein homeostasis with exercise. Phosphorylation of skeletal muscle proteins mediates cellular signaling responses to hyperammonemia and exercise. Comparative bioinformatics and machine learning-based analyses of published and experimentally derived phosphoproteomics data identified differentially expressed phosphoproteins that were unique and shared between hyperammonemic murine myotubes and skeletal muscle from exercise models. Enriched processes identified in both hyperammonemic myotubes and muscle from exercise models with selected experimental validation included protein kinase A (PKA), calcium signaling, mitogen-activated protein kinase (MAPK) signaling, and protein homeostasis. Our approach of feature extraction from comparative untargeted "omics" data allows for selection of preclinical models that recapitulate specific human exercise responses and potentially optimize functional capacity and skeletal muscle protein homeostasis with exercise in chronic diseases.
Affiliation(s)
- Nicole Welch
- Department of Inflammation and Immunity, Cleveland Clinic, Cleveland, OH 44195, USA
- Department of Gastroenterology and Hepatology, Cleveland Clinic, Cleveland, OH 44195, USA
| | - Shashi Shekhar Singh
- Department of Inflammation and Immunity, Cleveland Clinic, Cleveland, OH 44195, USA
| | - Ryan Musich
- Department of Inflammation and Immunity, Cleveland Clinic, Cleveland, OH 44195, USA
| | - M. Shahid Mansuri
- Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT 06520, USA
| | - Annette Bellar
- Department of Inflammation and Immunity, Cleveland Clinic, Cleveland, OH 44195, USA
| | - Saurabh Mishra
- Department of Inflammation and Immunity, Cleveland Clinic, Cleveland, OH 44195, USA
| | | | - Jinendiran Sekar
- Department of Inflammation and Immunity, Cleveland Clinic, Cleveland, OH 44195, USA
| | - Amy H. Attaway
- Department of Inflammation and Immunity, Cleveland Clinic, Cleveland, OH 44195, USA
| | - Ling Li
- Proteomics Core, Cleveland Clinic, Cleveland, OH 44195, USA
- Department of Comparative Biosciences, School of Veterinary Medicine, University of Wisconsin, Madison, WI 53706, USA
| | - Belinda Willard
- Proteomics Core, Cleveland Clinic, Cleveland, OH 44195, USA
- Department of Comparative Biosciences, School of Veterinary Medicine, University of Wisconsin, Madison, WI 53706, USA
| | - Troy A. Hornberger
- Department of Comparative Biosciences, School of Veterinary Medicine, University of Wisconsin, Madison, WI 53706, USA
| | - Srinivasan Dasarathy
- Department of Inflammation and Immunity, Cleveland Clinic, Cleveland, OH 44195, USA
- Department of Gastroenterology and Hepatology, Cleveland Clinic, Cleveland, OH 44195, USA
| |
|
24
|
Raufaste-Cazavieille V, Santiago R, Droit A. Multi-omics analysis: Paving the path toward achieving precision medicine in cancer treatment and immuno-oncology. Front Mol Biosci 2022; 9:962743. [PMID: 36304921 PMCID: PMC9595279 DOI: 10.3389/fmolb.2022.962743] [Citation(s) in RCA: 26] [Impact Index Per Article: 8.7] [Received: 06/06/2022] [Accepted: 09/21/2022] [Indexed: 11/13/2022] Open
Abstract
The acceleration of large-scale sequencing and the progress in high-throughput computational analyses, defined as omics, was a hallmark for the comprehension of the biological processes in human health and diseases. In cancerology, the omics approach, initiated by genomics and transcriptomics studies, has revealed an incredible complexity with unsuspected molecular diversity within the same tumor type as well as spatial and temporal heterogeneity of tumors. The integration of multiple biological layers of omics studies brought oncology to a new paradigm, from tumor site classification to pan-cancer molecular classification, offering new therapeutic opportunities for precision medicine. In this review, we will provide a comprehensive overview of the latest innovations for multi-omics integration in oncology and summarize the largest multi-omics datasets available for adult and pediatric cancers. We will present multi-omics techniques for characterizing cancer biology and show how multi-omics data can be combined with clinical data for the identification of prognostic and treatment-specific biomarkers, opening the way to personalized therapy. To conclude, we will detail the newest strategies for dissecting the tumor immune environment and host–tumor interaction. We will explore the advances in immunomics and microbiomics for biomarker identification to guide therapeutic decisions in immuno-oncology.
Affiliation(s)
| | - Raoul Santiago
- CHU de Québec Research Center, Université Laval, Québec, QC, Canada
- Division of Pediatric Hematology-Oncology, Centre Hospitalier Universitaire de L’Université Laval, Charles Bruneau Cancer Center, Québec, QC, Canada
- *Correspondence: Raoul Santiago; Arnaud Droit
| | - Arnaud Droit
- CHU de Québec Research Center, Université Laval, Québec, QC, Canada
- *Correspondence: Raoul Santiago; Arnaud Droit
| |
|
25
|
Zhao C, Dong J, Deng L, Tan Y, Jiang W, Cai Z. Molecular network strategy in multi-omics and mass spectrometry imaging. Curr Opin Chem Biol 2022; 70:102199. [PMID: 36027696 DOI: 10.1016/j.cbpa.2022.102199] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 04/24/2022] [Revised: 06/01/2022] [Accepted: 07/10/2022] [Indexed: 11/30/2022]
Abstract
Human physiological activities and pathological changes arise from the coordinated interactions of multiple molecules. Mass spectrometry (MS)-based multi-omics and MS imaging (MSI)-based spatial omics are powerful methods used to investigate molecular information related to the phenotype of interest from homogenized or sliced samples, including qualitative, relative quantitative, and spatial distribution data. The molecular network strategy provides efficient methods to help us understand and mine the biological patterns behind the phenotypic data. It illustrates and combines various relationships between molecules, and further supports molecule identification and biological interpretation. Here, we describe the recent advances in network-based analysis and its applications to different biological processes, such as obesity, central nervous system diseases, and environmental toxicology.
Affiliation(s)
- Chao Zhao
- Bionic Sensing and Intelligence Center, Institute of Biomedical and Health Engineering, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
| | - Jiyang Dong
- Department of Electronic Science, National Institute for Data Science in Health and Medicine, Xiamen University, Xiamen, China
| | - Lingli Deng
- Department of Information Engineering, East China University of Technology, China
| | - Yawen Tan
- Department of Breast and Thyroid Surgery, Shenzhen Second People's Hospital, Shenzhen, China
| | - Wei Jiang
- Department of Radiation Oncology, National Cancer Center/National Clinical Research Center for Cancer, Cancer Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Shenzhen, China
| | - Zongwei Cai
- State Key Laboratory of Environmental and Biological Analysis, Department of Chemistry, Hong Kong Baptist University, Hong Kong SAR, China.
| |
|
26
|
Alser M, Lindegger J, Firtina C, Almadhoun N, Mao H, Singh G, Gomez-Luna J, Mutlu O. From molecules to genomic variations: Accelerating genome analysis via intelligent algorithms and architectures. Comput Struct Biotechnol J 2022; 20:4579-4599. [PMID: 36090814 PMCID: PMC9436709 DOI: 10.1016/j.csbj.2022.08.019] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 05/12/2022] [Revised: 08/08/2022] [Accepted: 08/08/2022] [Indexed: 02/01/2023] Open
Abstract
We now need more than ever to make genome analysis more intelligent. We need to read, analyze, and interpret our genomes not only quickly, but also accurately and efficiently enough to scale the analysis to population level. There currently exist major computational bottlenecks and inefficiencies throughout the entire genome analysis pipeline, because state-of-the-art genome sequencing technologies are still not able to read a genome in its entirety. We describe the ongoing journey in significantly improving the performance, accuracy, and efficiency of genome analysis using intelligent algorithms and hardware architectures. We explain state-of-the-art algorithmic methods and hardware-based acceleration approaches for each step of the genome analysis pipeline and provide experimental evaluations. Algorithmic approaches exploit the structure of the genome as well as the structure of the underlying hardware. Hardware-based acceleration approaches exploit specialized microarchitectures or various execution paradigms (e.g., processing inside or near memory) along with algorithmic changes, leading to new hardware/software co-designed systems. We conclude with a foreshadowing of future challenges, benefits, and research directions triggered by the development of both very low cost yet highly error prone new sequencing technologies and specialized hardware chips for genomics. We hope that these efforts and the challenges we discuss provide a foundation for future work in making genome analysis more intelligent.
Affiliation(s)
| | | | - Can Firtina
- ETH Zurich, Gloriastrasse 35, 8092 Zürich, Switzerland
| | | | - Haiyu Mao
- ETH Zurich, Gloriastrasse 35, 8092 Zürich, Switzerland
| | | | | | - Onur Mutlu
- ETH Zurich, Gloriastrasse 35, 8092 Zürich, Switzerland
| |
|
27
|
Valecha M, Posada D. Somatic variant calling from single-cell DNA sequencing data. Comput Struct Biotechnol J 2022; 20:2978-2985. [PMID: 35782734 PMCID: PMC9218383 DOI: 10.1016/j.csbj.2022.06.013] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 04/01/2022] [Revised: 06/06/2022] [Accepted: 06/06/2022] [Indexed: 11/03/2022] Open
Abstract
Single-cell sequencing has gained popularity in recent years. Despite its numerous applications, single-cell DNA sequencing data is highly error-prone due to technical biases arising from uneven sequencing coverage, allelic dropout, and amplification error. With these artifacts, the identification of somatic genomic variants becomes a challenging task, and over the years, several methods have been developed explicitly for this type of data. Single-cell variant callers implement distinct strategies, make different use of the data, and typically result in many discordant calls when applied to real data. Here, we review current approaches for single-cell variant calling, emphasizing single nucleotide variants. We highlight their potential benefits and shortcomings to help users choose a suitable tool for their data at hand.
Key Words
- ADO, allelic dropout
- Allele dropout
- Amplification error
- CNV, copy number variant
- Indel, short insertion or deletion
- LDO, locus dropout
- SNV, single nucleotide variant
- SV, structural variant
- Single-cell genomics
- Somatic variants
- VAF, variant allele frequency
- Variant calling
- hSNP, heterozygous single-nucleotide polymorphism
- scATAC-seq, single-cell sequencing assay for transposase-accessible chromatin
- scDNA-seq, single-cell DNA sequencing
- scHi-C, single-cell Hi-C sequencing
- scMethyl-seq, single-cell Methylation sequencing
- scRNA-seq, single-cell RNA sequencing
- scWGA, single-cell whole-genome amplification
Affiliation(s)
- Monica Valecha
- CINBIO, Universidade de Vigo, 36310 Vigo, Spain
- Galicia Sur Health Research Institute (IIS Galicia Sur), SERGAS-UVIGO, Spain
| | - David Posada
- CINBIO, Universidade de Vigo, 36310 Vigo, Spain
- Galicia Sur Health Research Institute (IIS Galicia Sur), SERGAS-UVIGO, Spain
- Department of Biochemistry, Genetics, and Immunology, Universidade de Vigo, 36310 Vigo, Spain
| |
|
28
|
Plyusnin I, Truong Nguyen PT, Sironen T, Vapalahti O, Smura T, Kant R. ClusTRace, a bioinformatic pipeline for analyzing clusters in virus phylogenies. BMC Bioinformatics 2022; 23:196. [PMID: 35643449 PMCID: PMC9143711 DOI: 10.1186/s12859-022-04709-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/01/2022] [Accepted: 05/04/2022] [Indexed: 11/12/2022] Open
Abstract
BACKGROUND SARS-CoV-2 is the highly transmissible etiologic agent of coronavirus disease 2019 (COVID-19) and has become a global scientific and public health challenge since December 2019. Several new variants of SARS-CoV-2 have emerged globally raising concern about prevention and treatment of COVID-19. Early detection and in-depth analysis of the emerging variants allowing pre-emptive alert and mitigation efforts are thus of paramount importance. RESULTS Here we present ClusTRace, a novel bioinformatic pipeline for a fast and scalable analysis of sequence clusters or clades in large viral phylogenies. ClusTRace offers several high-level functionalities including lineage assignment, outlier filtering, aligning, phylogenetic tree reconstruction, cluster extraction, variant calling, visualization and reporting. ClusTRace was developed as an aid for COVID-19 transmission chain tracing in Finland with the main emphasis on fast screening of phylogenies for markers of super-spreading events and other features of concern, such as high rates of cluster growth and/or accumulation of novel mutations. CONCLUSIONS ClusTRace provides an effective interface that can significantly cut down learning and operating costs related to complex bioinformatic analysis of large viral sequence sets and phylogenies. All code is freely available from https://bitbucket.org/plyusnin/clustrace/.
Affiliation(s)
- Ilya Plyusnin
- Department of Veterinary Bioscience, University of Helsinki, 00014, Helsinki, Finland.
- Department of Virology, University of Helsinki, 00014, Helsinki, Finland.
| | | | - Tarja Sironen
- Department of Veterinary Bioscience, University of Helsinki, 00014, Helsinki, Finland
- Department of Virology, University of Helsinki, 00014, Helsinki, Finland
| | - Olli Vapalahti
- Department of Veterinary Bioscience, University of Helsinki, 00014, Helsinki, Finland
- Department of Virology, University of Helsinki, 00014, Helsinki, Finland
- Department of Virology and Immunology, Helsinki University Hospital, Diagnostic Center, 00029, Helsinki, Finland
| | - Teemu Smura
- Department of Virology, University of Helsinki, 00014, Helsinki, Finland
- Department of Virology and Immunology, Helsinki University Hospital, Diagnostic Center, 00029, Helsinki, Finland
| | - Ravi Kant
- Department of Veterinary Bioscience, University of Helsinki, 00014, Helsinki, Finland
- Department of Virology, University of Helsinki, 00014, Helsinki, Finland
| |
|
29
|
Gonzalez-Reymundez A, Grueneberg A, Lu G, Alves FC, Rincon G, Vazquez AI. MOSS: multi-omic integration with sparse value decomposition. Bioinformatics 2022; 38:2956-2958. [PMID: 35561193 PMCID: PMC9113319 DOI: 10.1093/bioinformatics/btac179] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 11/18/2021] [Revised: 03/07/2022] [Accepted: 03/23/2022] [Indexed: 02/03/2023] Open
Abstract
SUMMARY This article presents multi-omic integration with sparse value decomposition (MOSS), a free and open-source R package for integration and feature selection in multiple large omics datasets. This package is computationally efficient and offers biological insight through capabilities, such as cluster analysis and identification of informative omic features. AVAILABILITY AND IMPLEMENTATION https://CRAN.R-project.org/package=MOSS. SUPPLEMENTARY INFORMATION Supplementary information can be found at https://github.com/agugonrey/GonzalezReymundez2021.
Affiliation(s)
| | - Alexander Grueneberg
- Department of Epidemiology and Biostatistics, Michigan State University, East Lansing, MI 48824, USA
| | - Guanqi Lu
- Department of Epidemiology and Biostatistics, Michigan State University, East Lansing, MI 48824, USA
| | - Filipe Couto Alves
- Department of Epidemiology and Biostatistics, Michigan State University, East Lansing, MI 48824, USA
| | - Gonzalo Rincon
- Genus PLC Inc., Genome Sciences R&D, De Forest, WI 53532, USA
| | - Ana I Vazquez
- Department of Epidemiology and Biostatistics, Michigan State University, East Lansing, MI 48824, USA
| |
|
30
|
Macedo-da-Silva J, Coutinho JVP, Rosa-Fernandes L, Marie SKN, Palmisano G. Exploring COVID-19 pathogenesis on command-line: A bioinformatics pipeline for handling and integrating omics data. Adv Protein Chem Struct Biol 2022; 131:311-339. [PMID: 35871895 PMCID: PMC9095070 DOI: 10.1016/bs.apcsb.2022.04.002] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/26/2023]
Abstract
Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) was first identified in late 2019 in Wuhan, China, and has proven to be highly pathogenic, making it a global public health threat. The immediate need to understand the mechanisms and impact of the virus made omics techniques stand out, as they can offer a holistic and comprehensive view of thousands of molecules in a single experiment. Mastering bioinformatics tools to process, analyze, integrate, and interpret omics data is a powerful way to enrich results. We present a robust and open access computational pipeline for extracting information from public quantitative proteomics and transcriptomics data. We present the entire pipeline from raw data to differentially expressed genes. We explore processes and pathways related to mapped transcripts and proteins. A pipeline is also presented to integrate and compare proteomics and transcriptomics data using packages available in Bioconductor, and the code used is provided. Cholesterol metabolism, immune system activity, ECM, and proteasomal degradation pathways increased in infected patients. The leukocyte activation profile was overrepresented in both proteomics and transcriptomics data. Finally, we found a panel of proteins and transcripts regulated in the same direction in the lung transcriptome and plasma proteome that distinguish healthy and infected individuals. This panel of markers was confirmed in another cohort of patients, thus validating the robustness and functionality of the tools presented.
Affiliation(s)
- Janaina Macedo-da-Silva
- GlycoProteomics Laboratory, Department of Parasitology, ICB, University of São Paulo, São Paulo, Brazil
| | | | - Livia Rosa-Fernandes
- GlycoProteomics Laboratory, Department of Parasitology, ICB, University of São Paulo, São Paulo, Brazil
| | - Suely Kazue Nagahashi Marie
- Cellular and Molecular Biology Laboratory (LIM 15), Neurology Department, Faculdade de Medicina FMUSP, Universidade de Sao Paulo, Sao Paulo, Brazil
| | - Giuseppe Palmisano
- GlycoProteomics Laboratory, Department of Parasitology, ICB, University of São Paulo, São Paulo, Brazil; School of Natural Sciences, Macquarie University, Sydney, NSW, Australia.
| |
|
31
|
García-García N, Tamames J, Puente-Sánchez F. M&Ms: a versatile software for building microbial mock communities. Bioinformatics 2022; 38:2057-2059. [PMID: 35022654 DOI: 10.1093/bioinformatics/btab882] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 05/04/2021] [Revised: 12/20/2021] [Accepted: 01/10/2022] [Indexed: 02/03/2023] Open
Abstract
SUMMARY Advances in sequencing technologies have triggered the development of many bioinformatic tools aimed at analyzing 16S rDNA sequencing data. As these tools need to be tested, it is important to simulate datasets that resemble samples from different environments. Here, we introduce M&Ms, a user-friendly open-source bioinformatic tool to produce different 16S rDNA datasets from reference sequences, based on pragmatic ecological parameters. It creates sequence libraries for 'in silico' microbial communities with user-controlled richness, evenness, microdiversity and source environment. M&Ms allows the user to generate simple to complex read datasets based on real parameters that can be used in developing bioinformatic software or in benchmarking current tools. AVAILABILITY AND IMPLEMENTATION The source code of M&Ms is freely available at https://github.com/ggnatalia/MMs (GPL-3.0 License). SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Natalia García-García
- Department of Systems Biology, Centro Nacional de Biotecnología (CNB-CSIC), 28049 Madrid, Spain
| | - Javier Tamames
- Department of Systems Biology, Centro Nacional de Biotecnología (CNB-CSIC), 28049 Madrid, Spain
| | - Fernando Puente-Sánchez
- Department of Systems Biology, Address Centro Nacional de Biotecnología (CNB-CSIC), 28049 Madrid, Spain
32
Petrillo M, Fabbri M, Kagkli DM, Querci M, Van den Eede G, Alm E, Aytan-Aktug D, Capella-Gutierrez S, Carrillo C, Cestaro A, Chan KG, Coque T, Endrullat C, Gut I, Hammer P, Kay GL, Madec JY, Mather AE, McHardy AC, Naas T, Paracchini V, Peter S, Pightling A, Raffael B, Rossen J, Ruppé E, Schlaberg R, Vanneste K, Weber LM, Westh H, Angers-Loustau A. A roadmap for the generation of benchmarking resources for antimicrobial resistance detection using next generation sequencing. F1000Res 2022; 10:80. [PMID: 35847383 PMCID: PMC9243550 DOI: 10.12688/f1000research.39214.2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 03/10/2022] [Indexed: 11/20/2022] Open
Abstract
Next Generation Sequencing technologies significantly impact the field of Antimicrobial Resistance (AMR) detection and monitoring, with immediate uses in diagnosis and risk assessment. For this application and in general, considerable challenges remain in demonstrating sufficient trust to act upon the meaningful information produced from raw data, partly because of the reliance on bioinformatics pipelines, which can produce different results and therefore lead to different interpretations. With the constant evolution of the field, it is difficult to identify, harmonise and recommend specific methods for large-scale implementations over time. In this article, we propose to address this challenge through establishing a transparent, performance-based, evaluation approach to provide flexibility in the bioinformatics tools of choice, while demonstrating proficiency in meeting common performance standards. The approach is two-fold: first, a community-driven effort to establish and maintain “live” (dynamic) benchmarking platforms to provide relevant performance metrics, based on different use-cases, that would evolve together with the AMR field; second, agreed and defined datasets to allow the pipelines’ implementation, validation, and quality-control over time. Following previous discussions on the main challenges linked to this approach, we provide concrete recommendations and future steps, related to different aspects of the design of benchmarks, such as the selection and the characteristics of the datasets (quality, choice of pathogens and resistances, etc.), the evaluation criteria of the pipelines, and the way these resources should be deployed in the community.
Affiliation(s)
- Marco Fabbri
  - European Commission Joint Research Centre, Ispra, Italy
- Guy Van den Eede
  - European Commission Joint Research Centre, Ispra, Italy
  - European Commission Joint Research Centre, Geel, Belgium
- Erik Alm
  - The European Centre for Disease Prevention and Control, Stockholm, Sweden
- Derya Aytan-Aktug
  - National Food Institute, Technical University of Denmark, Lyngby, Denmark
- Catherine Carrillo
  - Ottawa Laboratory – Carling, Canadian Food Inspection Agency, Ottawa, Ontario, Canada
- Kok-Gan Chan
  - International Genome Centre, Jiangsu University, Zhenjiang, China
  - Division of Genetics and Molecular Biology, Institute of Biological Sciences, Faculty of Science, University of Malaya, Kuala Lumpur, Malaysia
- Teresa Coque
  - Servicio de Microbiología, Hospital Universitario Ramón y Cajal, Instituto Ramón y Cajal de Investigación Sanitaria (IRYCIS), Madrid, Spain
  - Spanish Consortium for Research on Epidemiology and Public Health (CIBERESP), Carlos III Health Institute, Madrid, Spain
- Ivo Gut
  - Centro Nacional de Análisis Genómico, Centre for Genomic Regulation (CNAG-CRG), Barcelona Institute of Technology, Barcelona, Spain
  - Universitat Pompeu Fabra, Barcelona, Spain
- Paul Hammer
  - BIOMES. NGS GmbH c/o Technische Hochschule Wildau, Wildau, Germany
- Gemma L. Kay
  - Quadram Institute Bioscience, Norwich Research Park, Norwich, UK
- Jean-Yves Madec
  - Unité Antibiorésistance et Virulence Bactériennes, ANSES Site de Lyon, Lyon, France
- Alison E. Mather
  - Quadram Institute Bioscience, Norwich Research Park, Norwich, UK
  - University of East Anglia, Norwich, UK
- Thierry Naas
  - French-NRC for CPEs, Service de Bactériologie-Hygiène, Hôpital de Bicêtre, Le Kremlin-Bicêtre, France
- Silke Peter
  - Institute of Medical Microbiology and Hygiene, University of Tübingen, Tübingen, Germany
- Arthur Pightling
  - Center for Food Safety and Applied Nutrition, US Food and Drug Administration, College Park, MD, USA
- John Rossen
  - Department of Medical Microbiology, University Medical Center Groningen, University of Groningen, Groningen, The Netherlands
- Robert Schlaberg
  - Department of Pathology, University of Utah, Salt Lake City, UT, USA
- Kevin Vanneste
  - Transversal activities in Applied Genomics, Sciensano, Brussels, Belgium
- Lukas M. Weber
  - Institute of Molecular Life Sciences, University of Zurich, Zurich, Switzerland
  - SIB Swiss Institute of Bioinformatics, University of Zurich, Zurich, Switzerland
  - Present address: Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA
33
Kundu R, Chattopadhyay S, Cuevas E, Sarkar R. AltWOA: Altruistic Whale Optimization Algorithm for feature selection on microarray datasets. Comput Biol Med 2022; 144:105349. [PMID: 35303580 DOI: 10.1016/j.compbiomed.2022.105349] [Citation(s) in RCA: 24] [Impact Index Per Article: 8.0] [Received: 10/02/2021] [Revised: 02/22/2022] [Accepted: 02/22/2022] [Indexed: 12/15/2022]
Abstract
The data-driven modern era has enabled the collection of large amounts of biomedical and clinical data. DNA microarray gene expression datasets have mainly gained significant attention to the research community owing to their ability to identify diseases through the "bio-markers" or specific alterations in the gene sequence that represent that particular disease (for example, different types of cancer). However, gene expression datasets are very high-dimensional, while only a few of those are "bio-markers". Meta-heuristic-based feature selection effectively filters out only the relevant genes from a large set of attributes efficiently to reduce data storage and computation requirements. To this end, in this paper, we propose an Altruistic Whale Optimization Algorithm (AltWOA) for the feature selection problem in high-dimensional microarray data. AltWOA is an improvement on the basic Whale Optimization Algorithm. We embed the concept of altruism in the whale population to help efficient propagation of candidate solutions that can reach the global optima over the iterations. Evaluation of the proposed method on eight high dimensional microarray datasets reveals the superiority of AltWOA compared to popular and classical techniques in the literature on the same datasets both in terms of accuracy and the final number of features selected. The relevant codes for the proposed approach are available publicly at https://github.com/Rohit-Kundu/AltWOA.
Affiliation(s)
- Rohit Kundu
  - Department of Electrical Engineering, Jadavpur University, Kolkata, 700032, India.
- Soham Chattopadhyay
  - Department of Electrical Engineering, Jadavpur University, Kolkata, 700032, India.
- Erik Cuevas
  - Departamento de Electrónica, Universidad de Guadalajara, CUCEI, Av. Revolución 1500, Guadalajara, Jal, Mexico.
- Ram Sarkar
  - Department of Computer Science & Engineering, Jadavpur University, Kolkata, 700032, India.
34
Chen D, Randhawa GS, Soltysiak MPM, de Souza CPE, Kari L, Singh SM, Hill KA. SomaticSiMu: A mutational signature simulator. Bioinformatics 2022; 38:2619-2620. [PMID: 35258549 DOI: 10.1093/bioinformatics/btac128] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/30/2021] [Revised: 02/01/2022] [Indexed: 11/14/2022] Open
Abstract
SUMMARY: SomaticSiMu is an in silico simulator of single and double base substitutions, and single base insertions and deletions in an input genomic sequence to mimic mutational signatures. SomaticSiMu outputs simulated DNA sequences and mutational catalogues with imposed mutational signatures. The tool is the first mutational signature simulator featuring a graphical user interface, control of mutation rates, and built-in visualization tools of the simulated mutations. Simulated datasets are useful as a ground truth to test the accuracy and sensitivity of DNA sequence classification tools and mutational signature extraction tools under different experimental scenarios. The reliability of SomaticSiMu was affirmed by 1) supervised machine learning classification of simulated sequences with different mutation types and burdens, and 2) mutational signature extraction from simulated mutational catalogues. AVAILABILITY AND IMPLEMENTATION: SomaticSiMu is written in Python 3.8.3. The open-source code, documentation, and tutorials are available at https://github.com/HillLab/SomaticSiMu under the terms of the Creative Commons Attribution 4.0 International License. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- David Chen
  - Department of Biology, Western University, London, Ontario, Canada
- Gurjit S Randhawa
  - School of Computer Science, University of Waterloo, Waterloo, Ontario, Canada
- Camila P E de Souza
  - Department of Statistical and Actuarial Sciences, Western University, London, ON, Canada
- Lila Kari
  - School of Computer Science, University of Waterloo, Waterloo, Ontario, Canada
- Shiva M Singh
  - Department of Biology, Western University, London, Ontario, Canada
- Kathleen A Hill
  - Department of Biology, Western University, London, Ontario, Canada
35
Brüning RS, Tombor L, Schulz MH, Dimmeler S, John D. Comparative analysis of common alignment tools for single-cell RNA sequencing. Gigascience 2022; 11:giac001. [PMID: 35084033 PMCID: PMC8848315 DOI: 10.1093/gigascience/giac001] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 04/30/2021] [Revised: 10/07/2021] [Accepted: 12/27/2021] [Indexed: 02/06/2023] Open
Abstract
BACKGROUND: With the rise of single-cell RNA sequencing new bioinformatic tools have been developed to handle specific demands, such as quantifying unique molecular identifiers and correcting cell barcodes. Here, we benchmarked several datasets with the most common alignment tools for single-cell RNA sequencing data. We evaluated differences in the whitelisting, gene quantification, overall performance, and potential variations in clustering or detection of differentially expressed genes. We compared the tools Cell Ranger version 6, STARsolo, Kallisto, Alevin, and Alevin-fry on 3 published datasets for human and mouse, sequenced with different versions of the 10X sequencing protocol. RESULTS: Striking differences were observed in the overall runtime of the mappers. Besides that, Kallisto and Alevin showed variances in the number of valid cells and detected genes per cell. Kallisto reported the highest number of cells; however, we observed an overrepresentation of cells with low gene content and unknown cell type. Conversely, Alevin rarely reported such low-content cells. Further variations were detected in the set of expressed genes. While STARsolo, Cell Ranger 6, Alevin-fry, and Alevin produced similar gene sets, Kallisto detected additional genes from the Vmn and Olfr gene family, which are likely mapping artefacts. We also observed differences in the mitochondrial content of the resulting cells when comparing a prefiltered annotation set to the full annotation set that includes pseudogenes and other biotypes. CONCLUSION: Overall, this study provides a detailed comparison of common single-cell RNA sequencing mappers and shows their specific properties on 10X Genomics data.
Affiliation(s)
- Ralf Schulze Brüning
  - Institute of Cardiovascular Regeneration, Theodor-Stern-Kai 7, 60590 Frankfurt, Germany
  - Cardio-Pulmonary Institute (CPI), Theodor-Stern-Kai 7, 60590 Frankfurt, Germany
- Lukas Tombor
  - Institute of Cardiovascular Regeneration, Theodor-Stern-Kai 7, 60590 Frankfurt, Germany
  - German Center for Cardiovascular Research (DZHK), Potsdamer Str. 58 10785 Berlin, Germany
- Marcel H Schulz
  - Institute of Cardiovascular Regeneration, Theodor-Stern-Kai 7, 60590 Frankfurt, Germany
  - Cardio-Pulmonary Institute (CPI), Theodor-Stern-Kai 7, 60590 Frankfurt, Germany
  - German Center for Cardiovascular Research (DZHK), Potsdamer Str. 58 10785 Berlin, Germany
- Stefanie Dimmeler
  - Institute of Cardiovascular Regeneration, Theodor-Stern-Kai 7, 60590 Frankfurt, Germany
  - Cardio-Pulmonary Institute (CPI), Theodor-Stern-Kai 7, 60590 Frankfurt, Germany
  - German Center for Cardiovascular Research (DZHK), Potsdamer Str. 58 10785 Berlin, Germany
- David John
  - Institute of Cardiovascular Regeneration, Theodor-Stern-Kai 7, 60590 Frankfurt, Germany
  - Cardio-Pulmonary Institute (CPI), Theodor-Stern-Kai 7, 60590 Frankfurt, Germany
36
Romano JD, Le TT, La Cava W, Gregg JT, Goldberg DJ, Chakraborty P, Ray NL, Himmelstein D, Fu W, Moore JH. PMLB v1.0: an open-source dataset collection for benchmarking machine learning methods. Bioinformatics 2022; 38:878-880. [PMID: 34677586 PMCID: PMC8756190 DOI: 10.1093/bioinformatics/btab727] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 04/02/2021] [Revised: 08/17/2021] [Accepted: 10/18/2021] [Indexed: 02/04/2023] Open
Abstract
MOTIVATION: Novel machine learning and statistical modeling studies rely on standardized comparisons to existing methods using well-studied benchmark datasets. Few tools exist that provide rapid access to many of these datasets through a standardized, user-friendly interface that integrates well with popular data science workflows. RESULTS: This release of PMLB (Penn Machine Learning Benchmarks) provides the largest collection of diverse, public benchmark datasets for evaluating new machine learning and data science methods aggregated in one location. v1.0 introduces a number of critical improvements developed following discussions with the open-source community. AVAILABILITY AND IMPLEMENTATION: PMLB is available at https://github.com/EpistasisLab/pmlb. Python and R interfaces for PMLB can be installed through the Python Package Index and Comprehensive R Archive Network, respectively.
Affiliation(s)
- Joseph D Romano
  - Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, PA 19104, USA
  - Center of Excellence in Environmental Toxicology, University of Pennsylvania, Philadelphia, PA 19104, USA
- Trang T Le
  - Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, PA 19104, USA
- William La Cava
  - Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, PA 19104, USA
- John T Gregg
  - Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, PA 19104, USA
- Daniel J Goldberg
  - Department of Computer Science & Engineering, Washington University in St. Louis, St. Louis, MO 63130, USA
- Praneel Chakraborty
  - School of Arts and Sciences, University of Pennsylvania, Philadelphia, PA 19104, USA
  - Wharton School, University of Pennsylvania, Philadelphia, PA 19104, USA
- Daniel Himmelstein
  - Related Sciences, Denver, CO 80220, USA
  - Department of Systems Pharmacology & Translational Therapeutics, University of Pennsylvania, Philadelphia, PA 19104, USA
- Weixuan Fu
  - Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, PA 19104, USA
- Jason H Moore
  - Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, PA 19104, USA
37
Van Den Bossche T, Kunath BJ, Schallert K, Schäpe SS, Abraham PE, Armengaud J, Arntzen MØ, Bassignani A, Benndorf D, Fuchs S, Giannone RJ, Griffin TJ, Hagen LH, Halder R, Henry C, Hettich RL, Heyer R, Jagtap P, Jehmlich N, Jensen M, Juste C, Kleiner M, Langella O, Lehmann T, Leith E, May P, Mesuere B, Miotello G, Peters SL, Pible O, Queiros PT, Reichl U, Renard BY, Schiebenhoefer H, Sczyrba A, Tanca A, Trappe K, Trezzi JP, Uzzau S, Verschaffelt P, von Bergen M, Wilmes P, Wolf M, Martens L, Muth T. Critical Assessment of MetaProteome Investigation (CAMPI): a multi-laboratory comparison of established workflows. Nat Commun 2021; 12:7305. [PMID: 34911965 PMCID: PMC8674281 DOI: 10.1038/s41467-021-27542-8] [Citation(s) in RCA: 50] [Impact Index Per Article: 12.5] [Received: 04/02/2021] [Accepted: 11/24/2021] [Indexed: 12/17/2022] Open
Abstract
Metaproteomics has matured into a powerful tool to assess functional interactions in microbial communities. While many metaproteomic workflows are available, the impact of method choice on results remains unclear. Here, we carry out a community-driven, multi-laboratory comparison in metaproteomics: the critical assessment of metaproteome investigation study (CAMPI). Based on well-established workflows, we evaluate the effect of sample preparation, mass spectrometry, and bioinformatic analysis using two samples: a simplified, laboratory-assembled human intestinal model and a human fecal sample. We observe that variability at the peptide level is predominantly due to sample processing workflows, with a smaller contribution of bioinformatic pipelines. These peptide-level differences largely disappear at the protein group level. While differences are observed for predicted community composition, similar functional profiles are obtained across workflows. CAMPI demonstrates the robustness of present-day metaproteomics research, serves as a template for multi-laboratory studies in metaproteomics, and provides publicly available data sets for benchmarking future developments.
Affiliation(s)
- Tim Van Den Bossche
  - VIB - UGent Center for Medical Biotechnology, VIB, Ghent, Belgium
  - Department of Biomolecular Medicine, Faculty of Medicine and Health Sciences, Ghent University, Ghent, Belgium
- Benoit J Kunath
  - Luxembourg Centre for Systems Biomedicine, University of Luxembourg, Esch-sur-Alzette, Luxembourg
- Kay Schallert
  - Bioprocess Engineering, Otto-von-Guericke University Magdeburg, Magdeburg, Germany
- Stephanie S Schäpe
  - Department of Molecular Systems Biology, Helmholtz-Centre for Environmental Research - UFZ GmbH, Leipzig, Germany
- Paul E Abraham
  - Biosciences Division, Oak Ridge National Laboratory, Oak Ridge, TN, USA
- Jean Armengaud
  - Département Médicaments et Technologies pour la Santé (DMTS), Université Paris Saclay, CEA, INRAE, SPI, 30200, Bagnols-sur-Cèze, France
- Magnus Ø Arntzen
  - Faculty of Chemistry, Biotechnology and Food Science, Norwegian University of Life Sciences (NMBU), Ås, Norway
- Ariane Bassignani
  - INRAE, AgroParisTech, Micalis Institute, Université Paris-Saclay, 78350, Jouy-en-Josas, France
- Dirk Benndorf
  - Bioprocess Engineering, Otto-von-Guericke University Magdeburg, Magdeburg, Germany
  - Microbiology, Department of Applied Biosciences and Process Technology, Anhalt University of Applied Sciences, Köthen, Germany
  - Bioprocess Engineering, Max Planck Institute for Dynamics of Complex Technical Systems, Magdeburg, Germany
- Stephan Fuchs
  - Bioinformatics Unit (MF1), Department for Methods Development and Research Infrastructure, Robert Koch Institute, Berlin, Germany
- Timothy J Griffin
  - Department of Biochemistry Molecular Biology and Biophysics, University of Minnesota, Minneapolis, MN, USA
- Live H Hagen
  - Faculty of Chemistry, Biotechnology and Food Science, Norwegian University of Life Sciences (NMBU), Ås, Norway
- Rashi Halder
  - Luxembourg Centre for Systems Biomedicine, University of Luxembourg, Esch-sur-Alzette, Luxembourg
- Céline Henry
  - INRAE, AgroParisTech, Micalis Institute, Université Paris-Saclay, 78350, Jouy-en-Josas, France
- Robert L Hettich
  - Biosciences Division, Oak Ridge National Laboratory, Oak Ridge, TN, USA
- Robert Heyer
  - Bioprocess Engineering, Otto-von-Guericke University Magdeburg, Magdeburg, Germany
- Pratik Jagtap
  - Department of Biochemistry Molecular Biology and Biophysics, University of Minnesota, Minneapolis, MN, USA
- Nico Jehmlich
  - Department of Molecular Systems Biology, Helmholtz-Centre for Environmental Research - UFZ GmbH, Leipzig, Germany
- Marlene Jensen
  - Department of Plant & Microbial Biology, North Carolina State University, Raleigh, USA
- Catherine Juste
  - INRAE, AgroParisTech, Micalis Institute, Université Paris-Saclay, 78350, Jouy-en-Josas, France
- Manuel Kleiner
  - Department of Plant & Microbial Biology, North Carolina State University, Raleigh, USA
- Olivier Langella
  - Université Paris-Saclay, INRAE, CNRS, AgroParisTech, GQE - Le Moulon, 91190, Gif-sur-Yvette, France
- Theresa Lehmann
  - Bioprocess Engineering, Otto-von-Guericke University Magdeburg, Magdeburg, Germany
- Emma Leith
  - Department of Biochemistry Molecular Biology and Biophysics, University of Minnesota, Minneapolis, MN, USA
- Patrick May
  - Luxembourg Centre for Systems Biomedicine, University of Luxembourg, Esch-sur-Alzette, Luxembourg
- Bart Mesuere
  - VIB - UGent Center for Medical Biotechnology, VIB, Ghent, Belgium
  - Department of Applied Mathematics, Computer Science and Statistics, Ghent University, Ghent, Belgium
- Guylaine Miotello
  - Département Médicaments et Technologies pour la Santé (DMTS), Université Paris Saclay, CEA, INRAE, SPI, 30200, Bagnols-sur-Cèze, France
- Samantha L Peters
  - Biosciences Division, Oak Ridge National Laboratory, Oak Ridge, TN, USA
- Olivier Pible
  - Département Médicaments et Technologies pour la Santé (DMTS), Université Paris Saclay, CEA, INRAE, SPI, 30200, Bagnols-sur-Cèze, France
- Pedro T Queiros
  - Luxembourg Centre for Systems Biomedicine, University of Luxembourg, Esch-sur-Alzette, Luxembourg
- Udo Reichl
  - Bioprocess Engineering, Otto-von-Guericke University Magdeburg, Magdeburg, Germany
  - Bioprocess Engineering, Max Planck Institute for Dynamics of Complex Technical Systems, Magdeburg, Germany
- Bernhard Y Renard
  - Bioinformatics Unit (MF1), Department for Methods Development and Research Infrastructure, Robert Koch Institute, Berlin, Germany
  - Data Analytics and Computational Statistics, Hasso-Plattner-Institute, Faculty of Digital Engineering, University of Potsdam, Potsdam, Germany
- Henning Schiebenhoefer
  - Bioinformatics Unit (MF1), Department for Methods Development and Research Infrastructure, Robert Koch Institute, Berlin, Germany
  - Data Analytics and Computational Statistics, Hasso-Plattner-Institute, Faculty of Digital Engineering, University of Potsdam, Potsdam, Germany
- Alessandro Tanca
  - Department of Biomedical Sciences, University of Sassari, Sassari, Italy
- Kathrin Trappe
  - Bioinformatics Unit (MF1), Department for Methods Development and Research Infrastructure, Robert Koch Institute, Berlin, Germany
- Jean-Pierre Trezzi
  - Luxembourg Centre for Systems Biomedicine, University of Luxembourg, Esch-sur-Alzette, Luxembourg
  - Integrated Biobank of Luxembourg, Luxembourg Institute of Health, 1, rue Louis Rech, L-3555, Dudelange, Luxembourg
- Sergio Uzzau
  - Department of Biomedical Sciences, University of Sassari, Sassari, Italy
- Pieter Verschaffelt
  - VIB - UGent Center for Medical Biotechnology, VIB, Ghent, Belgium
  - Department of Applied Mathematics, Computer Science and Statistics, Ghent University, Ghent, Belgium
- Martin von Bergen
  - Department of Molecular Systems Biology, Helmholtz-Centre for Environmental Research - UFZ GmbH, Leipzig, Germany
- Paul Wilmes
  - Luxembourg Centre for Systems Biomedicine, University of Luxembourg, Esch-sur-Alzette, Luxembourg
  - Department of Life Sciences and Medicine, Faculty of Science, Technology and Medicine, University of Luxembourg, 6 avenue du Swing, L-4367, Belvaux, Luxembourg
- Maximilian Wolf
  - Bioprocess Engineering, Otto-von-Guericke University Magdeburg, Magdeburg, Germany
- Lennart Martens
  - VIB - UGent Center for Medical Biotechnology, VIB, Ghent, Belgium.
  - Department of Biomolecular Medicine, Faculty of Medicine and Health Sciences, Ghent University, Ghent, Belgium.
- Thilo Muth
  - Section eScience (S.3), Federal Institute for Materials Research and Testing, Berlin, Germany
38
Van Den Bossche T, Kunath BJ, Schallert K, Schäpe SS, Abraham PE, Armengaud J, Arntzen MØ, Bassignani A, Benndorf D, Fuchs S, Giannone RJ, Griffin TJ, Hagen LH, Halder R, Henry C, Hettich RL, Heyer R, Jagtap P, Jehmlich N, Jensen M, Juste C, Kleiner M, Langella O, Lehmann T, Leith E, May P, Mesuere B, Miotello G, Peters SL, Pible O, Queiros PT, Reichl U, Renard BY, Schiebenhoefer H, Sczyrba A, Tanca A, Trappe K, Trezzi JP, Uzzau S, Verschaffelt P, von Bergen M, Wilmes P, Wolf M, Martens L, Muth T. Critical Assessment of MetaProteome Investigation (CAMPI): a multi-laboratory comparison of established workflows. Nat Commun 2021; 12:7305. [PMID: 34911965 DOI: 10.1101/2021.03.05.433915] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 04/02/2021] [Accepted: 11/24/2021] [Indexed: 05/21/2023] Open
39
Tyagi P, Bhide M. Development of a bioinformatics platform for analysis of quantitative transcriptomics and proteomics data: the OMnalysis. PeerJ 2021; 9:e12415. [PMID: 34820180 PMCID: PMC8588854 DOI: 10.7717/peerj.12415] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/20/2021] [Accepted: 10/10/2021] [Indexed: 11/20/2022] Open
Abstract
BACKGROUND Over the past decade, RNA sequencing and mass spectrometry-based quantitative approaches have been commonly used to identify differentially expressed biomarkers in different biological conditions. Data generated from these approaches come in different sizes (e.g., count matrix, normalized list of differentially expressed biomarkers, etc.) and shapes (e.g., sequences, spectral data, etc.). The list of differentially expressed biomarkers is used for functional interpretation and to retrieve biological meaning; however, this requires moderate computational skills. Thus, researchers with no programming expertise find data interpretation difficult. Several bioinformatics tools are available to analyze such data; however, they are less flexible for performing the multiple steps of visualization and functional interpretation. IMPLEMENTATION We developed an easy-to-use Shiny-based web application (named OMnalysis) that provides users with a single platform to analyze and visualize differentially expressed data. OMnalysis accepts data in tabular form from edgeR, DESeq2, MaxQuant Perseus, R packages, and other similar software, which typically contains the list of differentially expressed genes or proteins, the log of the fold change, the log of the counts per million, the P value, the q-value, etc. The key features of OMnalysis are visualization in multiple image types with customizable dimensions, seven multiple hypothesis testing correction methods to obtain more significant gene ontology terms, network topology-based pathway analysis, and support for multiple databases (KEGG, Reactome, PANTHER, BioCarta, NCI-Nature Pathway Interaction Database, PharmGKB, and STRINGdb) for extensive pathway enrichment analysis. OMnalysis also fetches literature information from PubMed to provide supporting evidence for the biomarkers identified in the analysis.
In a nutshell, we present OMnalysis as a well-organized user interface, supported by peer-reviewed R packages with updated databases, for quick interpretation of differential transcriptomics and proteomics data into biological meaning. AVAILABILITY The OMnalysis code is written entirely in R and is freely available at https://github.com/Punit201016/OMnalysis. OMnalysis can also be accessed at http://lbmi.uvlf.sk/omnalysis.html. OMnalysis is hosted on a Shiny server at https://omnalysis.shinyapps.io/OMnalysis/. The minimum system requirements are 4 gigabytes of RAM and an i3 processor (or equivalent). It is compatible with any operating system (Windows, Linux, or Mac). OMnalysis has been tested most heavily in the Chrome web browser; thus, Chrome is the preferred browser. OMnalysis also works in Firefox and Safari.
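The abstract mentions seven multiple hypothesis testing correction methods without naming them. As a rough illustration of what one such correction does, the sketch below implements the widely used Benjamini-Hochberg procedure in plain Python. Whether OMnalysis offers this specific method is an assumption, and the function name and the p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


if __name__ == "__main__":
    # Made-up p-values for five hypothetical genes A-E.
    p = [0.01, 0.04, 0.03, 0.005, 0.20]
    for gene, q in zip("ABCDE", benjamini_hochberg(p)):
        print(gene, round(q, 4))
```

Each raw p-value is scaled by the number of tests divided by its rank, and the running minimum keeps the adjusted values from decreasing as the raw p-values grow.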
Affiliation(s)
- Punit Tyagi
  - Laboratory of Biomedical Microbiology and Immunology, University of Veterinary Medicine and Pharmacy in Kosice, Kosice, Slovakia
  - Department of Animal and Food Science, The Autonomous University of Barcelona, Barcelona, Spain
- Mangesh Bhide
  - Laboratory of Biomedical Microbiology and Immunology, University of Veterinary Medicine and Pharmacy in Kosice, Kosice, Slovakia
  - Institute of Neuroimmunology, Slovak Academy of Sciences, Bratislava, Slovakia
40
Nadel BB, Oliva M, Shou BL, Mitchell K, Ma F, Montoya DJ, Mouton A, Kim-Hellmuth S, Stranger BE, Pellegrini M, Mangul S. Systematic evaluation of transcriptomics-based deconvolution methods and references using thousands of clinical samples. Brief Bioinform 2021; 22:bbab265. [PMID: 34346485 PMCID: PMC8768458 DOI: 10.1093/bib/bbab265] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.8] [Received: 04/14/2021] [Revised: 06/07/2021] [Accepted: 06/21/2021] [Indexed: 11/13/2022] Open
Abstract
Estimating cell type composition of blood and tissue samples is a biological challenge relevant in both laboratory studies and clinical care. In recent years, a number of computational tools have been developed to estimate cell type abundance using gene expression data. Although these tools use a variety of approaches, they all leverage expression profiles from purified cell types to evaluate the cell type composition within samples. In this study, we compare 12 cell type quantification tools and evaluate their performance while using each of 10 separate reference profiles. Specifically, we have run each tool on over 4000 samples with known cell type proportions, spanning both immune and stromal cell types. Of these, 12 are in vitro synthetic mixtures and 300 are in silico synthetic mixtures prepared using single-cell data. The remaining 3728 are clinical samples collected from the Framingham cohort, for which cell populations have been quantified using electrical impedance cell counting. When tools are applied to the Framingham dataset, the tool Estimating the Proportions of Immune and Cancer cells (EPIC) produces the highest correlation, whereas Gene Expression Deconvolution Interactive Tool (GEDIT) produces the lowest error. The best tool varies across the other datasets, but CIBERSORT and GEDIT most consistently produce accurate results. We find that the optimal reference depends on the tool used, and we report suggested references for each tool. Most tools return results within minutes, but on large datasets runtimes for CIBERSORT can exceed hours or even days. We conclude that deconvolution methods are capable of returning high-quality results, but that proper reference selection is critical.
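The abstract reports that one tool gives the highest correlation while a different tool gives the lowest error. A minimal sketch of why these two criteria can rank tools differently is given below. The proportions are made-up numbers, `tool_a` and `tool_b` are hypothetical tools, and Pearson correlation and root-mean-square error are assumed here as the metrics, since the abstract does not state which ones were used.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def rmse(x, y):
    """Root-mean-square error between two equal-length lists."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))


# Known proportions of one cell type across four samples (made-up values).
truth = [0.10, 0.20, 0.30, 0.40]
# Hypothetical tool A tracks the trend perfectly but overestimates by 0.2.
tool_a = [0.30, 0.40, 0.50, 0.60]
# Hypothetical tool B is close in absolute terms but noisier in trend.
tool_b = [0.12, 0.17, 0.33, 0.38]

if __name__ == "__main__":
    for name, pred in (("A", tool_a), ("B", tool_b)):
        print(name, round(pearson(truth, pred), 3), round(rmse(truth, pred), 3))
```

In this toy case tool A has the higher correlation and tool B the lower error, so the choice of metric changes which tool looks best.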
Affiliation(s)
- Brian B Nadel
  - Corresponding authors: Brian B. Nadel, Tel: 310-963-7077; Matteo Pellegrini, Tel: 310-825-0012; Serghei Mangul, Tel: 323-442-0043
- Serghei Mangul
  - Corresponding authors: Brian B. Nadel, Tel: 310-963-7077; Matteo Pellegrini, Tel: 310-825-0012; Serghei Mangul, Tel: 323-442-0043
41
Tan SI, Hsiang CC, Ng IS. Tailoring Genetic Elements of the Plasmid-Driven T7 System for Stable and Robust One-Step Cloning and Protein Expression in Broad Escherichia coli. ACS Synth Biol 2021; 10:2753-2762. [PMID: 34597025 DOI: 10.1021/acssynbio.1c00361] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Indexed: 12/27/2022]
Abstract
The plasmid-driven T7 system (PDT7) is a flexible approach to trigger protein overexpression; however, most reported PDT7 systems rely on many auxiliary elements or inducible systems to attenuate the toxicity arising from the orthogonality of the T7 system, which limits their application as a one-step cloning and protein expression system. In this study, we developed a stable and robust PDT7 by tailoring its genetic elements. Through error-prone mutagenesis, a mutated T7RNAP with a TTTT insertion provided a trace but sufficient amount of T7RNAP for a stable and efficient PDT7, denoted PDT7m. The replication origin was kept at the same level, while the ribosome binding site (RBS) of the T7 promoter was the most important contributing factor, enhancing protein expression twofold with PDT7m. For application as a host-independent screening platform, both constitutive and IPTG-inducible PDT7m were constructed. Each strain was found to show a different IPTG inducibility, which allows tailor-made strain selection. Constitutive PDT7m was successfully used to express a homologous protein (i.e., lysine decarboxylase) or a heterologous protein (i.e., carbonic anhydrase, CA) as a one-step cloning and protein expression tool to select the best strain for cadaverine (DAP) or CA production, respectively. Additionally, PDT7m is compatible with the pET system for simultaneous coproduction of DAP and CA. Finally, PDT7m was used for in vivo high-end chemical production of aminolevulinic acid (ALA), in which addition of the T7 terminator yielded a 340% enhancement in ALA titer, thus paving the way to rapid and effective screening for superior strains as cell factories.
Affiliation(s)
- Shih-I Tan
  - Department of Chemical Engineering, National Cheng Kung University, Tainan 701, Taiwan, ROC
- Chuan-Chieh Hsiang
  - Department of Chemical Engineering, National Cheng Kung University, Tainan 701, Taiwan, ROC
- I-Son Ng
  - Department of Chemical Engineering, National Cheng Kung University, Tainan 701, Taiwan, ROC
42
Orchestrating and sharing large multimodal data for transparent and reproducible research. Nat Commun 2021; 12:5797. [PMID: 34608132 PMCID: PMC8490371 DOI: 10.1038/s41467-021-25974-w] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 03/18/2021] [Accepted: 09/08/2021] [Indexed: 11/08/2022] Open
Abstract
Reproducibility is essential to open science, as there is limited relevance for findings that cannot be reproduced by independent research groups, regardless of their validity. It is therefore crucial for scientists to describe their experiments in sufficient detail so they can be reproduced, scrutinized, challenged, and built upon. However, the intrinsic complexity and continuous growth of biomedical data make it increasingly difficult to process, analyze, and share these data with the community in a FAIR (findable, accessible, interoperable, and reusable) manner. To overcome these issues, we created a cloud-based platform called ORCESTRA (orcestra.ca), which provides a flexible framework for the reproducible processing of multimodal biomedical data. It enables processing of clinical, genomic, and perturbation profiles of cancer samples through automated processing pipelines that are user-customizable. ORCESTRA creates integrated and fully documented data objects with persistent identifiers (DOI) and manages multiple dataset versions, which can be shared for future studies.
43
Decamps C, Arnaud A, Petitprez F, Ayadi M, Baurès A, Armenoult L, Escalera S, Guyon I, Nicolle R, Tomasini R, de Reyniès A, Cros J, Blum Y, Richard M. DECONbench: a benchmarking platform dedicated to deconvolution methods for tumor heterogeneity quantification. BMC Bioinformatics 2021; 22:473. [PMID: 34600479 PMCID: PMC8487526 DOI: 10.1186/s12859-021-04381-4] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 10/30/2020] [Accepted: 09/20/2021] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Quantification of tumor heterogeneity is essential to better understand cancer progression and to adapt therapeutic treatments to patient specificities. Bioinformatic tools to assess the different cell populations from single-omic datasets, such as bulk transcriptome or methylome samples, have recently been developed, including reference-based and reference-free methods. Improved methods using multi-omic datasets have yet to be developed, and the community needs systematic tools to perform a comparative evaluation of these algorithms on controlled data. RESULTS We present DECONbench, a standardized, unbiased benchmarking resource applied to the evaluation of computational methods quantifying cell-type heterogeneity in cancer. DECONbench includes gold standard simulated benchmark datasets, consisting of transcriptome and methylome profiles mimicking pancreatic adenocarcinoma molecular heterogeneity, and a set of baseline deconvolution methods (reference-free algorithms inferring cell-type proportions). DECONbench performs a systematic performance evaluation of each new methodological contribution and provides the possibility to publicly share source code and scoring. CONCLUSION DECONbench allows continuous submission of new methods in a user-friendly fashion, with each novel contribution being automatically compared to the reference baseline methods, which enables crowdsourced benchmarking. DECONbench is designed to serve as a reference platform for the benchmarking of deconvolution methods in the evaluation of cancer heterogeneity. We believe it will contribute to improving benchmarking practices in the biomedical and life science communities. DECONbench is hosted on the open source Codalab competition platform. It is freely available at: https://competitions.codalab.org/competitions/27453.
Affiliation(s)
- Clémentine Decamps
  - Laboratory TIMC-IMAG, UMR 5525, CNRS, Univ. Grenoble Alpes, Grenoble, France
- Alexis Arnaud
  - Data Institute, Univ. Grenoble Alpes, Grenoble, France
- Florent Petitprez
  - Programme Cartes d'Identité des Tumeurs (CIT), Ligue Nationale Contre le Cancer, Paris, France
- Mira Ayadi
  - Programme Cartes d'Identité des Tumeurs (CIT), Ligue Nationale Contre le Cancer, Paris, France
- Aurélia Baurès
  - Programme Cartes d'Identité des Tumeurs (CIT), Ligue Nationale Contre le Cancer, Paris, France
- Lucile Armenoult
  - Programme Cartes d'Identité des Tumeurs (CIT), Ligue Nationale Contre le Cancer, Paris, France
- Sergio Escalera
  - Universitat de Barcelona and Computer Vision Center, Barcelona, Spain
- Isabelle Guyon
  - LISN (INRIA/CNRS), Université Paris-Saclay, Gif-sur-Yvette, France
- Rémy Nicolle
  - Programme Cartes d'Identité des Tumeurs (CIT), Ligue Nationale Contre le Cancer, Paris, France
- Aurélien de Reyniès
  - Programme Cartes d'Identité des Tumeurs (CIT), Ligue Nationale Contre le Cancer, Paris, France
- Jérôme Cros
  - Dpt of Pathology, Beaujon Hospital, Univ. Paris-INSERM U1149, Clichy, France
- Yuna Blum
  - Programme Cartes d'Identité des Tumeurs (CIT), Ligue Nationale Contre le Cancer, Paris, France
  - IGDR UMR 6290, CNRS, Université de Rennes 1, Rennes, France
- Magali Richard
  - Laboratory TIMC-IMAG, UMR 5525, CNRS, Univ. Grenoble Alpes, Grenoble, France
44
Leipzig J, Nüst D, Hoyt CT, Ram K, Greenberg J. The role of metadata in reproducible computational research. Patterns (New York, N.Y.) 2021; 2:100322. [PMID: 34553169 PMCID: PMC8441584 DOI: 10.1016/j.patter.2021.100322] [Citation(s) in RCA: 23] [Impact Index Per Article: 5.8] [Indexed: 01/22/2023]
Abstract
Reproducible computational research (RCR) is the keystone of the scientific method for in silico analyses, packaging the transformation of raw data to published results. In addition to its role in research integrity, improving the reproducibility of scientific studies can accelerate evaluation and reuse. This potential and wide support for the FAIR principles have motivated interest in metadata standards supporting reproducibility. Metadata provide context and provenance to raw data and methods and are essential to both discovery and validation. Despite this shared connection with scientific data, few studies have explicitly described how metadata enable reproducible computational research. This review employs a functional content analysis to identify metadata standards that support reproducibility across an analytic stack consisting of input data, tools, notebooks, pipelines, and publications. Our review provides background context, explores gaps, and discovers component trends of embeddedness and methodology weight from which we derive recommendations for future work.
Affiliation(s)
- Jeremy Leipzig
  - Metadata Research Center, College of Computing and Informatics, Drexel University, Philadelphia, PA, USA
- Daniel Nüst
  - Institute for Geoinformatics, University of Münster, Münster, Germany
- Karthik Ram
  - Berkeley Institute for Data Science, University of California, Berkeley, Berkeley, CA, USA
- Jane Greenberg
  - Metadata Research Center, College of Computing and Informatics, Drexel University, Philadelphia, PA, USA
45
Stem cells characterization: OMICS reinforcing analytics. Curr Opin Biotechnol 2021; 71:175-181. [PMID: 34425321 DOI: 10.1016/j.copbio.2021.07.021] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 01/30/2021] [Revised: 07/15/2021] [Accepted: 07/18/2021] [Indexed: 12/20/2022]
Abstract
Stem cells hold outstanding potential to model and treat disease and are valuable tools in pharmacology and toxicology. Characterization of stem cells and their derivatives still poses many challenges to ensuring safe, efficacious, and reliable therapies. Regulatory agencies have defined key mandatory attributes related to identity, purity, sterility, and genomic integrity; however, robust analytics to determine cell potency remain a major challenge, and potency is in most cases assessed case by case. Importantly, the application of high-throughput 'omic tools is opening new perspectives on stem cell research and development. Here, analytical methodologies currently employed to characterize stem cells' quality attributes are discussed, with special focus on 'omics as relevant tools for defining a cell's mechanism of action and for potency assay development and assessment.
46
Linard B, Romashchenko N, Pardi F, Rivals E. PEWO: a collection of workflows to benchmark phylogenetic placement. Bioinformatics 2021; 36:5264-5266. [PMID: 32697844 DOI: 10.1093/bioinformatics/btaa657] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 05/03/2020] [Revised: 07/10/2020] [Accepted: 07/16/2020] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION Phylogenetic placement (PP) is a process of taxonomic identification for which several tools are now available. However, it remains difficult to assess which tool is better suited to particular genomic data or a particular reference taxonomy. We developed Placement Evaluation WOrkflows (PEWO), the first benchmarking tool dedicated to PP assessment. Its automated workflows can evaluate PP at many levels, from parameter optimization for a particular tool to the selection of the most appropriate genetic marker when PP-based species identifications are targeted. Our goal is for PEWO to become a community effort and a standard support for future developments and applications of PP. AVAILABILITY AND IMPLEMENTATION https://github.com/phylo42/PEWO. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Benjamin Linard
  - Computer Science Department, LIRMM, University of Montpellier, CNRS, Montpellier 34095, France
  - SPYGEN, 73370 Le Bourget-du-Lac, France
- Nikolai Romashchenko
  - Computer Science Department, LIRMM, University of Montpellier, CNRS, Montpellier 34095, France
- Fabio Pardi
  - Computer Science Department, LIRMM, University of Montpellier, CNRS, Montpellier 34095, France
- Eric Rivals
  - Computer Science Department, LIRMM, University of Montpellier, CNRS, Montpellier 34095, France
  - Institut Français de Bioinformatique, CNRS UMS 3601, Évry 91057, France
47
Liu X, Liu L, Wang J, Cui H, Zhao G, Wen J. FOSL2 Is Involved in the Regulation of Glycogen Content in Chicken Breast Muscle Tissue. Front Physiol 2021; 12:682441. [PMID: 34295261 PMCID: PMC8290175 DOI: 10.3389/fphys.2021.682441] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 03/18/2021] [Accepted: 05/03/2021] [Indexed: 01/22/2023] Open
Abstract
The glycogen content in the muscle of livestock and poultry affects body homeostasis, growth performance, and meat quality after slaughter. FOS-like 2, AP-1 transcription factor subunit (FOSL2) was identified as a candidate gene related to muscle glycogen (MG) content in chicken in our previous study, but the role of FOSL2 in the regulation of MG content remains to be elucidated. Differential gene expression analysis and weighted gene coexpression network analysis (WGCNA) were performed on differentially expressed genes (DEGs) in breast muscle tissues from the high-MG-content (HMG) group and low-MG-content (LMG) group of Jingxing yellow chickens. Analysis of the 1,171 DEGs (LMG vs. HMG) identified, besides FOSL2, some additional genes related to the MG metabolism pathway, namely PRKAG3, CEBPB, FOXO1, AMPK, and PIK3CB. Additionally, WGCNA revealed that FOSL2, CEBPB, MAP3K14, SLC2A14, PPP2CA, SLC38A2, PPP2R5E, and other genes related to classical glycogen metabolism in the same coexpressed module are associated with MG content. Also, besides finding that FOSL2 expression is negatively correlated with MG content, a possible interaction between FOSL2 and CEBPB was predicted using the STRING (Search Tool for the Retrieval of Interacting Genes) database. Furthermore, we investigated the effects of lentiviral overexpression of FOSL2 on the regulation of glycogen content in vitro, and the results indicated that FOSL2 decreases the glycogen content in DF1 cells. Collectively, our results confirm that FOSL2 has a key role in the regulation of MG content in chicken. This finding helps to clarify the mechanism of MG metabolism regulation in chicken and provides a new perspective for the production of high-quality broilers and the development of a comprehensive nutritional control strategy.
Affiliation(s)
- Xiaojing Liu
  - State Key Laboratory of Animal Nutrition, Institute of Animal Sciences, Chinese Academy of Agricultural Sciences, Beijing, China
- Lu Liu
  - College of Animal Science and Technology, College of Veterinary Medicine of Zhejiang A&F University, Hangzhou, China
- Jie Wang
  - State Key Laboratory of Animal Nutrition, Institute of Animal Sciences, Chinese Academy of Agricultural Sciences, Beijing, China
- Huanxian Cui
  - State Key Laboratory of Animal Nutrition, Institute of Animal Sciences, Chinese Academy of Agricultural Sciences, Beijing, China
- Guiping Zhao
  - State Key Laboratory of Animal Nutrition, Institute of Animal Sciences, Chinese Academy of Agricultural Sciences, Beijing, China
- Jie Wen
  - State Key Laboratory of Animal Nutrition, Institute of Animal Sciences, Chinese Academy of Agricultural Sciences, Beijing, China
48
Knyazev S, Tsyvina V, Shankar A, Melnyk A, Artyomenko A, Malygina T, Porozov YB, Campbell EM, Switzer WM, Skums P, Mangul S, Zelikovsky A. Accurate assembly of minority viral haplotypes from next-generation sequencing through efficient noise reduction. Nucleic Acids Res 2021; 49:e102. [PMID: 34214168 PMCID: PMC8464054 DOI: 10.1093/nar/gkab576] [Citation(s) in RCA: 32] [Impact Index Per Article: 8.0] [Received: 10/09/2020] [Revised: 05/25/2021] [Accepted: 06/18/2021] [Indexed: 12/21/2022] Open
Abstract
Rapidly evolving RNA viruses continuously produce minority haplotypes that can become dominant if they are drug-resistant or can better evade the immune system. Therefore, early detection and identification of minority viral haplotypes may help to promptly adjust the patient’s treatment plan, preventing potential disease complications. Minority haplotypes can be identified using next-generation sequencing, but sequencing noise hinders accurate identification. The elimination of sequencing noise is a non-trivial task that remains open. Here we propose CliqueSNV, which is based on extracting pairs of statistically linked mutations from noisy reads. This effectively reduces sequencing noise and enables the identification of minority haplotypes with frequencies below the sequencing error rate. We comparatively assess the performance of CliqueSNV using an in vitro mixture of nine haplotypes that were derived from the mutation profile of an existing HIV patient. We show that CliqueSNV can accurately assemble viral haplotypes with frequencies as low as 0.1% and maintains consistent performance across short- and long-read sequencing platforms.
Affiliation(s)
- Sergey Knyazev
  - Department of Computer Science, Georgia State University, Atlanta, GA 30302, USA
  - Division of HIV Prevention, Centers for Disease Control and Prevention, Atlanta, GA 30333, USA
  - Oak Ridge Institute for Science and Education, Oak Ridge, TN 37830, USA
- Viachaslau Tsyvina
  - Department of Computer Science, Georgia State University, Atlanta, GA 30302, USA
- Anupama Shankar
  - Division of HIV Prevention, Centers for Disease Control and Prevention, Atlanta, GA 30333, USA
- Andrew Melnyk
  - Department of Computer Science, Georgia State University, Atlanta, GA 30302, USA
- Tatiana Malygina
  - International Scientific and Research Institute of Bioengineering, ITMO University, St. Petersburg 197101, Russia
- Yuri B Porozov
  - World-Class Research Center "Digital biodesign and personalized healthcare", I.M. Sechenov First Moscow State Medical University, Moscow 119991, Russia
  - Department of Computational Biology, Sirius University of Science and Technology, Sochi 354340, Russia
- Ellsworth M Campbell
  - Division of HIV Prevention, Centers for Disease Control and Prevention, Atlanta, GA 30333, USA
- William M Switzer
  - Division of HIV Prevention, Centers for Disease Control and Prevention, Atlanta, GA 30333, USA
- Pavel Skums
  - Department of Computer Science, Georgia State University, Atlanta, GA 30302, USA
- Serghei Mangul
  - Department of Clinical Pharmacy, School of Pharmacy, University of Southern California, Los Angeles, CA 90089, USA
- Alex Zelikovsky
  - Department of Computer Science, Georgia State University, Atlanta, GA 30302, USA
  - World-Class Research Center "Digital biodesign and personalized healthcare", I.M. Sechenov First Moscow State Medical University, Moscow 119991, Russia
49
Riquier S, Bessiere C, Guibert B, Bouge AL, Boureux A, Ruffle F, Audoux J, Gilbert N, Xue H, Gautheret D, Commes T. Kmerator Suite: design of specific k-mer signatures and automatic metadata discovery in large RNA-seq datasets. NAR Genom Bioinform 2021; 3:lqab058. [PMID: 34179780 PMCID: PMC8221386 DOI: 10.1093/nargab/lqab058] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/17/2020] [Revised: 05/10/2021] [Accepted: 06/17/2021] [Indexed: 11/12/2022] Open
Abstract
The huge body of publicly available RNA-sequencing (RNA-seq) libraries is a treasure trove of functional information, allowing quantification of the expression of known or novel transcripts in tissues. However, transcript quantification commonly relies on alignment methods that require substantial computational resources and processing time and do not scale easily to large datasets. K-mer decomposition constitutes a new way to process RNA-seq data for the identification of transcriptional signatures, as k-mers can be used to accurately quantify gene expression in a less resource-consuming way. We present the Kmerator Suite, a set of three tools designed to extract specific k-mer signatures, quantify these k-mers in RNA-seq datasets, and quickly visualize large dataset characteristics. The core tool, Kmerator, produces specific k-mers for 97% of human genes, enabling the measurement of gene expression with high accuracy in simulated datasets. KmerExploR, a direct application of Kmerator, uses a set of predictor gene-specific k-mers to infer metadata, including library protocol, sample features, or contaminations, from RNA-seq datasets. KmerExploR results are visualized through a user-friendly interface. Moreover, we demonstrate that the Kmerator Suite can be used for advanced queries targeting known or new biomarkers such as mutations, gene fusions, or long non-coding RNAs for human health applications.
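The idea of a gene-specific k-mer signature can be illustrated with a small sketch: decompose a target sequence into k-mers and keep only those that do not occur in any background sequence. This is a simplified toy version under stated assumptions, not Kmerator's implementation. The function names and the sequences are hypothetical, and a real tool would also have to handle reverse complements and a full reference genome and transcriptome.

```python
def kmers(seq, k):
    """Return the set of all substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def specific_kmers(target, background, k):
    """Return k-mers of `target` that occur in no background sequence."""
    seen = set()
    for seq in background:
        seen |= kmers(seq, k)
    return kmers(target, k) - seen


if __name__ == "__main__":
    # Toy sequences that differ at a single position (made-up data).
    target_gene = "ACGTTGCA"
    other_genes = ["ACGTAGCA"]
    print(sorted(specific_kmers(target_gene, other_genes, 4)))
```

Counting how often such specific k-mers appear in raw reads then gives an alignment-free estimate of the target's abundance, which is the general principle the abstract describes.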
Affiliation(s)
- Sébastien Riquier
  - IRMB, University of Montpellier, INSERM, 80 rue Augustin Fliche, 34295, Montpellier, France
- Chloé Bessiere
  - IRMB, University of Montpellier, INSERM, 80 rue Augustin Fliche, 34295, Montpellier, France
- Benoit Guibert
  - IRMB, University of Montpellier, INSERM, 80 rue Augustin Fliche, 34295, Montpellier, France
- Anthony Boureux
  - IRMB, University of Montpellier, INSERM, 80 rue Augustin Fliche, 34295, Montpellier, France
- Florence Ruffle
  - IRMB, University of Montpellier, INSERM, 80 rue Augustin Fliche, 34295, Montpellier, France
- Nicolas Gilbert
  - IRMB, University of Montpellier, INSERM, 80 rue Augustin Fliche, 34295, Montpellier, France
- Haoliang Xue
  - Institute for Integrative Biology of the Cell, CEA, CNRS, Université Paris-Saclay, 91198, Gif sur Yvette, France
- Daniel Gautheret
  - Institute for Integrative Biology of the Cell, CEA, CNRS, Université Paris-Saclay, 91198, Gif sur Yvette, France
- Thérèse Commes
  - IRMB, University of Montpellier, INSERM, 80 rue Augustin Fliche, 34295, Montpellier, France
50
Kühl MA, Stich B, Ries DC. Mutation-Simulator: fine-grained simulation of random mutations in any genome. Bioinformatics 2021; 37:568-569. [PMID: 32780803 PMCID: PMC8088320 DOI: 10.1093/bioinformatics/btaa716] [Citation(s) in RCA: 19] [Impact Index Per Article: 4.8] [Received: 03/18/2020] [Revised: 06/12/2020] [Accepted: 08/05/2020] [Indexed: 01/11/2023] Open
Abstract
SUMMARY Mutation-Simulator allows the introduction of various types of sequence alterations in reference sequences, with reasonable compute-time even for large eukaryotic genomes. Its intuitive system for fine-grained control over mutation rates along the sequence enables the mimicking of natural mutation patterns. Using standard file formats for input and output data, it can easily be integrated into any development and benchmarking workflow for high-throughput sequencing applications. AVAILABILITY AND IMPLEMENTATION Mutation-Simulator is written in Python 3 and the source code, documentation, help and use cases are available on the GitHub page at https://github.com/mkpython3/Mutation-Simulator. It is free for use under the GPL 3 license.
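To make the notion of position-dependent mutation rates concrete, the sketch below introduces random single-nucleotide substitutions into a sequence, with a default rate and optional per-region overrides. It is an illustrative toy under stated assumptions and does not reproduce Mutation-Simulator's code, options, or file formats. All names (`simulate_snps`, `rate_at`, the `regions` tuples) are hypothetical.

```python
import random

BASES = "ACGT"


def rate_at(pos, default_rate, regions):
    """Return the substitution rate that applies at position `pos`."""
    for start, end, rate in regions:
        if start <= pos < end:  # half-open interval [start, end)
            return rate
    return default_rate


def simulate_snps(sequence, default_rate, regions=(), seed=None):
    """Substitute bases at random; return (mutated sequence, positions)."""
    rng = random.Random(seed)  # seeded generator for reproducible runs
    out, positions = [], []
    for pos, base in enumerate(sequence):
        if base in BASES and rng.random() < rate_at(pos, default_rate, regions):
            # Pick one of the three other bases so the position truly changes.
            out.append(rng.choice([b for b in BASES if b != base]))
            positions.append(pos)
        else:
            out.append(base)
    return "".join(out), positions


if __name__ == "__main__":
    # Mutate only the first four bases of a toy sequence.
    mutated, where = simulate_snps("AAAAAAAA", 0.0, regions=[(0, 4, 1.0)], seed=1)
    print(mutated, where)
```

Passing the same `seed` reproduces the same mutated sequence, which is convenient when the simulated data feed a benchmarking workflow.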
Affiliation(s)
- M A Kühl
  - Quantitative Genetics and Genomics of Plants, Heinrich Heine University, Düsseldorf 40225, Germany
- B Stich
  - Quantitative Genetics and Genomics of Plants, Heinrich Heine University, Düsseldorf 40225, Germany
- D C Ries
  - Quantitative Genetics and Genomics of Plants, Heinrich Heine University, Düsseldorf 40225, Germany