1
Thomson AJ, Rehn JA, Heatley SL, Eadie LN, Page EC, Schutz C, McClure BJ, Sutton R, Dalla-Pozza L, Moore AS, Greenwood M, Kotecha RS, Fong CY, Yong ASM, Yeung DT, Breen J, White DL. Reproducible Bioinformatics Analysis Workflows for Detecting IGH Gene Fusions in B-Cell Acute Lymphoblastic Leukaemia Patients. Cancers (Basel) 2023; 15:4731. [PMID: 37835427 PMCID: PMC10571859 DOI: 10.3390/cancers15194731] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/15/2023] [Accepted: 09/22/2023] [Indexed: 10/15/2023] Open
Abstract
B-cell acute lymphoblastic leukaemia (B-ALL) is characterised by diverse genomic alterations, the most frequent being gene fusions detected via transcriptomic analysis (mRNA-seq). Because the Immunoglobulin Heavy Chain (IGH) locus is hypervariable, gene fusions involving it can be difficult to detect with standard gene fusion calling algorithms, and significant computational resources and analysis times are required. We aimed to optimize a gene fusion calling workflow to achieve best-case sensitivity for IGH gene fusion detection. Using Nextflow, we developed a simplified workflow containing the algorithms FusionCatcher, Arriba, and STAR-Fusion. We analysed samples from 35 patients harbouring IGH fusions (IGH::CRLF2 n = 17, IGH::DUX4 n = 15, IGH::EPOR n = 3) and assessed the detection rates for each caller, before optimizing the parameters to enhance sensitivity for IGH fusions. Initial results showed that FusionCatcher and Arriba outperformed STAR-Fusion (85-89% vs. 29% of IGH fusions reported). We found that extensive filtering in STAR-Fusion hindered IGH reporting. By adjusting specific filtering steps (e.g., read support, fusion fragments per million total reads), we achieved a 94% reporting rate for IGH fusions with STAR-Fusion. This analysis highlights the importance of filtering optimization for IGH gene fusion events, offering alternative workflows for difficult-to-detect high-risk B-ALL subtypes.
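The filtering idea described in this abstract can be illustrated with a small sketch. It assumes that fusion fragments per million (FFPM) is the number of supporting fragments (junction reads plus spanning fragments) divided by the total read count and scaled to one million. The record fields, function names, read counts, and thresholds below are hypothetical and are not taken from STAR-Fusion or from the paper's workflow.

```python
from dataclasses import dataclass


@dataclass
class FusionCall:
    """Hypothetical record for one candidate fusion."""
    name: str
    junction_reads: int   # reads crossing the fusion breakpoint
    spanning_frags: int   # read pairs with one mate on each side


def ffpm(call: FusionCall, total_reads: int) -> float:
    """Fusion fragments per million total reads (assumed definition)."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return (call.junction_reads + call.spanning_frags) / total_reads * 1_000_000


def keep_call(call: FusionCall, total_reads: int,
              min_ffpm: float, min_junction_reads: int) -> bool:
    """Apply a read-support filter and an FFPM filter to one call."""
    return (call.junction_reads >= min_junction_reads
            and ffpm(call, total_reads) >= min_ffpm)


# Illustrative numbers only: a weakly supported IGH fusion and a
# strongly supported fusion in a 60-million-read library.
calls = [
    FusionCall("IGH::CRLF2", junction_reads=2, spanning_frags=1),
    FusionCall("GENE_A::GENE_B", junction_reads=40, spanning_frags=25),
]
total_reads = 60_000_000

strict = [c.name for c in calls
          if keep_call(c, total_reads, min_ffpm=0.1, min_junction_reads=1)]
relaxed = [c.name for c in calls
           if keep_call(c, total_reads, min_ffpm=0.01, min_junction_reads=1)]

print("strict :", strict)
print("relaxed:", relaxed)
```

With these made-up values the IGH call has an FFPM of about 0.05, so it is dropped by the stricter threshold and retained by the relaxed one, which mirrors the kind of trade-off between sensitivity and filtering that the abstract reports.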
Affiliation(s)
- Ashlee J. Thomson
  - Faculty of Health and Medical Sciences, University of Adelaide, Adelaide, SA 5005, Australia
  - Blood Cancer Program, Precision Cancer Medicine Theme, South Australian Health & Medical Research Institute (SAHMRI), Adelaide, SA 5000, Australia
- Jacqueline A. Rehn
  - Faculty of Health and Medical Sciences, University of Adelaide, Adelaide, SA 5005, Australia
  - Blood Cancer Program, Precision Cancer Medicine Theme, South Australian Health & Medical Research Institute (SAHMRI), Adelaide, SA 5000, Australia
- Susan L. Heatley
  - Faculty of Health and Medical Sciences, University of Adelaide, Adelaide, SA 5005, Australia
  - Blood Cancer Program, Precision Cancer Medicine Theme, South Australian Health & Medical Research Institute (SAHMRI), Adelaide, SA 5000, Australia
  - Australian and New Zealand Children’s Oncology Group (ANZCHOG), Clayton, VIC 3168, Australia
- Laura N. Eadie
  - Faculty of Health and Medical Sciences, University of Adelaide, Adelaide, SA 5005, Australia
  - Blood Cancer Program, Precision Cancer Medicine Theme, South Australian Health & Medical Research Institute (SAHMRI), Adelaide, SA 5000, Australia
- Elyse C. Page
  - Faculty of Health and Medical Sciences, University of Adelaide, Adelaide, SA 5005, Australia
  - Blood Cancer Program, Precision Cancer Medicine Theme, South Australian Health & Medical Research Institute (SAHMRI), Adelaide, SA 5000, Australia
- Caitlin Schutz
  - Blood Cancer Program, Precision Cancer Medicine Theme, South Australian Health & Medical Research Institute (SAHMRI), Adelaide, SA 5000, Australia
- Barbara J. McClure
  - Faculty of Health and Medical Sciences, University of Adelaide, Adelaide, SA 5005, Australia
  - Blood Cancer Program, Precision Cancer Medicine Theme, South Australian Health & Medical Research Institute (SAHMRI), Adelaide, SA 5000, Australia
- Rosemary Sutton
  - Molecular Diagnostics, Children’s Cancer Institute, Kensington, NSW 2750, Australia
- Luciano Dalla-Pozza
  - The Cancer Centre for Children, The Children’s Hospital at Westmead, Westmead, NSW 2145, Australia
- Andrew S. Moore
  - Oncology Service, Children’s Health Queensland Hospital and Health Service, Brisbane, QLD 4101, Australia
  - Child Health Research Centre, The University of Queensland, Brisbane, QLD 4000, Australia
- Matthew Greenwood
  - Department of Haematology and Transfusion Services, Royal North Shore Hospital, Sydney, NSW 2065, Australia
  - Faculty of Health and Medicine, University of Sydney, Sydney, NSW 2006, Australia
- Rishi S. Kotecha
  - Department of Clinical Haematology, Oncology, Blood and Marrow Transplantation, Perth Children’s Hospital, Perth, WA 6009, Australia
  - Leukaemia Translational Research Laboratory, Telethon Kids Cancer Centre, Telethon Kids Institute, University of Western Australia, Perth, WA 6009, Australia
  - Curtin Medical School, Curtin University, Perth, WA 6845, Australia
- Chun Y. Fong
  - Department of Clinical Haematology, Austin Health, Heidelberg, VIC 3083, Australia
- Agnes S. M. Yong
  - Faculty of Health and Medical Sciences, University of Adelaide, Adelaide, SA 5005, Australia
  - South Australian Health & Medical Research Institute (SAHMRI), Adelaide, SA 5000, Australia
  - Division of Pathology & Laboratory, University of Western Australia Medical School, Perth, WA 6009, Australia
  - Department of Haematology, Royal Perth Hospital, Perth, WA 6000, Australia
- David T. Yeung
  - Faculty of Health and Medical Sciences, University of Adelaide, Adelaide, SA 5005, Australia
  - Blood Cancer Program, Precision Cancer Medicine Theme, South Australian Health & Medical Research Institute (SAHMRI), Adelaide, SA 5000, Australia
  - Haematology Department, Royal Adelaide Hospital and SA Pathology, Adelaide, SA 5000, Australia
- James Breen
  - Black Ochre Data Labs, Indigenous Genomics, Telethon Kids Institute, Adelaide, SA 5000, Australia
  - James Curtin School of Medical Research, Australian National University, Canberra, ACT 2601, Australia
- Deborah L. White
  - Faculty of Health and Medical Sciences, University of Adelaide, Adelaide, SA 5005, Australia
  - Blood Cancer Program, Precision Cancer Medicine Theme, South Australian Health & Medical Research Institute (SAHMRI), Adelaide, SA 5000, Australia
  - Australian and New Zealand Children’s Oncology Group (ANZCHOG), Clayton, VIC 3168, Australia
  - Australian Genomics Health Alliance (AGHA), The Murdoch Children’s Research Institute, Parkville, VIC 3052, Australia
2
Chang J, Stahlke AR, Chudalayandi S, Rosen BD, Childers AK, Severin AJ. polishCLR: A Nextflow Workflow for Polishing PacBio CLR Genome Assemblies. Genome Biol Evol 2023; 15:7040681. [PMID: 36792366 PMCID: PMC9985148 DOI: 10.1093/gbe/evad020] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/23/2022] [Revised: 02/02/2023] [Accepted: 02/08/2023] [Indexed: 02/17/2023] Open
Abstract
Long-read sequencing has revolutionized genome assembly, yielding highly contiguous, chromosome-level contigs. However, assemblies from some third generation long read technologies, such as Pacific Biosciences (PacBio) continuous long reads (CLR), have a high error rate. Such errors can be corrected with short reads through a process called polishing. Although best practices for polishing non-model de novo genome assemblies were recently described by the Vertebrate Genome Project (VGP) Assembly community, there is a need for a publicly available, reproducible workflow that can be easily implemented and run on a conventional high performance computing environment. Here, we describe polishCLR (https://github.com/isugifNF/polishCLR), a reproducible Nextflow workflow that implements best practices for polishing assemblies made from CLR data. PolishCLR can be initiated from several input options that extend best practices to suboptimal cases. It also provides re-entry points throughout several key processes, including identifying duplicate haplotypes in purge_dups, allowing a break for scaffolding if data are available, and throughout multiple rounds of polishing and evaluation with Arrow and FreeBayes. PolishCLR is containerized and publicly available for the greater assembly community as a tool to complete assemblies from existing, error-prone long-read data.
Affiliation(s)
- Jennifer Chang
  - USDA, Agricultural Research Service, Jamie Whitten Delta States Research Center, Genomics and Bioinformatics Research Unit, Stoneville, Mississippi
  - Oak Ridge Institute for Science and Education, Oak Ridge, Tennessee
  - Genome Informatics Facility, Office of Biotechnology, Iowa State University, Ames
- Amanda R Stahlke
  - USDA, Agricultural Research Service, Beltsville Agricultural Research Center, Bee Research Laboratory, Beltsville, Maryland
- Benjamin D Rosen
  - USDA, Agricultural Research Service, Beltsville Agricultural Research Center, Animal Genomics and Improvement Laboratory, Beltsville, Maryland
- Anna K Childers
  - USDA, Agricultural Research Service, Beltsville Agricultural Research Center, Bee Research Laboratory, Beltsville, Maryland
- Andrew J Severin
  - Genome Informatics Facility, Office of Biotechnology, Iowa State University, Ames
3
Big Data in Gastroenterology Research. Int J Mol Sci 2023; 24:ijms24032458. [PMID: 36768780 PMCID: PMC9916510 DOI: 10.3390/ijms24032458] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Received: 12/30/2022] [Revised: 01/18/2023] [Accepted: 01/20/2023] [Indexed: 01/28/2023] Open
Abstract
Studying individual data types in isolation provides only limited and incomplete answers to complex biological questions and particularly falls short in revealing sufficient mechanistic and kinetic details. In contrast, multi-omics approaches to studying health and disease permit the generation and integration of multiple data types on a much larger scale, offering a comprehensive picture of biological and disease processes. Gastroenterology and hepatobiliary research are particularly well-suited to such analyses, given the unique position of the luminal gastrointestinal (GI) tract at the nexus between the gut (mucosa and luminal contents), brain, immune and endocrine systems, and GI microbiome. The generation of 'big data' from multi-omic, multi-site studies can enhance investigations into the connections between these organ systems and organisms and more broadly and accurately appraise the effects of dietary, pharmacological, and other therapeutic interventions. In this review, we describe a variety of useful omics approaches and how they can be integrated to provide a holistic depiction of the human and microbial genetic and proteomic changes underlying physiological and pathophysiological phenomena. We highlight the potential pitfalls and alternatives to help avoid the common errors in study design, execution, and analysis. We focus on the application, integration, and analysis of big data in gastroenterology and hepatobiliary research.
4
Mokou M, Narayanasamy S, Stroggilos R, Balaur IA, Vlahou A, Mischak H, Frantzi M. A Drug Repurposing Pipeline Based on Bladder Cancer Integrated Proteotranscriptomics Signatures. Methods Mol Biol 2023; 2684:59-99. [PMID: 37410228 DOI: 10.1007/978-1-0716-3291-8_4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/07/2023]
Abstract
Delivering better care for patients with bladder cancer (BC) necessitates the development of novel therapeutic strategies that address both the high disease heterogeneity and the limitations of the current therapeutic modalities, such as low drug efficacy and acquired resistance in patients. Drug repurposing is a cost-effective strategy that targets the reuse of existing drugs for new therapeutic purposes. Such a strategy could open new avenues toward more effective BC treatment. BC patients' multi-omics signatures can be used to guide the investigation of existing drugs that show an effective therapeutic potential through drug repurposing. In this book chapter, we present an integrated multilayer approach that includes cross-omics analyses from publicly available transcriptomics and proteomics data derived from BC tissues and cell lines that were investigated for the development of disease-specific signatures. These signatures are subsequently used as input for a signature-based repurposing approach using the Connectivity Map (CMap) tool. We further explain the steps that may be followed to identify and select existing drugs of increased potential for repurposing in BC patients.
Affiliation(s)
- Marika Mokou
  - Department of Biomarker Research, Mosaiques Diagnostics, Hannover, Germany
- Shaman Narayanasamy
  - Luxembourg Centre for Systems Biomedicine (LCSB), University of Luxembourg, Esch-sur-Alzette, Luxembourg
- Rafael Stroggilos
  - Systems Biology Center, Biomedical Research Foundation, Academy of Athens, Athens, Greece
- Irina-Afrodita Balaur
  - Luxembourg Centre for Systems Biomedicine (LCSB), University of Luxembourg, Esch-sur-Alzette, Luxembourg
- Antonia Vlahou
  - Systems Biology Center, Biomedical Research Foundation, Academy of Athens, Athens, Greece
- Harald Mischak
  - Department of Biomarker Research, Mosaiques Diagnostics, Hannover, Germany
  - Institute of Cardiovascular and Medical Sciences, University of Glasgow, Glasgow, UK
- Maria Frantzi
  - Department of Biomarker Research, Mosaiques Diagnostics, Hannover, Germany
5
Salazar VW, Shaban B, Quiroga MDM, Turnbull R, Tescari E, Rossetto Marcelino V, Verbruggen H, Lê Cao KA. Metaphor-A workflow for streamlined assembly and binning of metagenomes. Gigascience 2022; 12:giad055. [PMID: 37522759 PMCID: PMC10388702 DOI: 10.1093/gigascience/giad055] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/20/2023] [Revised: 06/05/2023] [Accepted: 07/04/2023] [Indexed: 08/01/2023] Open
Abstract
Recent advances in bioinformatics and high-throughput sequencing have enabled the large-scale recovery of genomes from metagenomes. This has the potential to bring important insights as researchers can bypass cultivation and analyze genomes sourced directly from environmental samples. There are, however, technical challenges associated with this process, most notably the complexity of computational workflows required to process metagenomic data, which include dozens of bioinformatics software tools, each with its own set of customizable parameters that affect the final output of the workflow. At the core of these workflows are the processes of assembly (combining the short input reads into longer, contiguous fragments, or contigs) and binning (clustering these contigs into individual genome bins). The limitations of assembly and binning algorithms also pose different challenges depending on the selected strategy to execute them. Both of these processes can be done for each sample separately or by pooling together multiple samples to leverage information from a combination of samples. Here we present Metaphor, a fully automated workflow for genome-resolved metagenomics (GRM). Metaphor differs from existing GRM workflows by offering flexible approaches for the assembly and binning of the input data and by combining multiple binning algorithms with a bin refinement step to achieve high-quality genome bins. Moreover, Metaphor generates reports to evaluate the performance of the workflow. We showcase the functionality of Metaphor on different synthetic datasets and the impact of available assembly and binning strategies on the final results.
Affiliation(s)
- Vinícius W Salazar
  - Melbourne Integrative Genomics, School of Mathematics & Statistics, University of Melbourne, Parkville, VIC 3052, Victoria, Australia
- Babak Shaban
  - Melbourne Data Analytics Platform (MDAP), University of Melbourne, Carlton, VIC 3053, Victoria, Australia
- Maria del Mar Quiroga
  - Melbourne Data Analytics Platform (MDAP), University of Melbourne, Carlton, VIC 3053, Victoria, Australia
- Robert Turnbull
  - Melbourne Data Analytics Platform (MDAP), University of Melbourne, Carlton, VIC 3053, Victoria, Australia
- Edoardo Tescari
  - Melbourne Data Analytics Platform (MDAP), University of Melbourne, Carlton, VIC 3053, Victoria, Australia
- Vanessa Rossetto Marcelino
  - Department of Molecular and Translational Sciences, Monash University, Clayton, VIC 3168, Victoria, Australia
  - Centre for Innate Immunity and Infectious Diseases, Hudson Institute of Medical Research, Clayton, VIC 3168, Victoria, Australia
  - School of BioSciences, University of Melbourne, Parkville, VIC 3052, Victoria, Australia
  - Department of Microbiology and Immunology, The University of Melbourne at the Peter Doherty Institute for Infection and Immunity, Parkville, VIC 3052, Victoria, Australia
- Heroen Verbruggen
  - School of BioSciences, University of Melbourne, Parkville, VIC 3052, Victoria, Australia
- Kim-Anh Lê Cao
  - Melbourne Integrative Genomics, School of Mathematics & Statistics, University of Melbourne, Parkville, VIC 3052, Victoria, Australia
6
Cope AL, Anderson F, Favate J, Jackson M, Mok A, Kurowska A, Liu J, MacKenzie E, Shivakumar V, Tilton P, Winterbourne SM, Xue S, Kavoussanakis K, Lareau LF, Shah P, Wallace EWJ. riboviz 2: a flexible and robust ribosome profiling data analysis and visualization workflow. Bioinformatics 2022; 38:2358-2360. [PMID: 35157051 PMCID: PMC9004635 DOI: 10.1093/bioinformatics/btac093] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 05/19/2021] [Revised: 09/28/2021] [Accepted: 02/09/2022] [Indexed: 02/04/2023] Open
Abstract
MOTIVATION: Ribosome profiling, or Ribo-seq, is the state-of-the-art method for quantifying protein synthesis in living cells. Computational analysis of Ribo-seq data remains challenging due to the complexity of the procedure, as well as variations introduced for specific organisms or specialized analyses. RESULTS: We present riboviz 2, an updated riboviz package, for the comprehensive transcript-centric analysis and visualization of Ribo-seq data. riboviz 2 includes an analysis workflow built on the Nextflow workflow management system for end-to-end processing of Ribo-seq data. riboviz 2 has been extensively tested on diverse species and library preparation strategies, including multiplexed samples. riboviz 2 is flexible and uses open, documented file formats, allowing users to integrate new analyses with the pipeline. AVAILABILITY AND IMPLEMENTATION: riboviz 2 is freely available at github.com/riboviz/riboviz.
Affiliation(s)
- Alexander L Cope
  - Department of Genetics, Rutgers University, Piscataway, NJ 08854-8082, USA
- Felicity Anderson
  - Institute for Cell Biology and SynthSys, School of Biological Sciences, The University of Edinburgh, Edinburgh EH9 3BF, UK
- John Favate
  - Department of Genetics, Rutgers University, Piscataway, NJ 08854-8082, USA
- Amanda Mok
  - Center for Computational Biology, University of California, Berkeley, CA 94720, USA
- Anna Kurowska
  - Institute for Cell Biology and SynthSys, School of Biological Sciences, The University of Edinburgh, Edinburgh EH9 3BF, UK
- Junchen Liu
  - EPCC, The University of Edinburgh, Edinburgh EH8 9BT, UK
- Emma MacKenzie
  - Institute for Cell Biology and SynthSys, School of Biological Sciences, The University of Edinburgh, Edinburgh EH9 3BF, UK
- Vikram Shivakumar
  - Department of Bioengineering, University of California, Berkeley, CA 94720, USA
- Peter Tilton
  - Department of Genetics, Rutgers University, Piscataway, NJ 08854-8082, USA
- Sophie M Winterbourne
  - Institute for Cell Biology and SynthSys, School of Biological Sciences, The University of Edinburgh, Edinburgh EH9 3BF, UK
- Siyin Xue
  - Institute for Cell Biology and SynthSys, School of Biological Sciences, The University of Edinburgh, Edinburgh EH9 3BF, UK
- Liana F Lareau
  - Center for Computational Biology, University of California, Berkeley, CA 94720, USA
  - Department of Bioengineering, University of California, Berkeley, CA 94720, USA
- Premal Shah
  - Department of Genetics, Rutgers University, Piscataway, NJ 08854-8082, USA
- Edward W J Wallace
  - Institute for Cell Biology and SynthSys, School of Biological Sciences, The University of Edinburgh, Edinburgh EH9 3BF, UK
7
Allain F, Roméjon J, La Rosa P, Jarlier F, Servant N, Hupé P. Geniac: Automatic Configuration GENerator and Installer for nextflow pipelines. Open Research Europe 2022; 1:76. [PMID: 37645091 PMCID: PMC10445886 DOI: 10.12688/openreseurope.13861.2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 02/11/2022] [Indexed: 08/31/2023]
Abstract
With the advent of high-throughput biotechnological platforms and their ever-growing capacity, life science has turned into a digitized, computational and data-intensive discipline. As a consequence, standard analysis with a bioinformatics pipeline in the context of routine production has become a challenge, since the data must be processed in real time and delivered to the end-users as fast as possible. The usage of workflow management systems along with packaging systems and containerization technologies offers an opportunity to tackle this challenge. While very powerful, they can be used and combined in many ways, which may differ from one developer to another. Therefore, promoting the homogeneity of the workflow implementation requires guidelines and protocols which detail how the source code of the bioinformatics pipeline should be written and organized to ensure its usability, maintainability, interoperability, sustainability, portability, reproducibility, scalability and efficiency. Capitalizing on Nextflow, Conda, Docker, Singularity and the nf-core initiative, we propose a set of best practices along the development life cycle of the bioinformatics pipeline and deployment for production operations which target different expert communities including i) the bioinformaticians and statisticians, ii) the software engineers, and iii) the data managers and core facility engineers. We implemented Geniac (Automatic Configuration GENerator and Installer for nextflow pipelines), which consists of a toolbox with three components: i) technical documentation available at https://geniac.readthedocs.io to detail coding guidelines for the bioinformatics pipeline with Nextflow, ii) a command line interface with a linter to check that the code respects the guidelines, and iii) an add-on to generate configuration files, build the containers and deploy the pipeline.
The Geniac toolbox aims at the harmonization of development practices across developers and automation of the generation of configuration files and containers by parsing the source code of the Nextflow pipeline.
Affiliation(s)
- Fabrice Allain
  - Mines Paris Tech, Fontainebleau, F-77305, France
  - Institut Curie, Paris, F-75005, France
  - U900, Inserm, Paris, F-75005, France
  - PSL Research University, Paris, F-75005, France
- Julien Roméjon
  - Mines Paris Tech, Fontainebleau, F-77305, France
  - Institut Curie, Paris, F-75005, France
  - U900, Inserm, Paris, F-75005, France
  - PSL Research University, Paris, F-75005, France
- Philippe La Rosa
  - Mines Paris Tech, Fontainebleau, F-77305, France
  - Institut Curie, Paris, F-75005, France
  - U900, Inserm, Paris, F-75005, France
  - PSL Research University, Paris, F-75005, France
- Frédéric Jarlier
  - Mines Paris Tech, Fontainebleau, F-77305, France
  - Institut Curie, Paris, F-75005, France
  - U900, Inserm, Paris, F-75005, France
  - PSL Research University, Paris, F-75005, France
- Nicolas Servant
  - Mines Paris Tech, Fontainebleau, F-77305, France
  - Institut Curie, Paris, F-75005, France
  - U900, Inserm, Paris, F-75005, France
  - PSL Research University, Paris, F-75005, France
- Philippe Hupé
  - Mines Paris Tech, Fontainebleau, F-77305, France
  - Institut Curie, Paris, F-75005, France
  - U900, Inserm, Paris, F-75005, France
  - PSL Research University, Paris, F-75005, France
  - UMR144, CNRS, Paris, F-75005, France
8
Raghavan V, Kraft L, Mesny F, Rigerte L. A simple guide to de novo transcriptome assembly and annotation. Brief Bioinform 2022; 23:6514404. [PMID: 35076693 PMCID: PMC8921630 DOI: 10.1093/bib/bbab563] [Citation(s) in RCA: 22] [Impact Index Per Article: 11.0] [Received: 09/09/2021] [Revised: 12/03/2021] [Accepted: 12/09/2021] [Indexed: 12/13/2022] Open
Abstract
A transcriptome constructed from short-read RNA sequencing (RNA-seq) is an easily attainable proxy catalog of protein-coding genes when genome assembly is unnecessary, expensive or difficult. In the absence of a sequenced genome to guide the reconstruction process, the transcriptome must be assembled de novo using only the information available in the RNA-seq reads. Subsequently, the sequences must be annotated in order to identify sequence-intrinsic and evolutionary features in them (for example, protein-coding regions). Although straightforward at first glance, de novo transcriptome assembly and annotation can quickly prove to be challenging undertakings. In addition to familiarizing themselves with the conceptual and technical intricacies of the tasks at hand and the numerous pre- and post-processing steps involved, those interested must also grapple with an overwhelmingly large choice of tools. The lack of standardized workflows, fast pace of development of new tools and techniques and paucity of authoritative literature have served to exacerbate the difficulty of the task even further. Here, we present a comprehensive overview of de novo transcriptome assembly and annotation. We discuss the procedures involved, including pre- and post-processing steps, and present a compendium of corresponding tools.
Affiliation(s)
- Venket Raghavan (corresponding author)
  - Quantitative and Computational Biology, Max Planck Institute for Biophysical Chemistry, 37077 Göttingen, Germany
- Louis Kraft (corresponding author)
  - Quantitative and Computational Biology, Max Planck Institute for Biophysical Chemistry, 37077 Göttingen, Germany
9
Morandin C, Brendel VP. Tools and applications for integrative analysis of DNA methylation in social insects. Mol Ecol Resour 2021; 22:1656-1674. [PMID: 34861105 DOI: 10.1111/1755-0998.13566] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 09/22/2021] [Revised: 11/18/2021] [Accepted: 11/23/2021] [Indexed: 12/15/2022]
Abstract
DNA methylation is a common epigenetic signalling tool and an important biological process which is widely studied in a large array of species. The presence, level and function of DNA methylation vary greatly across species. In some insects, DNA methylation systems are minimal, and overall methylation rates tend to be low in all studied insect species. Low methylation levels probed by whole-genome bisulphite sequencing require great care with respect to data quality control and interpretation. Here, we introduce BWASP/R, a complete workflow that allows efficient, scalable and entirely reproducible analyses of raw DNA methylation sequencing data. Consistent application of quality control filters and analysis parameters provides fair comparisons among different studies and an integrated view of all experiments on one species. We describe the capabilities of the BWASP/R workflow by re-analysing several publicly available social insect WGBS data sets, comprising 70 samples and cumulatively 147 replicates from four different species. We show that the CpG methylome comprises only about 1.5% of CpG sites in the honeybee genome and that the cumulative data are consistent with genetic signatures of site accessibility and physiological control of methylation levels.
Affiliation(s)
- Claire Morandin
  - Department of Ecology and Evolution, Biophore, University of Lausanne, Lausanne, Switzerland
- Volker P Brendel
  - Departments of Biology and Computer Science, Indiana University, Bloomington, Indiana, USA
10
Design considerations for workflow management systems use in production genomics research and the clinic. Sci Rep 2021; 11:21680. [PMID: 34737383 PMCID: PMC8569008 DOI: 10.1038/s41598-021-99288-8] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 05/13/2021] [Accepted: 09/15/2021] [Indexed: 01/22/2023] Open
Abstract
The changing landscape of genomics research and clinical practice has created a need for computational pipelines capable of efficiently orchestrating complex analysis stages while handling large volumes of data across heterogeneous computational environments. Workflow Management Systems (WfMSs) are the software components employed to fill this gap. This work provides an approach and systematic evaluation of key features of popular bioinformatics WfMSs in use today: Nextflow, CWL, and WDL and some of their executors, along with Swift/T, a workflow manager commonly used in high-scale physics applications. We employed two use cases: a variant-calling genomic pipeline and a scalability-testing framework, where both were run locally, on an HPC cluster, and in the cloud. This allowed for evaluation of those four WfMSs in terms of language expressiveness, modularity, scalability, robustness, reproducibility, interoperability, ease of development, along with adoption and usage in research labs and healthcare settings. This article tries to answer the question: which WfMS should be chosen for a given bioinformatics application, regardless of analysis type? The choice of a given WfMS is a function of both its intrinsic language and engine features. Within bioinformatics, where analysts are a mix of dry and wet lab scientists, the choice is also governed by collaborations and adoption within large consortia and technical support provided by the WfMS team/community. As the community and its needs continue to evolve along with computational infrastructure, WfMSs will also evolve, especially those with permissive licenses that allow commercial use. In much the same way as the dataflow paradigm and containerization are now well understood to be very useful in bioinformatics applications, we will continue to see innovations of tools and utilities for other purposes, like big data technologies, interoperability, and provenance.
|
11
|
Wratten L, Wilm A, Göke J. Reproducible, scalable, and shareable analysis pipelines with bioinformatics workflow managers. Nat Methods 2021; 18:1161-1168. [PMID: 34556866 DOI: 10.1038/s41592-021-01254-9] [Citation(s) in RCA: 39] [Impact Index Per Article: 13.0] [Received: 10/20/2020] [Accepted: 07/29/2021] [Indexed: 02/08/2023]
Abstract
The rapid growth of high-throughput technologies has transformed biomedical research. With the increasing amount and complexity of data, scalability and reproducibility have become essential not just for experiments, but also for computational analysis. However, transforming data into information involves running a large number of tools, optimizing parameters, and integrating dynamically changing reference data. Workflow managers were developed in response to such challenges. They simplify pipeline development, optimize resource usage, handle software installation and versions, and run on different compute platforms, enabling workflow portability and sharing. In this Perspective, we highlight key features of workflow managers, compare commonly used approaches for bioinformatics workflows, and provide a guide for computational and noncomputational users. We outline community-curated pipeline initiatives that enable novice and experienced users to perform complex, best-practice analyses without having to manually assemble workflows. In sum, we illustrate how workflow managers contribute to making computational analysis in biomedical research shareable, scalable, and reproducible.
Affiliation(s)
- Jonathan Göke
  - Genome Institute of Singapore, Singapore, Singapore.
|
12
|
Singh U, Li J, Seetharam A, Wurtele ES. pyrpipe: a Python package for RNA-Seq workflows. NAR Genom Bioinform 2021; 3:lqab049. [PMID: 34085037 PMCID: PMC8168212 DOI: 10.1093/nargab/lqab049] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 01/27/2021] [Revised: 05/06/2021] [Accepted: 05/18/2021] [Indexed: 02/06/2023] Open
Abstract
The availability of terabytes of RNA-Seq data and the continuous emergence of new analysis tools enable unprecedented biological insight. There is a pressing need for a framework that allows fast, efficient, manageable, and reproducible RNA-Seq analysis. We have developed a Python package, pyrpipe, that enables straightforward development of flexible, reproducible, and easy-to-debug computational pipelines purely in Python, in an object-oriented manner. pyrpipe provides access to popular RNA-Seq tools from within Python via high-level APIs. Pipelines can be customized by integrating new Python code, third-party programs, or Python libraries. Users can create checkpoints in the pipeline or integrate pyrpipe into a workflow management system, allowing execution in multiple computing environments and enabling efficient resource management. pyrpipe produces detailed analysis and benchmark reports, which can be shared or included in publications. pyrpipe is implemented in Python and is compatible with Python versions 3.6 and higher. To illustrate the rich functionality of pyrpipe, we provide case studies using RNA-Seq data from GTEx, SARS-CoV-2-infected human cells, and Zea mays. All source code is freely available at https://github.com/urmi-21/pyrpipe; the package can be installed from source, from PyPI (https://pypi.org/project/pyrpipe), or from bioconda (https://anaconda.org/bioconda/pyrpipe). Documentation is available at http://pyrpipe.rtfd.io.
Affiliation(s)
- Urminder Singh
  - Bioinformatics and Computational Biology Program, Iowa State University, Ames, IA 50014, USA
  - Center for Metabolic Biology, Iowa State University, Ames, IA 50014, USA
  - Department of Genetics Development and Cell Biology, Iowa State University, Ames, IA 50014, USA
- Jing Li
  - Center for Metabolic Biology, Iowa State University, Ames, IA 50014, USA
  - Department of Genetics Development and Cell Biology, Iowa State University, Ames, IA 50014, USA
- Arun Seetharam
  - Genome Informatics Facility, Iowa State University, Ames, IA 50014, USA
- Eve Syrkin Wurtele
  - Bioinformatics and Computational Biology Program, Iowa State University, Ames, IA 50014, USA
  - Center for Metabolic Biology, Iowa State University, Ames, IA 50014, USA
  - Department of Genetics Development and Cell Biology, Iowa State University, Ames, IA 50014, USA
|