1. Nuhamunada M, Mohite OS, Phaneuf P, Palsson B, Weber T. BGCFlow: systematic pangenome workflow for the analysis of biosynthetic gene clusters across large genomic datasets. Nucleic Acids Res 2024; 52:5478-5495. [PMID: 38686794; PMCID: PMC11162802; DOI: 10.1093/nar/gkae314] [Received: 06/29/2023] [Revised: 03/22/2024] [Accepted: 04/11/2024] [Indexed: 05/02/2024]
Abstract
Genome mining is revolutionizing natural products discovery efforts. The rapid increase in available genomes demands comprehensive computational platforms to effectively extract biosynthetic knowledge encoded across bacterial pangenomes. Here, we present BGCFlow, a novel systematic workflow integrating analytics for large-scale genome mining of bacterial pangenomes. BGCFlow incorporates several genome analytics and mining tools grouped into five common stages of analysis such as: (i) data selection, (ii) functional annotation, (iii) phylogenetic analysis, (iv) genome mining, and (v) comparative analysis. Furthermore, BGCFlow provides easy configuration of different projects, parallel distribution, scheduled job monitoring, an interactive database to visualize tables, exploratory Jupyter Notebooks, and customized reports. Here, we demonstrate the application of BGCFlow by investigating the phylogenetic distribution of various biosynthetic gene clusters detected across 42 genomes of the Saccharopolyspora genus, known to produce industrially important secondary/specialized metabolites. The BGCFlow-guided analysis predicted more accurate dereplication of BGCs and guided the targeted comparative analysis of selected RiPPs. The scalable, interoperable, adaptable, re-entrant, and reproducible nature of the BGCFlow will provide an effective novel way to extract the biosynthetic knowledge from the ever-growing genomic datasets of biotechnologically relevant bacterial species.
Affiliation(s)
- Matin Nuhamunada
  - The Novo Nordisk Foundation Center for Biosustainability, Technical University of Denmark, Kongens Lyngby 2800, Denmark
- Omkar S Mohite
  - The Novo Nordisk Foundation Center for Biosustainability, Technical University of Denmark, Kongens Lyngby 2800, Denmark
- Patrick V Phaneuf
  - The Novo Nordisk Foundation Center for Biosustainability, Technical University of Denmark, Kongens Lyngby 2800, Denmark
- Bernhard O Palsson
  - The Novo Nordisk Foundation Center for Biosustainability, Technical University of Denmark, Kongens Lyngby 2800, Denmark
  - Department of Bioengineering, University of California San Diego, La Jolla, CA 92093, USA
- Tilmann Weber
  - The Novo Nordisk Foundation Center for Biosustainability, Technical University of Denmark, Kongens Lyngby 2800, Denmark
2. Sheffield NC, LeRoy NJ, Khoroshevskyi O. Challenges to sharing sample metadata in computational genomics. Front Genet 2023; 14:1154198. [PMID: 37287537; PMCID: PMC10243526; DOI: 10.3389/fgene.2023.1154198] [Received: 02/02/2023] [Accepted: 05/09/2023] [Indexed: 06/09/2023]
Affiliation(s)
- Nathan C. Sheffield
  - Center for Public Health Genomics, School of Medicine, University of Virginia, Charlottesville, VA, United States
  - School of Data Science, University of Virginia, Charlottesville, VA, United States
  - Department of Biomedical Engineering, School of Medicine, University of Virginia, Charlottesville, VA, United States
  - Department of Public Health Sciences, School of Medicine, University of Virginia, Charlottesville, VA, United States
  - Department of Biochemistry and Molecular Genetics, School of Medicine, University of Virginia, Charlottesville, VA, United States
- Nathan J. LeRoy
  - Center for Public Health Genomics, School of Medicine, University of Virginia, Charlottesville, VA, United States
- Oleksandr Khoroshevskyi
  - Center for Public Health Genomics, School of Medicine, University of Virginia, Charlottesville, VA, United States
3. Rodrigues DC, Mufteev M, Yuki KE, Narula A, Wei W, Piekna A, Liu J, Pasceri P, Rissland OS, Wilson MD, Ellis J. Buffering of transcription rate by mRNA half-life is a conserved feature of Rett syndrome models. Nat Commun 2023; 14:1896. [PMID: 37019888; PMCID: PMC10076348; DOI: 10.1038/s41467-023-37339-6] [Received: 01/26/2022] [Accepted: 03/13/2023] [Indexed: 04/07/2023]
Abstract
Transcriptional changes in Rett syndrome (RTT) are assumed to directly correlate with steady-state mRNA levels, but limited evidence in mice suggests that changes in transcription can be compensated by post-transcriptional regulation. We measure transcription rate and mRNA half-life changes in RTT patient neurons using RATEseq, and re-interpret nuclear and whole-cell RNAseq from Mecp2 mice. Genes are dysregulated by changing transcription rate or half-life and are buffered when both change. We utilized classifier models to predict the direction of transcription rate changes and find that combined frequencies of three dinucleotides are better predictors than CA and CG. MicroRNA and RNA-binding Protein (RBP) motifs are enriched in 3'UTRs of genes with half-life changes. Nuclear RBP motifs are enriched on buffered genes with increased transcription rate. We identify post-transcriptional mechanisms in humans and mice that alter half-life or buffer transcription rate changes when a transcriptional modulator gene is mutated in a neurodevelopmental disorder.
Affiliation(s)
- Deivid C Rodrigues
  - Developmental & Stem Cell Biology, Hospital for Sick Children, Toronto, ON, M5G 0A4, Canada
- Marat Mufteev
  - Developmental & Stem Cell Biology, Hospital for Sick Children, Toronto, ON, M5G 0A4, Canada
  - Department of Molecular Genetics, University of Toronto, Toronto, ON, M5S 1A8, Canada
- Kyoko E Yuki
  - Genetics & Genome Biology, Hospital for Sick Children, Toronto, ON, M5G 0A4, Canada
- Ashrut Narula
  - Department of Molecular Genetics, University of Toronto, Toronto, ON, M5S 1A8, Canada
  - Molecular Medicine, Hospital for Sick Children, Toronto, ON, M5G 0A4, Canada
- Wei Wei
  - Developmental & Stem Cell Biology, Hospital for Sick Children, Toronto, ON, M5G 0A4, Canada
- Alina Piekna
  - Developmental & Stem Cell Biology, Hospital for Sick Children, Toronto, ON, M5G 0A4, Canada
- Jiajie Liu
  - Developmental & Stem Cell Biology, Hospital for Sick Children, Toronto, ON, M5G 0A4, Canada
- Peter Pasceri
  - Developmental & Stem Cell Biology, Hospital for Sick Children, Toronto, ON, M5G 0A4, Canada
- Olivia S Rissland
  - Molecular Medicine, Hospital for Sick Children, Toronto, ON, M5G 0A4, Canada
  - RNA Bioscience Initiative and Department of Biochemistry & Molecular Genetics, University of Colorado School of Medicine, Aurora, CO, 80045, USA
- Michael D Wilson
  - Department of Molecular Genetics, University of Toronto, Toronto, ON, M5S 1A8, Canada
  - Genetics & Genome Biology, Hospital for Sick Children, Toronto, ON, M5G 0A4, Canada
- James Ellis
  - Developmental & Stem Cell Biology, Hospital for Sick Children, Toronto, ON, M5G 0A4, Canada
  - Department of Molecular Genetics, University of Toronto, Toronto, ON, M5S 1A8, Canada
4. Mufteev M, Rodrigues DC, Yuki KE, Narula A, Wei W, Piekna A, Liu J, Pasceri P, Rissland OS, Wilson MD, Ellis J. Transcriptional buffering and 3'UTR lengthening are shaped during human neurodevelopment by shifts in mRNA stability and microRNA load. bioRxiv 2023:2023.03.01.530249. [PMID: 36909614; PMCID: PMC10002768; DOI: 10.1101/2023.03.01.530249] [Indexed: 03/07/2023]
Abstract
The contribution of mRNA half-life is commonly overlooked when examining changes in mRNA abundance during development. mRNA levels of some genes are regulated by transcription rate only, but others may be regulated by mRNA half-life only shifts. Furthermore, transcriptional buffering is predicted when changes in transcription rates have compensating shifts in mRNA half-life resulting in no change to steady-state levels. Likewise, transcriptional boosting should result when changes in transcription rate are accompanied by amplifying half-life shifts. During neurodevelopment there is widespread 3'UTR lengthening that could be shaped by differential shifts in the stability of existing short or long 3'UTR transcript isoforms. We measured transcription rate and mRNA half-life changes during induced human Pluripotent Stem Cell (iPSC)-derived neuronal development using RATE-seq. During transitions to progenitor and neuron stages, transcriptional buffering occurred in up to 50%, and transcriptional boosting in up to 15%, of genes with changed transcription rates. The remaining changes occurred by transcription rate only or mRNA half-life only shifts. Average mRNA half-life decreased two-fold in neurons relative to iPSCs. Short gene isoforms were more destabilized in neurons and thereby increased the average 3'UTR length. Small RNA sequencing captured an increase in microRNA copy number per cell during neurodevelopment. We propose that mRNA destabilization and 3'UTR lengthening are driven in part by an increase in microRNA load in neurons. Our findings identify mRNA stability mechanisms in human neurodevelopment that regulate gene and isoform level abundance and provide a precedent for similar post-transcriptional regulatory events as other tissues develop.
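The categories named in this abstract can be illustrated with a minimal sketch. It assumes a simple first-order decay model, in which the steady-state mRNA level equals transcription rate × half-life / ln 2, so that log2 fold changes in rate and half-life add up to the log2 fold change in abundance. The function names, the sign-based rule for separating buffering from boosting, and the 0.2 threshold are hypothetical choices for illustration and are not taken from the paper.

```python
import math


def steady_state(rate, half_life):
    """Steady-state mRNA level under first-order decay (assumed model)."""
    return rate * half_life / math.log(2)


def abundance_log2fc(rate_log2fc, half_life_log2fc):
    """Under the assumed model, log2 fold changes simply add."""
    return rate_log2fc + half_life_log2fc


def classify(rate_log2fc, half_life_log2fc, tol=0.2):
    """Label a gene by how rate and half-life changed.

    tol is a hypothetical cutoff below which a change is treated as zero.
    """
    rate_changed = abs(rate_log2fc) > tol
    half_life_changed = abs(half_life_log2fc) > tol
    if rate_changed and half_life_changed:
        # Opposite signs compensate each other; same signs amplify.
        if rate_log2fc * half_life_log2fc < 0:
            return "buffering"
        return "boosting"
    if rate_changed:
        return "transcription rate only"
    if half_life_changed:
        return "half-life only"
    return "unchanged"


# A gene whose transcription rate doubles while its half-life halves
# keeps the same steady-state level.
label = classify(1.0, -1.0)
net_change = abundance_log2fc(1.0, -1.0)
```

In this sketch a doubled transcription rate paired with a halved half-life is labelled as buffering, and its net abundance change is zero. Real analyses would need statistical testing rather than a fixed cutoff.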
Affiliation(s)
- Marat Mufteev
  - Developmental & Stem Cell Biology, Hospital for Sick Children, Toronto, Ontario M5G 0A4, Canada
  - Department of Molecular Genetics, University of Toronto, Toronto, Ontario M5S 1A8, Canada
- Deivid C Rodrigues
  - Developmental & Stem Cell Biology, Hospital for Sick Children, Toronto, Ontario M5G 0A4, Canada
- Kyoko E Yuki
  - Genetics & Genome Biology, Hospital for Sick Children, Toronto, Ontario M5G 0A4, Canada
- Ashrut Narula
  - Department of Molecular Genetics, University of Toronto, Toronto, Ontario M5S 1A8, Canada
  - Molecular Medicine, Hospital for Sick Children, Toronto, Ontario M5G 0A4, Canada
- Wei Wei
  - Developmental & Stem Cell Biology, Hospital for Sick Children, Toronto, Ontario M5G 0A4, Canada
- Alina Piekna
  - Developmental & Stem Cell Biology, Hospital for Sick Children, Toronto, Ontario M5G 0A4, Canada
- Jiajie Liu
  - Developmental & Stem Cell Biology, Hospital for Sick Children, Toronto, Ontario M5G 0A4, Canada
- Peter Pasceri
  - Developmental & Stem Cell Biology, Hospital for Sick Children, Toronto, Ontario M5G 0A4, Canada
- Olivia S Rissland
  - Molecular Medicine, Hospital for Sick Children, Toronto, Ontario M5G 0A4, Canada
  - RNA Bioscience Initiative and Department of Biochemistry & Molecular Genetics, University of Colorado School of Medicine, Aurora, Colorado 80045, USA
- Michael D Wilson
  - Department of Molecular Genetics, University of Toronto, Toronto, Ontario M5S 1A8, Canada
  - Genetics & Genome Biology, Hospital for Sick Children, Toronto, Ontario M5G 0A4, Canada
- James Ellis
  - Developmental & Stem Cell Biology, Hospital for Sick Children, Toronto, Ontario M5G 0A4, Canada
  - Department of Molecular Genetics, University of Toronto, Toronto, Ontario M5S 1A8, Canada
5. Nieminen M, Stolpe O, Kuhring M, Weiner J, Pett P, Beule D, Holtgrewe M. SODAR: managing multiomics study data and metadata. Gigascience 2023; 12:giad052. [PMID: 37498129; PMCID: PMC10373112; DOI: 10.1093/gigascience/giad052] [Received: 07/22/2022] [Revised: 03/30/2023] [Accepted: 06/27/2023] [Indexed: 07/28/2023]
Abstract
Scientists employing omics in life science studies face challenges such as the modeling of multiassay studies, recording of all relevant parameters, and managing many samples with their metadata. They must manage many large files that are the results of the assays or subsequent computation. Users with diverse backgrounds, ranging from computational scientists to wet-lab scientists, have dissimilar needs when it comes to data access, with programmatic interfaces being favored by the former and graphical ones by the latter. We introduce SODAR, the system for omics data access and retrieval. SODAR is a software package that addresses these challenges by providing a web-based graphical user interface for managing multiassay studies and describing them using the ISA (Investigation, Study, Assay) data model and the ISA-Tab file format. Data storage is handled using the iRODS data management system, which handles large quantities of files and substantial amounts of data. SODAR also offers programmable APIs and command-line access for metadata and file storage. SODAR supports complex omics integration studies and can be easily installed. The software is written in Python 3 and freely available at https://github.com/bihealth/sodar-server under the MIT license.
Affiliation(s)
- Mikko Nieminen
  - Berlin Institute of Health at Charité–Universitätsmedizin Berlin, Core Unit Bioinformatics (CUBI), Berlin 10117, Germany
- Oliver Stolpe
  - Berlin Institute of Health at Charité–Universitätsmedizin Berlin, Core Unit Bioinformatics (CUBI), Berlin 10117, Germany
- Mathias Kuhring
  - Berlin Institute of Health at Charité–Universitätsmedizin Berlin, Core Unit Bioinformatics (CUBI), Berlin 10117, Germany
- January Weiner
  - Berlin Institute of Health at Charité–Universitätsmedizin Berlin, Core Unit Bioinformatics (CUBI), Berlin 10117, Germany
- Patrick Pett
  - Berlin Institute of Health at Charité–Universitätsmedizin Berlin, Core Unit Bioinformatics (CUBI), Berlin 10117, Germany
- Dieter Beule
  - Berlin Institute of Health at Charité–Universitätsmedizin Berlin, Core Unit Bioinformatics (CUBI), Berlin 10117, Germany
- Manuel Holtgrewe
  - Berlin Institute of Health at Charité–Universitätsmedizin Berlin, Core Unit Bioinformatics (CUBI), Berlin 10117, Germany
6. Sheffield NC, Stolarczyk M, Reuter VP, Rendeiro AF. Linking big biomedical datasets to modular analysis with Portable Encapsulated Projects. Gigascience 2021; 10:6454632. [PMID: 34890448; PMCID: PMC8673555; DOI: 10.1093/gigascience/giab077] [Received: 11/02/2020] [Revised: 04/20/2021] [Accepted: 11/04/2021] [Indexed: 12/26/2022]
Abstract
Background: Organizing and annotating biological sample data is critical in data-intensive bioinformatics. Unfortunately, metadata formats from a data provider are often incompatible with requirements of a processing tool. There is no broadly accepted standard to organize metadata across biological projects and bioinformatics tools, restricting the portability and reusability of both annotated datasets and analysis software. Results: To address this, we present the Portable Encapsulated Project (PEP) specification, a formal specification for biological sample metadata structure. The PEP specification accommodates typical features of data-intensive bioinformatics projects with many biological samples. In addition to standardization, the PEP specification provides descriptors and modifiers for project-level and sample-level metadata, which improve portability across both computing environments and data processing tools. PEPs include a schema validator framework, allowing formal definition of required metadata attributes for data analysis broadly. We have implemented packages for reading PEPs in both Python and R to provide a language-agnostic interface for organizing project metadata. Conclusions: The PEP specification is an important step toward unifying data annotation and processing tools in data-intensive biological research projects. Links to tools and documentation are available at http://pep.databio.org/.
Affiliation(s)
- Nathan C Sheffield
  - Center for Public Health Genomics, University of Virginia, VA 22908, USA
  - Department of Public Health Sciences, University of Virginia, VA 22908, USA
  - Department of Biomedical Engineering, University of Virginia, VA 22908, USA
  - Department of Biochemistry and Molecular Genetics, University of Virginia, VA 22908, USA
- Michał Stolarczyk
  - Center for Public Health Genomics, University of Virginia, VA 22908, USA
- Vincent P Reuter
  - Center for Public Health Genomics, University of Virginia, VA 22908, USA
  - Genomics and Computational Biology Graduate Group, University of Pennsylvania, PA 19087, USA
- André F Rendeiro
  - Institute for Computational Biomedicine, Weill Cornell Medical College, NY 10021, USA
  - Caryl and Israel Englander Institute for Precision Medicine, Weill Cornell Medical College, NY 10021, USA