1
|
Zhu K, Schäffer AA, Robinson W, Xu J, Ruppin E, Ergun AF, Ye Y, Sahinalp SC. Strain level microbial detection and quantification with applications to single cell metagenomics. Nat Commun 2022; 13:6430. [PMID: 36307411 PMCID: PMC9616933 DOI: 10.1038/s41467-022-33869-7] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/23/2021] [Accepted: 10/04/2022] [Indexed: 12/25/2022] Open
Abstract
Computational identification and quantification of distinct microbes from high throughput sequencing data is crucial for our understanding of human health. Existing methods either use accurate but computationally expensive alignment-based approaches or less accurate but computationally fast alignment-free approaches, which often fail to correctly assign reads to genomes. Here we introduce CAMMiQ, a combinatorial optimization framework to identify and quantify distinct genomes (specified by a database) in a metagenomic dataset. As a key methodological innovation, CAMMiQ uses substrings of variable length and those that appear in two genomes in the database, as opposed to the commonly used fixed-length, unique substrings. These substrings allow to accurately decouple mixtures of highly similar genomes resulting in higher accuracy than the leading alternatives, without requiring additional computational resources, as demonstrated on commonly used benchmarking datasets. Importantly, we show that CAMMiQ can distinguish closely related bacterial strains in simulated metagenomic and real single-cell metatranscriptomic data.
Collapse
Affiliation(s)
- Kaiyuan Zhu
- Cancer Data Science Laboratory, National Cancer Institute, National Institutes of Health, Bethesda, MD, USA
- Department of Computer Science & Engineering, UC San Diego, La Jolla, CA, USA
- Department of Computer Science, Indiana University, Bloomington, IN, USA
| | - Alejandro A Schäffer
- Cancer Data Science Laboratory, National Cancer Institute, National Institutes of Health, Bethesda, MD, USA
| | - Welles Robinson
- Cancer Data Science Laboratory, National Cancer Institute, National Institutes of Health, Bethesda, MD, USA
- Surgery Branch, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Bethesda, MD, USA
| | - Junyan Xu
- Cancer Data Science Laboratory, National Cancer Institute, National Institutes of Health, Bethesda, MD, USA
| | - Eytan Ruppin
- Cancer Data Science Laboratory, National Cancer Institute, National Institutes of Health, Bethesda, MD, USA
| | - A Funda Ergun
- Department of Computer Science, Indiana University, Bloomington, IN, USA
| | - Yuzhen Ye
- Department of Computer Science, Indiana University, Bloomington, IN, USA
| | - S Cenk Sahinalp
- Cancer Data Science Laboratory, National Cancer Institute, National Institutes of Health, Bethesda, MD, USA.
- Department of Computer Science, Indiana University, Bloomington, IN, USA.
| |
Collapse
|
2
|
Sarkar H, Srivastava A, Bravo HC, Love MI, Patro R. Terminus enables the discovery of data-driven, robust transcript groups from RNA-seq data. Bioinformatics 2021; 36:i102-i110. [PMID: 32657377 PMCID: PMC7355257 DOI: 10.1093/bioinformatics/btaa448] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION Advances in sequencing technology, inference algorithms and differential testing methodology have enabled transcript-level analysis of RNA-seq data. Yet, the inherent inferential uncertainty in transcript-level abundance estimation, even among the most accurate approaches, means that robust transcript-level analysis often remains a challenge. Conversely, gene-level analysis remains a common and robust approach for understanding RNA-seq data, but it coarsens the resulting analysis to the level of genes, even if the data strongly support specific transcript-level effects. RESULTS We introduce a new data-driven approach for grouping together transcripts in an experiment based on their inferential uncertainty. Transcripts that share large numbers of ambiguously-mapping fragments with other transcripts, in complex patterns, often cannot have their abundances confidently estimated. Yet, the total transcriptional output of that group of transcripts will have greatly reduced inferential uncertainty, thus allowing more robust and confident downstream analysis. Our approach, implemented in the tool terminus, groups together transcripts in a data-driven manner allowing transcript-level analysis where it can be confidently supported, and deriving transcriptional groups where the inferential uncertainty is too high to support a transcript-level result. AVAILABILITY AND IMPLEMENTATION Terminus is implemented in Rust, and is freely available and open source. It can be obtained from https://github.com/COMBINE-lab/Terminus. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Hirak Sarkar
- Department of Computer Science, University of Maryland, College Park, MD 20742, USA.,Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD 20742, USA
| | - Avi Srivastava
- Department of Computer Science, Stony Brook University, Stony Brook, NY 11794, USA
| | - Héctor Corrada Bravo
- Department of Computer Science, University of Maryland, College Park, MD 20742, USA.,Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD 20742, USA
| | - Michael I Love
- Department of Biostatistics, University of North Carolina-Chapel Hill, Chapel Hill, NC 27516, USA.,Department of Genetics, University of North Carolina-Chapel Hill, Chapel Hill, NC 27514, USA
| | - Rob Patro
- Department of Computer Science, University of Maryland, College Park, MD 20742, USA.,Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD 20742, USA
| |
Collapse
|
3
|
Berger E, Yorukoglu D, Zhang L, Nyquist SK, Shalek AK, Kellis M, Numanagić I, Berger B. Improved haplotype inference by exploiting long-range linking and allelic imbalance in RNA-seq datasets. Nat Commun 2020; 11:4662. [PMID: 32938926 PMCID: PMC7494856 DOI: 10.1038/s41467-020-18320-z] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/15/2019] [Accepted: 08/07/2020] [Indexed: 01/04/2023] Open
Abstract
Haplotype reconstruction of distant genetic variants remains an unsolved problem due to the short-read length of common sequencing data. Here, we introduce HapTree-X, a probabilistic framework that utilizes latent long-range information to reconstruct unspecified haplotypes in diploid and polyploid organisms. It introduces the observation that differential allele-specific expression can link genetic variants from the same physical chromosome, thus even enabling using reads that cover only individual variants. We demonstrate HapTree-X's feasibility on in-house sequenced Genome in a Bottle RNA-seq and various whole exome, genome, and 10X Genomics datasets. HapTree-X produces more complete phases (up to 25%), even in clinically important genes, and phases more variants than other methods while maintaining similar or higher accuracy and being up to 10× faster than other tools. The advantage of HapTree-X's ability to use multiple lines of evidence, as well as to phase polyploid genomes in a single integrative framework, substantially grows as the amount of diverse data increases.
Collapse
Affiliation(s)
- Emily Berger
- Computer Science & Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, 02139, USA
- Department of Mathematics, Massachusetts Institute of Technology, Cambridge, MA, 02139, USA
- Department of Mathematics, UC Berkeley, Berkeley, CA, 94720, USA
| | - Deniz Yorukoglu
- Computer Science & Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, 02139, USA
| | - Lillian Zhang
- Computer Science & Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, 02139, USA
| | - Sarah K Nyquist
- Computer Science & Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, 02139, USA
| | - Alex K Shalek
- Computer Science & Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, 02139, USA
| | - Manolis Kellis
- Computer Science & Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, 02139, USA
| | - Ibrahim Numanagić
- Computer Science & Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, 02139, USA.
- Department of Mathematics, Massachusetts Institute of Technology, Cambridge, MA, 02139, USA.
- Department of Computer Science, University of Victoria, Victoria, BC, V8P 5C2, Canada.
| | - Bonnie Berger
- Computer Science & Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, 02139, USA.
- Department of Mathematics, Massachusetts Institute of Technology, Cambridge, MA, 02139, USA.
| |
Collapse
|
4
|
Chowdhury HA, Bhattacharyya DK, Kalita JK. Differential Expression Analysis of RNA-seq Reads: Overview, Taxonomy, and Tools. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2020; 17:566-586. [PMID: 30281477 DOI: 10.1109/tcbb.2018.2873010] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/08/2023]
Abstract
Analysis of RNA-sequence (RNA-seq) data is widely used in transcriptomic studies and it has many applications. We review RNA-seq data analysis from RNA-seq reads to the results of differential expression analysis. In addition, we perform a descriptive comparison of tools used in each step of RNA-seq data analysis along with a discussion of important characteristics of these tools. A taxonomy of tools is also provided. A discussion of issues in quality control and visualization of RNA-seq data is also included along with useful tools. Finally, we provide some guidelines for the RNA-seq data analyst, along with research issues and challenges which should be addressed.
Collapse
|
5
|
Labadorf A, Choi SH, Myers RH. Evidence for a Pan-Neurodegenerative Disease Response in Huntington's and Parkinson's Disease Expression Profiles. Front Mol Neurosci 2018; 10:430. [PMID: 29375298 PMCID: PMC5768647 DOI: 10.3389/fnmol.2017.00430] [Citation(s) in RCA: 24] [Impact Index Per Article: 3.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/14/2017] [Accepted: 12/12/2017] [Indexed: 12/17/2022] Open
Abstract
Huntington's and Parkinson's Diseases (HD and PD) are neurodegenerative disorders that share some pathological features but are disparate in others. For example, while both diseases are marked by aberrant protein aggregation in the brain, the specific proteins that aggregate and types of neurons affected differ. A better understanding of the molecular similarities and differences between these two diseases may lead to a more complete mechanistic picture of both the individual diseases and the neurodegenerative process in general. We sought to characterize the common transcriptional signature of HD and PD as well as genes uniquely implicated in each of these diseases using mRNA-Seq data from post mortem human brains in comparison to neuropathologically normal controls. The enriched biological pathways implicated by HD differentially expressed genes show remarkable consistency with those for PD differentially expressed genes and implicate the common biological processes of neuroinflammation, apoptosis, transcriptional dysregulation, and neuron-associated functions. Comparison of the differentially expressed (DE) genes highlights a set of consistently altered genes that span both diseases. In particular, processes involving nuclear factor kappa-light-chain-enhancer of activated B cells (NFkB) and transcription factor cAMP response element-binding protein (CREB) are the most prominent among the genes common to HD and PD. When the combined HD and PD data are compared to controls, relatively few additional biological processes emerge as significantly enriched, suggesting that most pathways are independently seen within each disorder. Despite showing comparable numbers of DE genes, DE genes unique to HD are enriched in far more coherent biological processes than the DE genes unique to PD, suggesting that PD may represent a more heterogeneous disorder. The complexity of the biological processes implicated by this analysis provides impetus for the development of better experimental models to validate the results.
Collapse
Affiliation(s)
- Adam Labadorf
- Bioinformatics Program, Boston University, Boston, MA, United States.,Department of Neurology, Boston University, Boston, MA, United States
| | - Seung H Choi
- Biostatistics, Boston University School of Public Health, Boston, MA, United States
| | - Richard H Myers
- Bioinformatics Program, Boston University, Boston, MA, United States.,Department of Neurology, Boston University, Boston, MA, United States.,Biostatistics, Boston University School of Public Health, Boston, MA, United States
| |
Collapse
|
6
|
Yorukoglu D, Yu YW, Peng J, Berger B. Compressive mapping for next-generation sequencing. Nat Biotechnol 2016; 34:374-6. [PMID: 27054987 PMCID: PMC5080835 DOI: 10.1038/nbt.3511] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/30/2022]
Affiliation(s)
- Deniz Yorukoglu
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA
| | - Yun William Yu
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA
- Department of Mathematics, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA
| | - Jian Peng
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA
- Department of Mathematics, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA
- Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, Illinois, 61801, USA
| | - Bonnie Berger
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA
- Department of Mathematics, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA
| |
Collapse
|
7
|
Evidence of Extensive Alternative Splicing in Post Mortem Human Brain HTT Transcription by mRNA Sequencing. PLoS One 2015; 10:e0141298. [PMID: 26496077 PMCID: PMC4619731 DOI: 10.1371/journal.pone.0141298] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/03/2015] [Accepted: 10/07/2015] [Indexed: 11/30/2022] Open
Abstract
Despite 20 years since its discovery, the gene responsible for Huntington’s Disease, HTT, has still not had its function or transcriptional profile completely characterized. In response to a recent report by Ruzo et al. of several novel splice forms of HTT in human embryonic stem cell lines, we have analyzed a set of mRNA sequencing datasets from post mortem human brain from Huntington’s disease, Parkinson’s disease, and neurologically normal control subjects to evaluate support for previously observed and to identify novel splice patterns. A custom analysis pipeline produced supporting evidence for some of the results reported by two previous studies of alternative isoforms as well as identifying previously unreported splice patterns. All of the alternative splice patterns were of relatively low abundance compared to the canonical splice form.
Collapse
|
8
|
Spies D, Ciaudo C. Dynamics in Transcriptomics: Advancements in RNA-seq Time Course and Downstream Analysis. Comput Struct Biotechnol J 2015; 13:469-77. [PMID: 26430493 PMCID: PMC4564389 DOI: 10.1016/j.csbj.2015.08.004] [Citation(s) in RCA: 55] [Impact Index Per Article: 5.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/01/2015] [Revised: 08/05/2015] [Accepted: 08/07/2015] [Indexed: 12/17/2022] Open
Abstract
Analysis of gene expression has contributed to a plethora of biological and medical research studies. Microarrays have been intensively used for the profiling of gene expression during diverse developmental processes, treatments and diseases. New massively parallel sequencing methods, often named as RNA-sequencing (RNA-seq) are extensively improving our understanding of gene regulation and signaling networks. Computational methods developed originally for microarrays analysis can now be optimized and applied to genome-wide studies in order to have access to a better comprehension of the whole transcriptome. This review addresses current challenges on RNA-seq analysis and specifically focuses on new bioinformatics tools developed for time series experiments. Furthermore, possible improvements in analysis, data integration as well as future applications of differential expression analysis are discussed.
Collapse
Affiliation(s)
- Daniel Spies
- Swiss Federal Institute of Technology Zurich, Department of Biology, Institute of Molecular Health Sciences, Zurich, Otto-Stern Weg 7, 8093 Zurich, Switzerland
- Life Science Zurich Graduate School, Molecular Life Science Program, University of Zurich, Institute of Molecular Life Sciences, Winterthurerstrasse 190, 8057 Zurich, Switzerland
| | - Constance Ciaudo
- Swiss Federal Institute of Technology Zurich, Department of Biology, Institute of Molecular Health Sciences, Zurich, Otto-Stern Weg 7, 8093 Zurich, Switzerland
| |
Collapse
|
9
|
Li W, Freudenberg J. Characterizing regions in the human genome unmappable by next-generation-sequencing at the read length of 1000 bases. Comput Biol Chem 2014; 53 Pt A:108-17. [PMID: 25241312 DOI: 10.1016/j.compbiolchem.2014.08.015] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 07/11/2014] [Indexed: 12/31/2022]
Abstract
Repetitive and redundant regions of a genome are particularly problematic for mapping sequencing reads. In the present paper, we compile a list of the unmappable regions in the human genome based on the following definition: hypothetical reads with length 1 kb which cannot be uniquely mapped with zero-mismatch alignment for the described regions, considering both the forward and reverse strand. The respective collection of unmappable regions covers 0.77% of the sequence of human autosomes and 8.25% of the sex chromosomes in the reference genome GRCh37/hg19 (overall 1.23%). Not surprisingly, our unmappable regions overlap greatly with segmental duplication, transposable elements, and structural variants. About 99.8% of bases in our unmappable regions are part of either segmental duplication or transposable elements and 98.3% overlap structural variant annotations. Notably, some of these regions overlap units with important biological functions, including 4% of protein-coding genes. In contrast, these regions have zero intersection with the ultraconserved elements, very low overlap with microRNAs, tRNAs, pseudogenes, CpG islands, tandem repeats, microsatellites, sensitive non-coding regions, and the mapping blacklist regions from the ENCODE project.
Collapse
Affiliation(s)
- Wentian Li
- The Robert S. Boas Center for Genomics and Human Genetics, The Feinstein Institute for Medical Research, North Shore LIJ Health System, 350 Community Drive, Manhasset, NY 11030, USA.
| | - Jan Freudenberg
- The Robert S. Boas Center for Genomics and Human Genetics, The Feinstein Institute for Medical Research, North Shore LIJ Health System, 350 Community Drive, Manhasset, NY 11030, USA
| |
Collapse
|