1
|
Morrissey A, Shi J, James DQ, Mahony S. Accurate allocation of multimapped reads enables regulatory element analysis at repeats. Genome Res 2024; 34:937-951. [PMID: 38986578 PMCID: PMC11293539 DOI: 10.1101/gr.278638.123] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/16/2023] [Accepted: 06/14/2024] [Indexed: 07/12/2024]
Abstract
Transposable elements (TEs) and other repetitive regions have been shown to contain gene regulatory elements, including transcription factor binding sites. However, regulatory elements harbored by repeats have proven difficult to characterize using short-read sequencing assays such as ChIP-seq or ATAC-seq. Most regulatory genomics analysis pipelines discard "multimapped" reads that align equally well to multiple genomic locations. Because multimapped reads arise predominantly from repeats, current analysis pipelines fail to detect a substantial portion of regulatory events that occur in repetitive regions. To address this shortcoming, we developed Allo, a new approach to allocate multimapped reads in an efficient, accurate, and user-friendly manner. Allo combines probabilistic mapping of multimapped reads with a convolutional neural network that recognizes the read distribution features of potential peaks, offering enhanced accuracy in multimapping read assignment. Allo also provides read-level output in the form of a corrected alignment file, making it compatible with existing regulatory genomics analysis pipelines and downstream peak-finders. In a demonstration application on CTCF ChIP-seq data, we show that Allo results in the discovery of thousands of new CTCF peaks. Many of these peaks contain the expected cognate motif and/or serve as TAD boundaries. We additionally apply Allo to a diverse collection of ENCODE ChIP-seq data sets, resulting in multiple previously unidentified interactions between transcription factors and repetitive element families. Finally, we show that Allo may be particularly beneficial in identifying ChIP-seq peaks at centromeres, near segmentally duplicated genes, and in younger TEs, enabling new regulatory analyses in these regions.
Collapse
Affiliation(s)
- Alexis Morrissey
- Center for Eukaryotic Gene Regulation, Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, Pennsylvania 16802, USA
| | - Jeffrey Shi
- Center for Eukaryotic Gene Regulation, Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, Pennsylvania 16802, USA
| | - Daniela Q James
- Center for Eukaryotic Gene Regulation, Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, Pennsylvania 16802, USA
| | - Shaun Mahony
- Center for Eukaryotic Gene Regulation, Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, Pennsylvania 16802, USA
| |
Collapse
|
2
|
Zhao P, Peng C, Fang L, Wang Z, Liu GE. Taming transposable elements in livestock and poultry: a review of their roles and applications. Genet Sel Evol 2023; 55:50. [PMID: 37479995 PMCID: PMC10362595 DOI: 10.1186/s12711-023-00821-2] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/16/2023] [Accepted: 06/30/2023] [Indexed: 07/23/2023] Open
Abstract
Livestock and poultry play a significant role in human nutrition by converting agricultural by-products into high-quality proteins. To meet the growing demand for safe animal protein, genetic improvement of livestock must be done sustainably while minimizing negative environmental impacts. Transposable elements (TE) are important components of livestock and poultry genomes, contributing to their genetic diversity, chromatin states, gene regulatory networks, and complex traits of economic value. However, compared to other species, research on TE in livestock and poultry is still in its early stages. In this review, we analyze 72 studies published in the past 20 years, summarize the TE composition in livestock and poultry genomes, and focus on their potential roles in functional genomics. We also discuss bioinformatic tools and strategies for integrating multi-omics data with TE, and explore future directions, feasibility, and challenges of TE research in livestock and poultry. In addition, we suggest strategies to apply TE in basic biological research and animal breeding. Our goal is to provide a new perspective on the importance of TE in livestock and poultry genomes.
Collapse
Affiliation(s)
- Pengju Zhao
- Hainan Institute of Zhejiang University, Hainan Sanya, 572000, China
- College of Animal Sciences, Zhejiang University, Zhejiang, Hangzhou, People's Republic of China
| | - Chen Peng
- Hainan Institute of Zhejiang University, Hainan Sanya, 572000, China
- College of Animal Sciences, Zhejiang University, Zhejiang, Hangzhou, People's Republic of China
| | - Lingzhao Fang
- Center for Quantitative Genetics and Genomics, Aarhus University, 8000, Aarhus, Denmark.
| | - Zhengguang Wang
- Hainan Institute of Zhejiang University, Hainan Sanya, 572000, China.
- College of Animal Sciences, Zhejiang University, Zhejiang, Hangzhou, People's Republic of China.
| | - George E Liu
- Animal Genomics and Improvement Laboratory, Beltsville Agricultural Research Center, Agricultural Research Service, USDA, Beltsville, MD, 20705, USA.
| |
Collapse
|
3
|
Libbrecht MW, Chan RCW, Hoffman MM. Segmentation and genome annotation algorithms for identifying chromatin state and other genomic patterns. PLoS Comput Biol 2021; 17:e1009423. [PMID: 34648491 PMCID: PMC8516206 DOI: 10.1371/journal.pcbi.1009423] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/19/2022] Open
Abstract
Segmentation and genome annotation (SAGA) algorithms are widely used to understand genome activity and gene regulation. These algorithms take as input epigenomic datasets, such as chromatin immunoprecipitation-sequencing (ChIP-seq) measurements of histone modifications or transcription factor binding. They partition the genome and assign a label to each segment such that positions with the same label exhibit similar patterns of input data. SAGA algorithms discover categories of activity such as promoters, enhancers, or parts of genes without prior knowledge of known genomic elements. In this sense, they generally act in an unsupervised fashion like clustering algorithms, but with the additional simultaneous function of segmenting the genome. Here, we review the common methodological framework that underlies these methods, review variants of and improvements upon this basic framework, and discuss the outlook for future work. This review is intended for those interested in applying SAGA methods and for computational researchers interested in improving upon them.
Collapse
Affiliation(s)
| | - Rachel C. W. Chan
- Department of Computer Science, University of Toronto, Toronto, Canada
- Princess Margaret Cancer Centre, University Health Network, Toronto, Canada
| | - Michael M. Hoffman
- Department of Computer Science, University of Toronto, Toronto, Canada
- Princess Margaret Cancer Centre, University Health Network, Toronto, Canada
- Department of Medical Biophysics, University of Toronto, Toronto, Canada
- Vector Institute for Artificial Intelligence, Toronto, Canada
| |
Collapse
|
4
|
Shah RN, Ruthenburg AJ. Sequence deeper without sequencing more: Bayesian resolution of ambiguously mapped reads. PLoS Comput Biol 2021; 17:e1008926. [PMID: 33872311 PMCID: PMC8084338 DOI: 10.1371/journal.pcbi.1008926] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/22/2020] [Revised: 04/29/2021] [Accepted: 03/30/2021] [Indexed: 11/18/2022] Open
Abstract
Next-generation sequencing (NGS) has transformed molecular biology and contributed to many seminal insights into genomic regulation and function. Apart from whole-genome sequencing, an NGS workflow involves alignment of the sequencing reads to the genome of study, after which the resulting alignments can be used for downstream analyses. However, alignment is complicated by the repetitive sequences; many reads align to more than one genomic locus, with 15-30% of the genome not being uniquely mappable by short-read NGS. This problem is typically addressed by discarding reads that do not uniquely map to the genome, but this practice can lead to systematic distortion of the data. Previous studies that developed methods for handling ambiguously mapped reads were often of limited applicability or were computationally intensive, hindering their broader usage. In this work, we present SmartMap: an algorithm that augments industry-standard aligners to enable usage of ambiguously mapped reads by assigning weights to each alignment with Bayesian analysis of the read distribution and alignment quality. SmartMap is computationally efficient, utilizing far fewer weighting iterations than previously thought necessary to process alignments and, as such, analyzing more than a billion alignments of NGS reads in approximately one hour on a desktop PC. By applying SmartMap to peak-type NGS data, including MNase-seq, ChIP-seq, and ATAC-seq in three organisms, we can increase read depth by up to 53% and increase the mapped proportion of the genome by up to 18% compared to analyses utilizing only uniquely mapped reads. We further show that SmartMap enables the analysis of more than 140,000 repetitive elements that could not be analyzed by traditional ChIP-seq workflows, and we utilize this method to gain insight into the epigenetic regulation of different classes of repetitive elements. These data emphasize both the dangers of discarding ambiguously mapped reads and their power for driving biological discovery.
Collapse
Affiliation(s)
- Rohan N. Shah
- Pritzker School of Medicine, Division of the Biological Sciences, The University of Chicago, Chicago, Illinois, United States of America
- Department of Molecular Biology and Cell Genetics, Division of the Biological Sciences, The University of Chicago, Chicago, Illinois, United States of America
- * E-mail: (RNS); (AJR)
| | - Alexander J. Ruthenburg
- Department of Molecular Biology and Cell Genetics, Division of the Biological Sciences, The University of Chicago, Chicago, Illinois, United States of America
- Department of Biochemistry and Molecular Biology, Division of the Biological Sciences, The University of Chicago, Chicago, Illinois, United States of America
- * E-mail: (RNS); (AJR)
| |
Collapse
|
5
|
O'Neill K, Brocks D, Hammell MG. Mobile genomics: tools and techniques for tackling transposons. Philos Trans R Soc Lond B Biol Sci 2020; 375:20190345. [PMID: 32075565 PMCID: PMC7061981 DOI: 10.1098/rstb.2019.0345] [Citation(s) in RCA: 39] [Impact Index Per Article: 7.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 11/20/2019] [Indexed: 12/22/2022] Open
Abstract
Next-generation sequencing approaches have fundamentally changed the types of questions that can be asked about gene function and regulation. With the goal of approaching truly genome-wide quantifications of all the interaction partners and downstream effects of particular genes, these quantitative assays have allowed for an unprecedented level of detail in exploring biological interactions. However, many challenges remain in our ability to accurately describe and quantify the interactions that take place in those hard to reach and extremely repetitive regions of our genome comprised mostly of transposable elements (TEs). Tools dedicated to TE-derived sequences have lagged behind, making the inclusion of these sequences in genome-wide analyses difficult. Recent improvements, both computational and experimental, allow for the better inclusion of TE sequences in genomic assays and a renewed appreciation for the importance of TE biology. This review will discuss the recent improvements that have been made in the computational analysis of TE-derived sequences as well as the areas where such analysis still proves difficult. This article is part of a discussion meeting issue 'Crossroads between transposons and gene regulation'.
Collapse
Affiliation(s)
- Kathryn O'Neill
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY 11724, USA
| | - David Brocks
- Department of Computer Science and Applied Mathematics, The Weizmann Institute of Science, Rehovot, Israel
| | - Molly Gale Hammell
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY 11724, USA
| |
Collapse
|
6
|
Shen F, Kidd JM. Rapid, Paralog-Sensitive CNV Analysis of 2457 Human Genomes Using QuicK-mer2. Genes (Basel) 2020; 11:genes11020141. [PMID: 32013076 PMCID: PMC7073954 DOI: 10.3390/genes11020141] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/11/2020] [Revised: 01/21/2020] [Accepted: 01/24/2020] [Indexed: 12/22/2022] Open
Abstract
Gene duplication is a major mechanism for the evolution of gene novelty, and copy-number variation makes a major contribution to inter-individual genetic diversity. However, most approaches for studying copy-number variation rely upon uniquely mapping reads to a genome reference and are unable to distinguish among duplicated sequences. Specialized approaches to interrogate specific paralogs are comparatively slow and have a high degree of computational complexity, limiting their effective application to emerging population-scale data sets. We present QuicK-mer2, a self-contained, mapping-free approach that enables the rapid construction of paralog-specific copy-number maps from short-read sequence data. This approach is based on the tabulation of unique k-mer sequences from short-read data sets, and is able to analyze a 20X coverage human genome in approximately 20 min. We applied our approach to newly released sequence data from the 1000 Genomes Project, constructed paralog-specific copy-number maps from 2457 unrelated individuals, and uncovered copy-number variation of paralogous genes. We identify nine genes where none of the analyzed samples have a copy number of two, 92 genes where the majority of samples have a copy number other than two, and describe rare copy number variation effecting multiple genes at the APOBEC3 locus.
Collapse
Affiliation(s)
- Feichen Shen
- Department of Human Genetics, University of Michigan, Ann Arbor, MI 48109, USA;
| | - Jeffrey M. Kidd
- Department of Human Genetics and Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA
- Correspondence:
| |
Collapse
|
7
|
Zheng Y, Ay F, Keles S. Generative modeling of multi-mapping reads with mHi-C advances analysis of Hi-C studies. eLife 2019; 8:e38070. [PMID: 30702424 PMCID: PMC6450682 DOI: 10.7554/elife.38070] [Citation(s) in RCA: 22] [Impact Index Per Article: 3.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/03/2018] [Accepted: 01/30/2019] [Indexed: 12/18/2022] Open
Abstract
Current Hi-C analysis approaches are unable to account for reads that align to multiple locations, and hence underestimate biological signal from repetitive regions of genomes. We developed and validated mHi-C, a multi-read mapping strategy to probabilistically allocate Hi-C multi-reads. mHi-C exhibited superior performance over utilizing only uni-reads and heuristic approaches aimed at rescuing multi-reads on benchmarks. Specifically, mHi-C increased the sequencing depth by an average of 20% resulting in higher reproducibility of contact matrices and detected interactions across biological replicates. The impact of the multi-reads on the detection of significant interactions is influenced marginally by the relative contribution of multi-reads to the sequencing depth compared to uni-reads, cis-to-trans ratio of contacts, and the broad data quality as reflected by the proportion of mappable reads of datasets. Computational experiments highlighted that in Hi-C studies with short read lengths, mHi-C rescued multi-reads can emulate the effect of longer reads. mHi-C also revealed biologically supported bona fide promoter-enhancer interactions and topologically associating domains involving repetitive genomic regions, thereby unlocking a previously masked portion of the genome for conformation capture studies.
Collapse
Affiliation(s)
- Ye Zheng
- Department of StatisticsUniversity of Wisconsin-MadisonMadisonUnited States
| | - Ferhat Ay
- La Jolla Institute for Allergy and ImmunologyLa JollaUnited States
- School of MedicineUniversity of California, San DiegoLa JollaUnited States
| | - Sunduz Keles
- Department of StatisticsUniversity of Wisconsin-MadisonMadisonUnited States
- Department of Biostatistics and Medical InformaticsUniversity of Wisconsin-MadisonMadisonUnited States
| |
Collapse
|
8
|
Liu P, Sanalkumar R, Bresnick EH, Keleş S, Dewey CN. Integrative analysis with ChIP-seq advances the limits of transcript quantification from RNA-seq. Genome Res 2016; 26:1124-33. [PMID: 27405803 PMCID: PMC4971760 DOI: 10.1101/gr.199174.115] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/04/2015] [Accepted: 06/13/2016] [Indexed: 11/24/2022]
Abstract
RNA-seq is currently the technology of choice for global measurement of transcript abundances in cells. Despite its successes, isoform-level quantification remains difficult because short RNA-seq reads are often compatible with multiple alternatively spliced isoforms. Existing methods rely heavily on uniquely mapping reads, which are not available for numerous isoforms that lack regions of unique sequence. To improve quantification accuracy in such difficult cases, we developed a novel computational method, prior-enhanced RSEM (pRSEM), which uses a complementary data type in addition to RNA-seq data. We found that ChIP-seq data of RNA polymerase II and histone modifications were particularly informative in this approach. In qRT-PCR validations, pRSEM was shown to be superior than competing methods in estimating relative isoform abundances within or across conditions. Data-driven simulations suggested that pRSEM has a greatly decreased false-positive rate at the expense of a small increase in false-negative rate. In aggregate, our study demonstrates that pRSEM transforms existing capacity to precisely estimate transcript abundances, especially at the isoform level.
Collapse
Affiliation(s)
- Peng Liu
- Department of Biostatistics and Medical Informatics, University of Wisconsin, Madison, Wisconsin 53706, USA
| | - Rajendran Sanalkumar
- Department of Cell and Regenerative Biology, UW-Madison Blood Research Program, Carbone Cancer Center, University of Wisconsin School of Medicine and Public Health, University of Wisconsin, Madison, Wisconsin 53705, USA
| | - Emery H Bresnick
- Department of Cell and Regenerative Biology, UW-Madison Blood Research Program, Carbone Cancer Center, University of Wisconsin School of Medicine and Public Health, University of Wisconsin, Madison, Wisconsin 53705, USA
| | - Sündüz Keleş
- Department of Biostatistics and Medical Informatics, University of Wisconsin, Madison, Wisconsin 53706, USA; Department of Statistics, University of Wisconsin, Madison, Wisconsin 53706, USA
| | - Colin N Dewey
- Department of Biostatistics and Medical Informatics, University of Wisconsin, Madison, Wisconsin 53706, USA; Department of Computer Sciences, University of Wisconsin, Madison, Wisconsin 53706, USA
| |
Collapse
|