1
|
Klepikova AV, Kasianov AS, Chesnokov MS, Lazarevich NL, Penin AA, Logacheva M. Effect of method of deduplication on estimation of differential gene expression using RNA-seq. PeerJ 2017; 5:e3091. [PMID: 28321364 PMCID: PMC5357343 DOI: 10.7717/peerj.3091] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/29/2016] [Accepted: 02/14/2017] [Indexed: 12/11/2022] Open
Abstract
Background RNA-seq is a useful tool for analysis of gene expression. However, its robustness is greatly affected by a number of artifacts. One of them is the presence of duplicated reads. Results To infer the influence of different methods of removal of duplicated reads on estimation of gene expression in cancer genomics, we analyzed paired samples of hepatocellular carcinoma (HCC) and non-tumor liver tissue. Four protocols of data analysis were applied to each sample: processing without deduplication, deduplication using a method implemented in SAMtools, and deduplication based on one or two molecular indices (MI). We also analyzed the influence of sequencing layout (single read or paired end) and read length. We found that deduplication without MI greatly affects estimated expression values; this effect is the most pronounced for highly expressed genes. Conclusion The use of unique molecular identifiers greatly improves accuracy of RNA-seq analysis, especially for highly expressed genes. We developed a set of scripts that enable handling of MI and their incorporation into RNA-seq analysis pipelines. Deduplication without MI affects results of differential gene expression analysis, producing a high proportion of false negative results. The absence of duplicate read removal is biased towards false positives. In those cases where using MI is not possible, we recommend using paired-end sequencing layout.
Collapse
Affiliation(s)
- Anna V Klepikova
- Institute for Information Transmission Problems of the Russian Academy of Sciences, Moscow, Russia.,A. N. Belozersky Institute of Physico-Chemical Biology, Lomonosov Moscow State University, Moscow, Russia
| | - Artem S Kasianov
- A. N. Belozersky Institute of Physico-Chemical Biology, Lomonosov Moscow State University, Moscow, Russia.,N. I. Vavilov Institute for General Genetics, Moscow, Russia
| | - Mikhail S Chesnokov
- N.N. Blokhin Russian Cancer Research Center of the Ministry of Health of the Russian Federation, Moscow, Russia
| | - Natalia L Lazarevich
- N.N. Blokhin Russian Cancer Research Center of the Ministry of Health of the Russian Federation, Moscow, Russia.,Department of Biology, Lomonosov Moscow State University, Moscow, Russia
| | - Aleksey A Penin
- Institute for Information Transmission Problems of the Russian Academy of Sciences, Moscow, Russia.,A. N. Belozersky Institute of Physico-Chemical Biology, Lomonosov Moscow State University, Moscow, Russia.,Department of Biology, Lomonosov Moscow State University, Moscow, Russia
| | - Maria Logacheva
- Institute for Information Transmission Problems of the Russian Academy of Sciences, Moscow, Russia.,A. N. Belozersky Institute of Physico-Chemical Biology, Lomonosov Moscow State University, Moscow, Russia.,Extreme Biology Laboratory, Institute of Fundamental Medicine and Biology, Kazan Federal University, Kazan
| |
Collapse
|
2
|
Mendizabal I, Shi L, Keller TE, Konopka G, Preuss TM, Hsieh TF, Hu E, Zhang Z, Su B, Yi SV. Comparative Methylome Analyses Identify Epigenetic Regulatory Loci of Human Brain Evolution. Mol Biol Evol 2016; 33:2947-2959. [PMID: 27563052 PMCID: PMC5062329 DOI: 10.1093/molbev/msw176] [Citation(s) in RCA: 40] [Impact Index Per Article: 5.0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/23/2022] Open
Abstract
How do epigenetic modifications change across species and how do these modifications affect evolution? These are fundamental questions at the forefront of our evolutionary epigenomic understanding. Our previous work investigated human and chimpanzee brain methylomes, but it was limited by the lack of outgroup data which is critical for comparative (epi)genomic studies. Here, we compared whole genome DNA methylation maps from brains of humans, chimpanzees and also rhesus macaques (outgroup) to elucidate DNA methylation changes during human brain evolution. Moreover, we validated that our approach is highly robust by further examining 38 human-specific DMRs using targeted deep genomic and bisulfite sequencing in an independent panel of 37 individuals from five primate species. Our unbiased genome-scan identified human brain differentially methylated regions (DMRs), irrespective of their associations with annotated genes. Remarkably, over half of the newly identified DMRs locate in intergenic regions or gene bodies. Nevertheless, their regulatory potential is on par with those of promoter DMRs. An intriguing observation is that DMRs are enriched in active chromatin loops, suggesting human-specific evolutionary remodeling at a higher-order chromatin structure. These findings indicate that there is substantial reprogramming of epigenomic landscapes during human brain evolution involving noncoding regions.
Collapse
Affiliation(s)
- Isabel Mendizabal
- School of Biological Sciences, Institute of Bioengineering and Bioscience, Georgia Institute of Technology, Atlanta, GA Department of Genetics, Physical Anthropology and Animal Physiology, University of the Basque Country, Leioa, Spain
| | - Lei Shi
- State Key Laboratory of Genetic Resources and Evolution, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming, China The Molecular & Behavioral Neuroscience Institute, University of Michigan, Ann Arbor, MI
| | - Thomas E Keller
- School of Biological Sciences, Institute of Bioengineering and Bioscience, Georgia Institute of Technology, Atlanta, GA
| | - Genevieve Konopka
- Department of Neuroscience, University of Texas Southwestern Medical Center, Dallas, TX
| | - Todd M Preuss
- Division of Neuropharmacology and Neurologic Diseases & Center for Translational Social Neuroscience, Department of Pathology and Laboratory Medicine, Yerkes National Primate Research Center, Emory University School of Medicine, Emory University, Atlanta, GA
| | - Tzung-Fu Hsieh
- Department of Plant and Microbial Biology and Plants for Human Health Institute, North Carolina State University, Raleigh, NC
| | - Enzhi Hu
- State Key Laboratory of Genetic Resources and Evolution, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming, China Kunming College of Life Science, University of Chinese Academy of Sciences, Beijing, China
| | - Zhe Zhang
- State Key Laboratory of Genetic Resources and Evolution, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming, China
| | - Bing Su
- State Key Laboratory of Genetic Resources and Evolution, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming, China
| | - Soojin V Yi
- School of Biological Sciences, Institute of Bioengineering and Bioscience, Georgia Institute of Technology, Atlanta, GA
| |
Collapse
|
3
|
Babayan A, Alawi M, Gormley M, Müller V, Wikman H, McMullin RP, Smirnov DA, Li W, Geffken M, Pantel K, Joosse SA. Comparative study of whole genome amplification and next generation sequencing performance of single cancer cells. Oncotarget 2017; 8:56066-80. [PMID: 28915574 DOI: 10.18632/oncotarget.10701] [Citation(s) in RCA: 48] [Impact Index Per Article: 6.0] [Reference Citation Analysis] [What about the content of this article? (0)] [Abstract] [Key Words] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/06/2016] [Accepted: 06/09/2016] [Indexed: 12/30/2022] Open
Abstract
BACKGROUND Whole genome amplification (WGA) is required for single cell genotyping. Effectiveness of currently available WGA technologies in combination with next generation sequencing (NGS) and material preservation is still elusive. RESULTS In respect to the accuracy of SNP/mutation, indel, and copy number aberrations (CNA) calling, the HiSeq2000 platform outperformed IonProton in all aspects. Furthermore, more accurate SNP/mutation and indel calling was demonstrated using single tumor cells obtained from EDTA-collected blood in respect to CellSave-preserved blood, whereas CNA analysis in our study was not detectably affected by fixation. Although MDA-based WGA yielded the highest DNA amount, DNA quality was not adequate for downstream analysis. PCR-based WGA demonstrates superiority over MDA-PCR combining technique for SNP and indel analysis in single cells. However, SNP calling performance of MDA-PCR WGA improves with increasing amount of input DNA, whereas CNA analysis does not. The performance of PCR-based WGA did not significantly improve with increase of input material. CNA profiles of single cells, amplified with MDA-PCR technique and sequenced on both HiSeq2000 and IonProton platforms, resembled unamplified DNA the most. MATERIALS AND METHODS We analyzed the performance of PCR-based, multiple-displacement amplification (MDA)-based, and MDA-PCR combining WGA techniques (WGA kits Ampli1, REPLI-g, and PicoPlex, respectively) on single and pooled tumor cells obtained from EDTA- and CellSave-preserved blood and archival material. Amplified DNA underwent exome-Seq with the Illumina HiSeq2000 and ThermoFisher IonProton platforms. CONCLUSION We demonstrate the feasibility of single cell genotyping of differently preserved material, nevertheless, WGA and NGS approaches have to be chosen carefully depending on the study aims.
Collapse
|
4
|
|
5
|
Budak H, Kantar M. Harnessing NGS and Big Data Optimally: Comparison of miRNA Prediction from Assembled versus Non-assembled Sequencing Data--The Case of the Grass Aegilops tauschii Complex Genome. OMICS 2015; 19:407-15. [PMID: 26061358 DOI: 10.1089/omi.2015.0038] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 01/25/2023]
Abstract
MicroRNAs (miRNAs) are small, endogenous, non-coding RNA molecules that regulate gene expression at the post-transcriptional level. As high-throughput next generation sequencing (NGS) and Big Data rapidly accumulate for various species, efforts for in silico identification of miRNAs intensify. Surprisingly, the effect of the input genomics sequence on the robustness of miRNA prediction was not evaluated in detail to date. In the present study, we performed a homology-based miRNA and isomiRNA prediction of the 5D chromosome of bread wheat progenitor, Aegilops tauschii, using two distinct sequence data sets as input: (1) raw sequence reads obtained from 454-GS FLX Titanium sequencing platform and (2) an assembly constructed from these reads. We also compared this method with a number of available plant sequence datasets. We report here the identification of 62 and 22 miRNAs from raw reads and the assembly, respectively, of which 16 were predicted with high confidence from both datasets. While raw reads promoted sensitivity with the high number of miRNAs predicted, 55% (12 out of 22) of the assembly-based predictions were supported by previous observations, bringing specificity forward compared to the read-based predictions, of which only 37% were supported. Importantly, raw reads could identify several repeat-related miRNAs that could not be detected with the assembly. However, raw reads could not capture 6 miRNAs, for which the stem-loops could only be covered by the relatively longer sequences from the assembly. In summary, the comparison of miRNA datasets obtained by these two strategies revealed that utilization of raw reads, as well as assemblies for in silico prediction, have distinct advantages and disadvantages. Consideration of these important nuances can benefit future miRNA identification efforts in the current age of NGS and Big Data driven life sciences innovation.
Collapse
Affiliation(s)
- Hikmet Budak
- Molecular Biology, Genetics and Bioengineering Program, Faculty of Engineering and Natural Sciences, Sabanci University , Istanbul, Turkey
| | - Melda Kantar
- Molecular Biology, Genetics and Bioengineering Program, Faculty of Engineering and Natural Sciences, Sabanci University , Istanbul, Turkey
| |
Collapse
|
6
|
Oulas A, Pavloudi C, Polymenakou P, Pavlopoulos GA, Papanikolaou N, Kotoulas G, Arvanitidis C, Iliopoulos I. Metagenomics: tools and insights for analyzing next-generation sequencing data derived from biodiversity studies. Bioinform Biol Insights 2015; 9:75-88. [PMID: 25983555 PMCID: PMC4426941 DOI: 10.4137/bbi.s12462] [Citation(s) in RCA: 176] [Impact Index Per Article: 19.6] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/05/2014] [Revised: 03/09/2015] [Accepted: 03/13/2015] [Indexed: 12/14/2022] Open
Abstract
Advances in next-generation sequencing (NGS) have allowed significant breakthroughs in microbial ecology studies. This has led to the rapid expansion of research in the field and the establishment of "metagenomics", often defined as the analysis of DNA from microbial communities in environmental samples without prior need for culturing. Many metagenomics statistical/computational tools and databases have been developed in order to allow the exploitation of the huge influx of data. In this review article, we provide an overview of the sequencing technologies and how they are uniquely suited to various types of metagenomic studies. We focus on the currently available bioinformatics techniques, tools, and methodologies for performing each individual step of a typical metagenomic dataset analysis. We also provide future trends in the field with respect to tools and technologies currently under development. Moreover, we discuss data management, distribution, and integration tools that are capable of performing comparative metagenomic analyses of multiple datasets using well-established databases, as well as commonly used annotation standards.
Collapse
Affiliation(s)
- Anastasis Oulas
- Institute of Marine Biology, Biotechnology and Aquaculture, Hellenic Centre for Marine Research, Heraklion, Crete, Greece
| | - Christina Pavloudi
- Institute of Marine Biology, Biotechnology and Aquaculture, Hellenic Centre for Marine Research, Heraklion, Crete, Greece
- Department of Biology, University of Ghent, Ghent, Belgium
- Department of Microbial Ecophysiology, University of Bremen, Bremen, Germany
| | - Paraskevi Polymenakou
- Institute of Marine Biology, Biotechnology and Aquaculture, Hellenic Centre for Marine Research, Heraklion, Crete, Greece
| | - Georgios A Pavlopoulos
- Division of Basic Sciences, University of Crete, Medical School, Heraklion, Crete, Greece
| | - Nikolas Papanikolaou
- Division of Basic Sciences, University of Crete, Medical School, Heraklion, Crete, Greece
| | - Georgios Kotoulas
- Institute of Marine Biology, Biotechnology and Aquaculture, Hellenic Centre for Marine Research, Heraklion, Crete, Greece
| | - Christos Arvanitidis
- Institute of Marine Biology, Biotechnology and Aquaculture, Hellenic Centre for Marine Research, Heraklion, Crete, Greece
| | - Ioannis Iliopoulos
- Division of Basic Sciences, University of Crete, Medical School, Heraklion, Crete, Greece
| |
Collapse
|
7
|
Abstract
Background Reducing the effects of sequencing errors and PCR artifacts has emerged as an essential component in amplicon-based metagenomic studies. Denoising algorithms have been designed that can reduce error rates in mock community data, but they change the sequence data in a manner that can be inconsistent with the process of removing errors in studies of real communities. In addition, they are limited by the size of the dataset and the sequencing technology used. Results FlowClus uses a systematic approach to filter and denoise reads efficiently. When denoising real datasets, FlowClus provides feedback about the process that can be used as the basis to adjust the parameters of the algorithm to suit the particular dataset. When used to analyze a mock community dataset, FlowClus produced a lower error rate compared to other denoising algorithms, while retaining significantly more sequence information. Among its other attributes, FlowClus can analyze longer reads being generated from all stages of 454 sequencing technology, as well as from Ion Torrent. It has processed a large dataset of 2.2 million GS-FLX Titanium reads in twelve hours; using its more efficient (but less precise) trie analysis option, this time was further reduced, to seven minutes. Conclusions Many of the amplicon-based metagenomics datasets generated over the last several years have been processed through a denoising pipeline that likely caused deleterious effects on the raw data. By using FlowClus, one can avoid such negative outcomes while maintaining control over the filtering and denoising processes. Because of its efficiency, FlowClus can be used to re-analyze multiple large datasets together, thereby leading to more standardized conclusions. FlowClus is freely available on GitHub (jsh58/FlowClus); it is written in C and supported on Linux. Electronic supplementary material The online version of this article (doi:10.1186/s12859-015-0532-1) contains supplementary material, which is available to authorized users.
Collapse
Affiliation(s)
- John M Gaspar
- Department of Molecular Cellular & Biomedical Sciences, University of New Hampshire, Durham, NH, USA.
| | - W Kelley Thomas
- Department of Molecular Cellular & Biomedical Sciences, University of New Hampshire, Durham, NH, USA.
| |
Collapse
|
8
|
|
9
|
Knief C. Analysis of plant microbe interactions in the era of next generation sequencing technologies. Front Plant Sci 2014; 5:216. [PMID: 24904612 PMCID: PMC4033234 DOI: 10.3389/fpls.2014.00216] [Citation(s) in RCA: 91] [Impact Index Per Article: 9.1] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Received: 02/15/2014] [Accepted: 04/30/2014] [Indexed: 05/18/2023]
Abstract
Next generation sequencing (NGS) technologies have impressively accelerated research in biological science during the last years by enabling the production of large volumes of sequence data to a drastically lower price per base, compared to traditional sequencing methods. The recent and ongoing developments in the field allow addressing research questions in plant-microbe biology that were not conceivable just a few years ago. The present review provides an overview of NGS technologies and their usefulness for the analysis of microorganisms that live in association with plants. Possible limitations of the different sequencing systems, in particular sources of errors and bias, are critically discussed and methods are disclosed that help to overcome these shortcomings. A focus will be on the application of NGS methods in metagenomic studies, including the analysis of microbial communities by amplicon sequencing, which can be considered as a targeted metagenomic approach. Different applications of NGS technologies are exemplified by selected research articles that address the biology of the plant associated microbiota to demonstrate the worth of the new methods.
Collapse
Affiliation(s)
- Claudia Knief
- Institute of Crop Science and Resource Conservation—Molecular Biology of the Rhizosphere, Faculty of Agriculture, University of BonnBonn, Germany
| |
Collapse
|
10
|
Zhou X, Rokas A. Prevention, diagnosis and treatment of high-throughput sequencing data pathologies. Mol Ecol 2014; 23:1679-700. [DOI: 10.1111/mec.12680] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.9] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/04/2013] [Revised: 01/17/2014] [Accepted: 01/22/2014] [Indexed: 12/17/2022]
Affiliation(s)
- Xiaofan Zhou
- Department of Biological Sciences; Vanderbilt University; Nashville TN 37235 USA
| | - Antonis Rokas
- Department of Biological Sciences; Vanderbilt University; Nashville TN 37235 USA
| |
Collapse
|
11
|
Abstract
Background The field of population genetics use the genetic composition of populations to study the effects of ecological and evolutionary factors, including selection, genetic drift, mating structure, and migration. Until recently, these studies were usually based upon the analysis of relatively few (typically 10–20) DNA markers on samples from multiple populations. In contrast, high-throughput sequencing provides large amounts of data and consequently very high resolution genetic information. Recent technological developments are rapidly making this a cost-effective alternative. In addition, sequencing allows both the direct study of genomic differences between population, and the discovery of single nucleotide polymorphism marker that can be subsequently used in high-throughput genotyping. Much of the analysis in population genetics was developed before large scale sequencing became feasible. Methods often do not take into account the characteristics of the different sequencing technologies, and consequently, may not always be well suited to this kind of data. Results Although the FlowSim suite of tools originally targeted simulation of de novo 454 genomics data, recent developments and enhancements makes it suitable also for simulating other kinds of data. We examine its application to population genomics, and provide examples and supplementary scripts and utilities to aid in this task. Conclusions Simulation is an important tool to study and develop methods in many fields, and here we demonstrate how to simulate a high-throughput sequencing dataset for population genomics.
Collapse
Affiliation(s)
- Ketil Malde
- Institute of Marine Research, Nordnesgaten 50, Bergen, Norway.
| |
Collapse
|
12
|
Abstract
MOTIVATION Determining the fraction of the diversity within a microbial community sampled and the amount of sequencing required to cover the total diversity represent challenging issues for metagenomics studies. Owing to these limitations, central ecological questions with respect to the global distribution of microbes and the functional diversity of their communities cannot be robustly assessed. RESULTS We introduce Nonpareil, a method to estimate and project coverage in metagenomes. Nonpareil does not rely on high-quality assemblies, operational taxonomic unit calling or comprehensive reference databases; thus, it is broadly applicable to metagenomic studies. Application of Nonpareil on available metagenomic datasets provided estimates on the relative complexity of soil, freshwater and human microbiome communities, and suggested that ∼200 Gb of sequencing data are required for 95% abundance-weighted average coverage of the soil communities analyzed. AVAILABILITY AND IMPLEMENTATION Nonpareil is available at https://github.com/lmrodriguezr/nonpareil/ under the Artistic License 2.0.
Collapse
Affiliation(s)
- Luis M Rodriguez-R
- Center for Bioinformatics and Computational Genomics, School of Biology and School of Civil and Environmental Engineering, Georgia Institute of Technology, 311 Ferst Drive, Ford ES&T Building, Suite 3224, Atlanta, GA 30332, USA
| | | |
Collapse
|
13
|
Plough LV, Marko PB. Characterization of microsatellite loci and repeat density in the gooseneck barnacle, Pollicipes elegans, using next generation sequencing. J Hered 2013; 105:136-42. [PMID: 24115106 DOI: 10.1093/jhered/est064] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022] Open
Abstract
Pollicipes elegans is a commercially important and biogeographically significant rocky-shore gooseneck barnacle found along the eastern Pacific coasts of Peru, El Salvador, and Mexico. Little is known about its reproductive biology, and no genetic resources exist despite its growing importance as a fisheries species in the region. Next generation sequencing methods can provide rapid and cost-effective development of molecular markers such as microsatellites, which can be applied to studies of paternity, parentage, and population structure in this understudied species. Here, we used Roche 454 pyrosequencing to develop microsatellite markers in P. elegans and made genomic comparisons of repeat density and repeat class frequency with other arthropods and more distantly related taxa. We identified 13 809 repeats of 1-6 bp, or a density of 9744 bp of repeat per megabase queried, which was intermediate in the range of taxonomic groups compared. Comparison of repeat class frequency distributions revealed that P. elegans was most similar to Drosophila melanogaster rather than the more closely related crustacean Daphnia pulex. We successfully isolated 15 polymorphic markers with an average of 9.4 alleles per locus and average observed and expected heterozygosities of 0.501 and 0.597, respectively. Four loci were found to be out of Hardy-Weinberg equilibrium, likely due to the presence of null alleles. A preliminary population genetic analysis revealed low but significant differentiation between a Peruvian (n = 47) and Mexican (n = 48) population (F(ST) = 0.039) and markedly reduced genetic diversity in Peru. These markers should facilitate future studies of paternity, parentage, and population structure in this species.
Collapse
Affiliation(s)
- Louis V Plough
- the Department of Biological Sciences, Clemson University, 132 Long Hall, Clemson 29634, South Carolina
| | | |
Collapse
|