1
Ruohan W, Yuwei Z, Mengbo W, Xikang F, Jianping W, Shuai Cheng L. Resolving single-cell copy number profiling for large datasets. Brief Bioinform 2022; 23:6633647. [PMID: 35801503 DOI: 10.1093/bib/bbac264] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 03/16/2022] [Revised: 05/29/2022] [Accepted: 06/06/2022] [Indexed: 11/14/2022] Open
Abstract
The advances of single-cell DNA sequencing (scDNA-seq) enable us to characterize the genetic heterogeneity of cancer cells. However, the high noise and low coverage of scDNA-seq impede the estimation of copy number variations (CNVs). In addition, existing tools suffer from intensive execution time and often fail on large datasets. Here, we propose SeCNV, an efficient method that leverages structural entropy, to profile the copy numbers. SeCNV adopts a local Gaussian kernel to construct a matrix, depth congruent map (DCM), capturing the similarities between any two bins along the genome. Then, SeCNV partitions the genome into segments by minimizing the structural entropy from the DCM. With the partition, SeCNV estimates the copy numbers within each segment for cells. We simulate nine datasets with various breakpoint distributions and amplitudes of noise to benchmark SeCNV. SeCNV achieves a robust performance, i.e. the F1-scores are higher than 0.95 for breakpoint detections, significantly outperforming state-of-the-art methods. SeCNV successfully processes large datasets (>50 000 cells) within 4 min, while other tools fail to finish within the time limit, i.e. 120 h. We apply SeCNV to single-nucleus sequencing datasets from two breast cancer patients and acoustic cell tagmentation sequencing datasets from eight breast cancer patients. SeCNV successfully reproduces the distinct subclones and infers tumor heterogeneity. SeCNV is available at https://github.com/deepomicslab/SeCNV.
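The local Gaussian kernel idea mentioned in this abstract can be illustrated with a minimal sketch. This is not SeCNV's implementation: the function name `local_gaussian_similarity`, the `sigma` and `window` parameters, and the choice of squared Euclidean distance between per-bin depth vectors are all assumptions made for illustration.

```python
import math

def local_gaussian_similarity(depths, sigma=1.0, window=2):
    """Similarity between genomic bins within a local window.

    depths: list of per-bin depth vectors (one value per cell).
    sigma, window: hypothetical parameters (kernel width and the
    number of neighbouring bins compared on each side).
    """
    n = len(depths)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # Only bins within `window` positions of bin i are compared;
        # all other entries stay at 0.0.
        for j in range(max(0, i - window), min(n, i + window + 1)):
            d2 = sum((a - b) ** 2 for a, b in zip(depths[i], depths[j]))
            sim[i][j] = math.exp(-d2 / (2 * sigma ** 2))
    return sim

# Three bins, two cells: bins 0 and 1 share a depth profile, bin 2 differs.
example = local_gaussian_similarity(
    [[2.0, 2.0], [2.0, 2.0], [4.0, 4.0]], sigma=1.0, window=1
)
```

Under these assumptions, adjacent bins with the same depth profile score 1.0, and the score decays as the profiles diverge, which is the kind of signal a segmentation step could use to place breakpoints.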
Affiliation(s)
- Wang Ruohan
  - Department of Computer Science at City University of Hong Kong
- Zhang Yuwei
  - Department of Computer Science at City University of Hong Kong
- Wang Mengbo
  - Department of Computer Science at City University of Hong Kong
- Feng Xikang
  - School of Software, Northwestern Polytechnical University
- Wang Jianping
  - Department of Computer Science at City University of Hong Kong
- Li Shuai Cheng
  - Department of Computer Science at City University of Hong Kong

2
Baratta AM, Brandner AJ, Plasil SL, Rice RC, Farris SP. Advancements in Genomic and Behavioral Neuroscience Analysis for the Study of Normal and Pathological Brain Function. Front Mol Neurosci 2022; 15:905328. [PMID: 35813067 PMCID: PMC9259865 DOI: 10.3389/fnmol.2022.905328] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/26/2022] [Accepted: 06/06/2022] [Indexed: 11/16/2022] Open
Abstract
Psychiatric and neurological disorders are influenced by an undetermined number of genes and molecular pathways that may differ among afflicted individuals. Functionally testing and characterizing biological systems is essential to discovering the interrelationship among candidate genes and understanding the neurobiology of behavior. Recent advancements in genetic, genomic, and behavioral approaches are revolutionizing modern neuroscience. Although these tools are often used separately for independent experiments, combining these areas of research will provide a viable avenue for multidimensional studies on the brain. Herein we will briefly review some of the available tools that have been developed for characterizing novel cellular and animal models of human disease. A major challenge will be openly sharing resources and datasets to effectively integrate seemingly disparate types of information and how these systems impact human disorders. However, as these emerging technologies continue to be developed and adopted by the scientific community, they will bring about unprecedented opportunities in our understanding of molecular neuroscience and behavior.
Affiliation(s)
- Annalisa M. Baratta
  - Center for Neuroscience, School of Medicine, University of Pittsburgh, Pittsburgh, PA, United States
- Adam J. Brandner
  - Center for Neuroscience, School of Medicine, University of Pittsburgh, Pittsburgh, PA, United States
- Sonja L. Plasil
  - Department of Pharmacology & Chemical Biology, School of Medicine, University of Pittsburgh, Pittsburgh, PA, United States
- Rachel C. Rice
  - Center for Neuroscience, School of Medicine, University of Pittsburgh, Pittsburgh, PA, United States
- Sean P. Farris
  - Center for Neuroscience, School of Medicine, University of Pittsburgh, Pittsburgh, PA, United States
  - Department of Anesthesiology and Perioperative Medicine, School of Medicine, University of Pittsburgh, Pittsburgh, PA, United States
  - Department of Biomedical Informatics, School of Medicine, University of Pittsburgh, Pittsburgh, PA, United States
  - *Correspondence: Sean P. Farris,

3
Waters NR, Abram F, Brennan F, Holmes A, Pritchard L. riboSeed: leveraging prokaryotic genomic architecture to assemble across ribosomal regions. Nucleic Acids Res 2019; 46:e68. [PMID: 29608703 PMCID: PMC6009695 DOI: 10.1093/nar/gky212] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Received: 11/24/2017] [Accepted: 03/12/2018] [Indexed: 11/12/2022] Open
Abstract
The vast majority of bacterial genome sequencing has been performed using Illumina short reads. Because of the inherent difficulty of resolving repeated regions with short reads alone, only ∼10% of sequencing projects have resulted in a closed genome. The most common repeated regions are those coding for ribosomal operons (rDNAs), which occur in a bacterial genome between 1 and 15 times, and are typically used as sequence markers to classify and identify bacteria. Here, we exploit the genomic context in which rDNAs occur across taxa to improve assembly of these regions relative to de novo sequencing by using the conserved nature of rDNAs across taxa and the uniqueness of their flanking regions within a genome. We describe a method to construct targeted pseudocontigs generated by iteratively assembling reads that map to a reference genome’s rDNAs. These pseudocontigs are then used to more accurately assemble the newly sequenced chromosome. We show that this method, implemented as riboSeed, correctly bridges across adjacent contigs in bacterial genome assembly and, when used in conjunction with other genome polishing tools, can assist in closure of a genome.
Affiliation(s)
- Nicholas R Waters
  - Microbiology, School of Natural Sciences, National University of Ireland, Galway, H91 TK33, Ireland
  - Information and Computational Sciences, James Hutton Institute, Invergowrie, Dundee DD2 5DA, Scotland
- Florence Abram
  - Microbiology, School of Natural Sciences, National University of Ireland, Galway, H91 TK33, Ireland
- Fiona Brennan
  - Microbiology, School of Natural Sciences, National University of Ireland, Galway, H91 TK33, Ireland
  - Soil and Environmental Microbiology, Environmental Research Centre, Teagasc, Johnstown Castle, Wexford, Y35 TC97, Ireland
- Ashleigh Holmes
  - Cell and Molecular Sciences, James Hutton Institute, Invergowrie, Dundee DD2 5DA, Scotland
- Leighton Pritchard
  - Information and Computational Sciences, James Hutton Institute, Invergowrie, Dundee DD2 5DA, Scotland

4
Goldstein S, Beka L, Graf J, Klassen JL. Evaluation of strategies for the assembly of diverse bacterial genomes using MinION long-read sequencing. BMC Genomics 2019; 20:23. [PMID: 30626323 PMCID: PMC6325685 DOI: 10.1186/s12864-018-5381-7] [Citation(s) in RCA: 86] [Impact Index Per Article: 17.2] [Received: 07/12/2018] [Accepted: 12/16/2018] [Indexed: 11/23/2022] Open
Abstract
Background Short-read sequencing technologies have made microbial genome sequencing cheap and accessible. However, closing genomes is often costly and assembling short reads from genomes that are repetitive and/or have extreme %GC content remains challenging. Long-read, single-molecule sequencing technologies such as the Oxford Nanopore MinION have the potential to overcome these difficulties, although the best approach for harnessing their potential remains poorly evaluated. Results We sequenced nine bacterial genomes spanning a wide range of GC contents using Illumina MiSeq and Oxford Nanopore MinION sequencing technologies to determine the advantages of each approach, both individually and combined. Assemblies using only MiSeq reads were highly accurate but lacked contiguity, a deficiency that was partially overcome by adding MinION reads to these assemblies. Even more contiguous genome assemblies were generated by using MinION reads for initial assembly, but these assemblies were more error-prone and required further polishing. This was especially pronounced when Illumina libraries were biased, as was the case for our strains with both high and low GC content. Increased genome contiguity dramatically improved the annotation of insertion sequences and secondary metabolite biosynthetic gene clusters, likely because long-reads can disambiguate these highly repetitive but biologically important genomic regions. Conclusions Genome assembly using short-reads is challenged by repetitive sequences and extreme GC contents. Our results indicate that these difficulties can be largely overcome by using single-molecule, long-read sequencing technologies such as the Oxford Nanopore MinION. 
Using MinION reads for assembly followed by polishing with Illumina reads generated the most contiguous genomes with sufficient accuracy to enable the accurate annotation of important but difficult to sequence genomic features such as insertion sequences and secondary metabolite biosynthetic gene clusters. The combination of Oxford Nanopore and Illumina sequencing can therefore cost-effectively advance studies of microbial evolution and genome-driven drug discovery.
Affiliation(s)
- Sarah Goldstein
  - Department of Molecular and Cell Biology, University of Connecticut, Storrs, CT, USA
- Lidia Beka
  - Department of Molecular and Cell Biology, University of Connecticut, Storrs, CT, USA
- Joerg Graf
  - Department of Molecular and Cell Biology, University of Connecticut, Storrs, CT, USA
- Jonathan L Klassen
  - Department of Molecular and Cell Biology, University of Connecticut, Storrs, CT, USA

5
Khan M, Fadaie Z, Cornelis SS, Cremers FPM, Roosing S. Identification and Analysis of Genes Associated with Inherited Retinal Diseases. Methods Mol Biol 2019; 1834:3-27. [PMID: 30324433 DOI: 10.1007/978-1-4939-8669-9_1] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Indexed: 12/14/2022]
Abstract
Inherited retinal diseases (IRDs) display a very high degree of clinical and genetic heterogeneity, which poses challenges in finding the underlying defects in known IRD-associated genes and in identifying novel IRD-associated genes. Knowledge on the molecular and clinical aspects of IRDs has increased tremendously in the last decade. Here, we outline the state-of-the-art techniques to find the causative genetic variants, with special attention for next-generation sequencing which can combine molecular diagnostics and retinal disease gene identification. An important aspect is the functional assessment of rare variants with RNA and protein effects which can only be predicted in silico. We therefore describe the in vitro assessment of putative splice defects in human embryonic kidney cells. In addition, we outline the use of stem cell technology to generate photoreceptor precursor cells from patients' somatic cells which can subsequently be used for RNA and protein studies. Finally, we outline the in silico methods to interpret the causality of variants associated with inherited retinal disease and the registry of these variants.
Affiliation(s)
- Mubeen Khan
  - Department of Human Genetics, Donders Institute for Brain Cognition and Behaviour, Radboud University Medical Center, Nijmegen, The Netherlands
- Zeinab Fadaie
  - Department of Human Genetics, Donders Institute for Brain Cognition and Behaviour, Radboud University Medical Center, Nijmegen, The Netherlands
- Stéphanie S Cornelis
  - Department of Human Genetics, Donders Institute for Brain Cognition and Behaviour, Radboud University Medical Center, Nijmegen, The Netherlands
- Frans P M Cremers
  - Department of Human Genetics, Donders Institute for Brain Cognition and Behaviour, Radboud University Medical Center, Nijmegen, The Netherlands
- Susanne Roosing
  - Department of Human Genetics, Donders Institute for Brain Cognition and Behaviour, Radboud University Medical Center, Nijmegen, The Netherlands

6
Pu D, Xiao P. A real-time decoding sequencing technology—new possibility for high throughput sequencing. RSC Adv 2017. [DOI: 10.1039/c7ra06202h] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 12/30/2022] Open
Abstract
The challenges and corresponding solutions for making decoding sequencing compatible with high throughput sequencing (HTS) technologies are provided.
Affiliation(s)
- Dan Pu
  - School of Bioinformatics, Chongqing University of Posts and Telecommunications, Chongqing, China
  - State Key Laboratory of Bioelectronics
- Pengfeng Xiao
  - State Key Laboratory of Bioelectronics, School of Biological Science and Medical Engineering, Southeast University, Nanjing, China

7
Feng W, Zhao S, Xue D, Song F, Li Z, Chen D, He B, Hao Y, Wang Y, Liu Y. Improving alignment accuracy on homopolymer regions for semiconductor-based sequencing technologies. BMC Genomics 2016; 17 Suppl 7:521. [PMID: 27556417 PMCID: PMC5001236 DOI: 10.1186/s12864-016-2894-9] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.0] [Indexed: 12/03/2022] Open
Abstract
BACKGROUND Ion Torrent and Ion Proton are semiconductor-based sequencing technologies that feature rapid sequencing speed and low upfront and operating costs, thanks to the avoidance of modified nucleotides and optical measurements. Despite these advantages, Ion semiconductor sequencing technologies suffer from much reduced sequencing accuracy at genomic loci with homopolymer repeats of the same nucleotide. This limitation significantly reduces their efficiency for biological applications that aim to accurately identify genetic variants. RESULTS In this study, we propose a Bayesian inference-based method that takes advantage of the signal distributions of the electrical voltages that are measured for all the homopolymers of a fixed length. By cross-referencing the length of homopolymers in the reference genome and the voltage signal distribution derived from the experiment, the proposed integrated model significantly improves the alignment accuracy around the homopolymer regions. CONCLUSIONS The proposed model improves alignment accuracy on homopolymer regions for semiconductor-based sequencing technologies, and similar strategies can also be used on other high-throughput sequencing technologies that share similar limitations.
Affiliation(s)
- Weixing Feng
  - Automation College, Harbin Engineering University, Harbin, Heilongjiang 150001, People’s Republic of China
- Sen Zhao
  - Automation College, Harbin Engineering University, Harbin, Heilongjiang 150001, People’s Republic of China
- Dingkai Xue
  - Automation College, Harbin Engineering University, Harbin, Heilongjiang 150001, People’s Republic of China
- Fengfei Song
  - Automation College, Harbin Engineering University, Harbin, Heilongjiang 150001, People’s Republic of China
- Ziwei Li
  - Automation College, Harbin Engineering University, Harbin, Heilongjiang 150001, People’s Republic of China
- Duojiao Chen
  - Automation College, Harbin Engineering University, Harbin, Heilongjiang 150001, People’s Republic of China
- Bo He
  - Automation College, Harbin Engineering University, Harbin, Heilongjiang 150001, People’s Republic of China
- Yangyang Hao
  - Center for Computational Biology and Bioinformatics, Indiana University School of Medicine, Indianapolis, IN 46202 USA
- Yadong Wang
  - School of Computer Science and Technology, Harbin Institute of Technology, Harbin, Heilongjiang 150001, People’s Republic of China
- Yunlong Liu
  - Automation College, Harbin Engineering University, Harbin, Heilongjiang 150001, People’s Republic of China
  - Center for Computational Biology and Bioinformatics, Indiana University School of Medicine, Indianapolis, IN 46202 USA

8
SNP Mining in Functional Genes from Nonmodel Species by Next-Generation Sequencing: A Case of Flowering, Pre-Harvest Sprouting, and Dehydration Resistant Genes in Wheat. BIOMED RESEARCH INTERNATIONAL 2016; 2016:3524908. [PMID: 27051662 PMCID: PMC4808660 DOI: 10.1155/2016/3524908] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 12/30/2015] [Accepted: 02/18/2016] [Indexed: 11/29/2022]
Abstract
Because many nonmodel plants lack genomic sequences, combining molecular technologies with the next generation sequencing (NGS) platform has led to a new approach for studying genetic variation in these plants. Software such as GATK, SOAPsnp, and samtools is often used to process NGS data. In this study, BLAST was applied to call SNPs from mixed sequence data of 16 functional genes of polyploid wheat. In total, 1.2 million reads were obtained, with an average of 7500 reads per gene. To obtain accurate information, 390,992 read pairs were successfully assembled before alignment to the functional genes. Standalone BLAST tools were used to map the assembled sequences to each functional gene. Polynomial fitting was applied to find a suitable minor allele frequency (MAF) threshold, 6%, for the assembled reads of each functional gene. SNP accuracy from assembled reads, pretrimmed reads, and original reads was compared, which showed that SNPs mined from the assembled reads were more reliable than the others. It was also demonstrated that sequencing mixed samples by NGS and then analyzing the data with BLAST is an effective, low-cost, and accurate way to mine SNPs for nonmodel species. Assembled reads and a polynomial-fitting threshold are recommended for more accurate SNP targeting.
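A minor allele frequency cutoff of the kind described in this abstract can be sketched as follows. This is an illustration only, not the authors' pipeline: the function name `call_snps`, the column-of-bases input format, and the definition of MAF as the frequency of the second most common base are assumptions.

```python
from collections import Counter

def call_snps(columns, maf_threshold=0.06):
    """Report positions whose minor allele frequency meets a threshold.

    columns: one string per reference position, holding the bases that
    aligned reads show at that position (a hypothetical input format).
    """
    snps = []
    for pos, bases in enumerate(columns):
        counts = Counter(bases)
        if len(counts) < 2:
            continue  # every read agrees, so there is no candidate SNP
        (major, _), (minor, minor_n) = counts.most_common(2)
        maf = minor_n / len(bases)
        if maf >= maf_threshold:
            snps.append((pos, major, minor, maf))
    return snps

# Position 1 has a 10% minor allele; position 2 only 3%.
calls = call_snps(["A" * 100, "A" * 90 + "G" * 10, "C" * 97 + "T" * 3])
```

With the 6% threshold, only position 1 is reported here; the 3% variant at position 2 is treated as likely sequencing noise.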

9
Chen TW, Gan RC, Chang YF, Liao WC, Wu TH, Lee CC, Huang PJ, Lee CY, Chen YYM, Chiu CH, Tang P. Is the whole greater than the sum of its parts? De novo assembly strategies for bacterial genomes based on paired-end sequencing. BMC Genomics 2015; 16:648. [PMID: 26315384 PMCID: PMC4552406 DOI: 10.1186/s12864-015-1859-8] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.9] [Received: 03/04/2015] [Accepted: 08/18/2015] [Indexed: 01/16/2023] Open
Abstract
BACKGROUND Whole genome sequence construction is becoming increasingly feasible because of advances in next generation sequencing (NGS), including increasing throughput and read length. By simply overlapping paired-end reads, we can obtain longer reads with higher accuracy, which can facilitate the assembly process. However, the influences of different library sizes and assembly methods on paired-end sequencing-based de novo assembly remain poorly understood. RESULTS We used 250 bp Illumina MiSeq paired-end reads of different library sizes generated from genomic DNA from Escherichia coli DH1 and Streptococcus parasanguinis FW213 to compare the assembly results of different library sizes and assembly approaches. Our data indicate that overlapping paired-end reads can increase read accuracy but sometimes causes insertions or deletions. Regarding genome assembly, merged reads only outcompete original paired-end reads when coverage depth is low, and larger libraries tend to yield better assembly results. These results imply that distance information is the most critical factor during assembly. Our results also indicate that when depth is sufficiently high, assembly from subsets can sometimes produce better results. CONCLUSIONS In summary, this study provides systematic evaluations of de novo assembly from paired-end sequencing data. Among the assembly strategies, we find that overlapping paired-end reads is not always beneficial for bacterial genome assembly and should be avoided or used with caution, especially for genomes containing a high fraction of repetitive sequences. Because increasing numbers of projects aim at bacterial genome sequencing, our study provides valuable suggestions for the field of genomic sequence construction.
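The read-overlapping step discussed in this abstract can be illustrated with a minimal exact-match sketch. It is not the tool or procedure used in the study: `revcomp`, `merge_pair`, and the `min_overlap` parameter are hypothetical names, and real read mergers tolerate mismatches and use base qualities.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    """Reverse complement of a DNA string over A, C, G, T."""
    return seq.translate(_COMPLEMENT)[::-1]

def merge_pair(fwd, rev, min_overlap=10):
    """Merge a read pair by the longest exact suffix/prefix overlap.

    Returns the merged sequence, or None when no overlap of at least
    `min_overlap` bases (assumed >= 1) is found.
    """
    rc = revcomp(rev)  # bring the reverse read onto the forward strand
    for olen in range(min(len(fwd), len(rc)), min_overlap - 1, -1):
        if fwd[-olen:] == rc[:olen]:
            return fwd + rc[olen:]
    return None

# A 13 bp fragment read from both ends with a 5 bp overlap.
fragment = "ACGTTGCAAGGCT"
merged = merge_pair(fragment[:9], revcomp(fragment[4:]), min_overlap=4)
```

Because the sketch accepts the longest matching overlap, a repeat near the read ends could produce a wrong merge, which is consistent with the caution the abstract raises for repetitive genomes.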
Affiliation(s)
- Ting-Wen Chen
  - Bioinformatics Core Laboratory, Molecular Medicine Research Center, Chang Gung University, Taoyuan, Taiwan
- Ruei-Chi Gan
  - Bioinformatics Core Laboratory, Molecular Medicine Research Center, Chang Gung University, Taoyuan, Taiwan
- Yi-Feng Chang
  - Bioinformatics Core Laboratory, Molecular Medicine Research Center, Chang Gung University, Taoyuan, Taiwan
  - Institute of Biomedical Informatics, National Yang-Ming University, Taipei, Taiwan
- Wei-Chao Liao
  - Bioinformatics Core Laboratory, Molecular Medicine Research Center, Chang Gung University, Taoyuan, Taiwan
- Chi-Ching Lee
  - Bioinformatics Core Laboratory, Molecular Medicine Research Center, Chang Gung University, Taoyuan, Taiwan
- Po-Jung Huang
  - Bioinformatics Core Laboratory, Molecular Medicine Research Center, Chang Gung University, Taoyuan, Taiwan
- Cheng-Yang Lee
  - Bioinformatics Core Laboratory, Molecular Medicine Research Center, Chang Gung University, Taoyuan, Taiwan
- Yi-Ywan M Chen
  - Department of Microbiology and Immunology, Chang Gung University, Taoyuan, Taiwan
  - Graduate Institute of Biomedical Sciences, Chang Gung University, Taoyuan, Taiwan
- Cheng-Hsun Chiu
  - Molecular Infectious Diseases Research Center, Chang Gung Memorial Hospital, Taoyuan, Taiwan
- Petrus Tang
  - Bioinformatics Core Laboratory, Molecular Medicine Research Center, Chang Gung University, Taoyuan, Taiwan
  - Graduate Institute of Biomedical Sciences, Chang Gung University, Taoyuan, Taiwan
  - Molecular Infectious Diseases Research Center, Chang Gung Memorial Hospital, Taoyuan, Taiwan

10
Abstract
Next generation sequencing (NGS) is an important process that makes it inexpensive to organize vast raw sequence datasets compared with traditional sequencing systems or methods. Aspects of NGS such as template preparation, sequencing and imaging, and genome alignment and assembly outline the genome sequencing and alignment process. The de Bruijn graph (dBG) is an important mathematical tool that graphically analyzes how orientations are constructed in groups of nucleotides. Basically, the dBG describes the formation of genome segments in a circular, iterative fashion. Some pivotal dBG-based de novo algorithms and software packages, such as T-IDBA, Oases, IDBA-tran, Euler, Velvet, ABySS, AllPaths, SOAPdenovo and SOAPdenovo2, are illustrated in this paper. Overlap layout consensus (OLC) graph-based algorithms also play a vital role in NGS assembly. Some important OLC-based algorithms, such as MIRA3, CABOG, Newbler, Edena, Mosaik and SHORTY, are portrayed in this paper. Greedy graph-based algorithms and software packages are also reported to be vital for proper genome dataset assembly; a few such algorithms, namely SSAKE, SHARCGS and VCAKE, help to perform proper genome sequencing.
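For readers unfamiliar with the de Bruijn graph named in this abstract, a minimal construction looks roughly like the sketch below. It follows the common textbook formulation (nodes are (k-1)-mers, each k-mer contributes one edge) rather than any of the listed assemblers; the function name `de_bruijn` is an assumption.

```python
from collections import defaultdict

def de_bruijn(reads, k):
    """Build a de Bruijn graph as an adjacency list.

    Each k-mer in a read adds an edge from its (k-1)-base prefix
    to its (k-1)-base suffix. Repeated k-mers add repeated edges.
    """
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)

# Two overlapping reads from the sequence ACGTC, with k = 3.
graph = de_bruijn(["ACGT", "CGTC"], 3)
```

An assembler would then look for a path that uses the edges of this graph (an Eulerian path in the idealized, error-free case) to spell out contigs; real tools add error correction and graph simplification on top.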
Affiliation(s)
- Sonia Farhana Nimmy
  - Department of Computer Science and Engineering, BGC Trust University, BGC Biddha Nagar, Chandanaish, Chittagong, Bangladesh
- M. S. Kamal
  - Department of Computer Science and Engineering, BGC Trust University, BGC Biddha Nagar, Chandanaish, Chittagong, Bangladesh

11
Feng W, Sang P, Lian D, Dong Y, Song F, Li M, He B, Cao F, Liu Y. ResSeq: Enhancing Short-Read Sequencing Alignment By Rescuing Error-Containing Reads. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2015; 12:795-798. [PMID: 26357318 DOI: 10.1109/tcbb.2014.2366103] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 06/05/2023]
Abstract
Next-generation short-read sequencing is widely utilized in genomic studies. Biological applications require an alignment step to map sequencing reads to the reference genome, before acquiring expected genomic information. This requirement makes alignment accuracy a key factor for effective biological interpretation. Normally, when accounting for measurement errors and single nucleotide polymorphisms, short read mappings with a few mismatches are generally considered acceptable. However, to further improve the efficiency of short-read sequencing alignment, we propose a method to retrieve additional reliably aligned reads (reads with more than a pre-defined number of mismatches), using a Bayesian-based approach. In this method, we first retrieve the sequence context around the mismatched nucleotides within the already aligned reads; these loci contain the genomic features where sequencing errors occur. Then, using the derived pattern, we evaluate the remaining (typically discarded) reads with more than the allowed number of mismatches, and calculate a score that represents the probability that a specific alignment is correct. This strategy allows the extraction of more reliably aligned reads, therefore improving alignment sensitivity. IMPLEMENTATION The source code of our tool, ResSeq, can be downloaded from: https://github.com/hrbeubiocenter/Resseq.

12
Abstract
Traditionally, microbial genome sequencing has been restricted to the small number of species that can be grown in pure culture. The progressive development of culture-independent methods over the last 15 years now allows researchers to sequence microbial communities directly from environmental samples. This approach is commonly referred to as "metagenomics" or "community genomics". However, the term metagenomics is applied liberally in the literature to describe any culture-independent analysis of microbial communities. Here, we define metagenomics as shotgun ("random") sequencing of the genomic DNA of a sample taken directly from the environment. The metagenome can be thought of as a sampling of the collective genome of the microbial community. We outline the considerations and analyses that should be undertaken to ensure the success of a metagenomic sequencing project, including the choice of sequencing platform and methods for assembly, binning, annotation, and comparative analysis.
Affiliation(s)
- Lauren Bragg
  - Advanced Water Management Centre, The University of Queensland, St. Lucia, QLD, Australia

13
Pu D, Qi Y, Cui L, Xiao P, Lu Z. A real-time decoding sequencing based on dual mononucleotide addition for cyclic synthesis. Anal Chim Acta 2014; 852:274-83. [PMID: 25441908 DOI: 10.1016/j.aca.2014.09.009] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.7] [Received: 06/02/2014] [Revised: 08/28/2014] [Accepted: 09/08/2014] [Indexed: 11/19/2022]
Abstract
We propose a real-time decoding sequencing strategy in which a template is determined not by directly measuring the base sequence but by decoding two sets of encodings obtained from two parallel sequencing runs. This strategy relies on adding a mixture of two different bases, A+G, C+T, A+C, G+T, A+T or C+G (abbreviated as AG, CT, AC, GT, AT, or CG), into the reaction each time. When a template is cyclically interrogated twice with any two kinds of dual mononucleotide addition (AG/CT, AC/GT, and AT/CG), two sets of encodings are obtained sequentially. The two sets of encodings allow the bases to be sequentially decoded, moving from first to last, in a deterministic manner. This strategy uses fewer cycles to obtain a longer read length compared to the traditional real-time sequencing strategy. A partial rnpB gene was used to verify the applicability of the decoding strategy via pyrosequencing. The results indicated that the sequence could be reconstructed by decoding two sets of encodings. Moreover, streptococcal strains could be differentiated by comparing the signal intensity in each cycle and the encoding size of each template. This strategy could likely be applied to differentiate nucleic acid sequences, as encoding size and signal intensity in each cycle vary with the base size and composition. Furthermore, it has the potential to serve as an alternative to conventional sequencing systems.
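The decoding principle in this abstract can be illustrated with an idealized sketch. It assumes noise-free signals where each flow reports exactly how many bases were incorporated, which is a simplification of real pyrosequencing data; the function names `encode`, `expand`, and `decode` are hypothetical and this is not the authors' software.

```python
def encode(template, flows):
    """Simulate one run: alternate two dual-base flows, e.g. ("AG", "CT"),
    and record how many consecutive template bases each flow extends."""
    if set(template) - set("ACGT"):
        raise ValueError("template must contain only A, C, G, T")
    signals, pos, flow_idx = [], 0, 0
    while pos < len(template):
        mix = flows[flow_idx % 2]
        n = 0
        while pos < len(template) and template[pos] in mix:
            n += 1
            pos += 1
        signals.append(n)  # only the first flow of a run can be 0
        flow_idx += 1
    return signals

def expand(signals, flows):
    """Label every template position with the flow mix that covered it."""
    labels = []
    for i, n in enumerate(signals):
        labels.extend([flows[i % 2]] * n)
    return labels

def decode(sig1, flows1, sig2, flows2):
    """Recover the bases: each position lies in one mix per run, and
    the two mixes share exactly one base."""
    run1, run2 = expand(sig1, flows1), expand(sig2, flows2)
    if len(run1) != len(run2):
        raise ValueError("the two encodings cover different lengths")
    bases = []
    for a, b in zip(run1, run2):
        (base,) = set(a) & set(b)  # e.g. AG and AC share only A
        bases.append(base)
    return "".join(bases)

AG_CT, AC_GT = ("AG", "CT"), ("AC", "GT")
template = "CAGGCTTA"
enc1 = encode(template, AG_CT)  # [0, 1, 3, 3, 1]
enc2 = encode(template, AC_GT)  # [2, 2, 1, 2, 1]
recovered = decode(enc1, AG_CT, enc2, AC_GT)
```

In this example the eight-base template is read with five flows per run, which illustrates why the abstract reports fewer cycles per decoded base than single-nucleotide addition.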
Affiliation(s)
- Dan Pu
  - State Key Laboratory of Bioelectronics, School of Biological Science and Medical Engineering, Southeast University, Nanjing 210096, China
- Yuhua Qi
  - Jiangsu Provincial Center for Disease Control and Prevention, Nanjing, Jiangsu 210009, China
- Lunbiao Cui
  - Jiangsu Provincial Center for Disease Control and Prevention, Nanjing, Jiangsu 210009, China
- Pengfeng Xiao
  - State Key Laboratory of Bioelectronics, School of Biological Science and Medical Engineering, Southeast University, Nanjing 210096, China
- Zuhong Lu
  - State Key Laboratory of Bioelectronics, School of Biological Science and Medical Engineering, Southeast University, Nanjing 210096, China

14
Liu B, Morrison CD, Johnson CS, Trump DL, Qin M, Conroy JC, Wang J, Liu S. Computational methods for detecting copy number variations in cancer genome using next generation sequencing: principles and challenges. Oncotarget 2014; 4:1868-81. [PMID: 24240121 PMCID: PMC3875755 DOI: 10.18632/oncotarget.1537] [Citation(s) in RCA: 66] [Impact Index Per Article: 6.6] [Indexed: 12/27/2022] Open
Abstract
Accurate detection of somatic copy number variations (CNVs) is an essential part of cancer genome analysis, and plays an important role in oncotarget identifications. Next generation sequencing (NGS) holds the promise to revolutionize somatic CNV detection. In this review, we provide an overview of current analytic tools used for CNV detection in NGS-based cancer studies. We summarize the NGS data types used for CNV detection, decipher the principles for data preprocessing, segmentation, and interpretation, and discuss the challenges in somatic CNV detection. This review aims to provide a guide to the analytic tools used in NGS-based cancer CNV studies, and to discuss the important factors that researchers need to consider when analyzing NGS data for somatic CNV detections.
Affiliation(s)
- Biao Liu
- Center for Personalized Medicine, Roswell Park Cancer Institute, Buffalo, NY
|
15
|
The eukaryotic genome, its reads, and the unfinished assembly. FEBS Lett 2013; 587:2090-3. [PMID: 23727201 DOI: 10.1016/j.febslet.2013.05.048] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 02/09/2013] [Revised: 05/09/2013] [Accepted: 05/20/2013] [Indexed: 11/21/2022]
Abstract
In recent years, readily affordable short read sequences provided by next-generation sequencing (NGS) have become longer and more accurate. This has led to a jump in interest in the utility of NGS-only approaches for exploring eukaryotic genomes. The concept of a static, 'finished' genome assembly, which still appears to be a faraway goal for many eukaryotes, is yielding to new paradigms. We here motivate an object-view concept where the raw reads are the main, fixed object, and assemblies with their annotations take a role of dynamically changing and modifiable views of that object.
|
16
|
Forde BM, O'Toole PW. Next-generation sequencing technologies and their impact on microbial genomics. Brief Funct Genomics 2013; 12:440-53. [PMID: 23314033 DOI: 10.1093/bfgp/els062] [Citation(s) in RCA: 41] [Impact Index Per Article: 3.7] [Indexed: 12/27/2022] Open
Abstract
Next-generation sequencing technologies have had a dramatic impact in the field of genomic research through the provision of a low cost, high-throughput alternative to traditional capillary sequencers. These new sequencing methods have surpassed their original scope and now provide a range of utility-based applications, which allow for a more comprehensive analysis of the structure and content of microbial genomes than was previously possible. With the commercialization of a third generation of sequencing technologies imminent, we discuss the applications of current next-generation sequencing methods and explore their impact on and contribution to microbial genome research.
Affiliation(s)
- Brian M Forde
- Department of Microbiology, University College Cork, Cork, Ireland.
|
17
|
Wang Z, Willard HF. Evidence for sequence biases associated with patterns of histone methylation. BMC Genomics 2012; 13:367. [PMID: 22857523 PMCID: PMC3532361 DOI: 10.1186/1471-2164-13-367] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Received: 06/09/2011] [Accepted: 07/18/2012] [Indexed: 11/19/2022] Open
Abstract
Background Combinations of histone variants and modifications, conceptually representing a histone code, have been proposed to play a significant role in gene regulation and developmental processes in complex organisms. While various mechanisms have been implicated in establishing and maintaining epigenetic patterns at specific locations in the genome, they are generally believed to be independent of primary DNA sequence on a more global scale. Results To address this systematically in the case of the human genome, we have analyzed primary DNA sequences underlying patterns of 19 different methylated histones in human primary T-cells and patterns of three methylated histones across additional human cell lines. We report strong sequence biases associated with most of these histone marks genome-wide in each cell type. Furthermore, the sequence characteristics for such association are distinct for different groups of histone marks. Conclusions These findings provide evidence of an influence of genomic sequence on patterns of histone modification associated with gene expression and chromatin programming, and they suggest that the mechanisms responsible for global histone modifications may interpret genomic sequence in various ways.
Affiliation(s)
- Zhong Wang
- Genome Biology Group, Duke Institute for Genome Sciences & Policy, Duke University, Durham, NC 27708, USA
|
18
|
Pellin D, Miotto P, Ambrosi A, Cirillo DM, Di Serio C. A genome-wide identification analysis of small regulatory RNAs in Mycobacterium tuberculosis by RNA-Seq and conservation analysis. PLoS One 2012; 7:e32723. [PMID: 22470422 PMCID: PMC3314655 DOI: 10.1371/journal.pone.0032723] [Citation(s) in RCA: 38] [Impact Index Per Article: 3.2] [Received: 09/02/2011] [Accepted: 02/03/2012] [Indexed: 12/29/2022] Open
Abstract
We propose a new method for small RNA (sRNA) identification. First we build an effective target genome (ETG) by means of a strand-specific procedure. Then we propose a new bioinformatic pipeline based mainly on the combination of two types of information: the first provides an expression map based on RNA-seq data (Reads Map) and the second applies principles of comparative genomics leading to a Conservation Map. By superimposing these two maps, a robust method for the search of sRNAs is obtained. We apply this methodology to investigate sRNAs in Mycobacterium tuberculosis H37Rv. This bioinformatic procedure leads to a total list of 1948 candidate sRNAs. The size of the candidate list is strictly related to the aim of the study and to the technology used during the verification process. We provide performance measures of the algorithm in identifying annotated sRNAs reported in three recently published studies.
Affiliation(s)
- Danilo Pellin
- University Centre for Statistics in the Biomedical Sciences, Università Vita-Salute San Raffaele, Milan, Italy
| | - Paolo Miotto
- Emerging Bacterial Pathogens Unit, San Raffaele Scientific Institute, Milan, Italy
| | - Alessandro Ambrosi
- University Centre for Statistics in the Biomedical Sciences, Università Vita-Salute San Raffaele, Milan, Italy
| | | | - Clelia Di Serio
- University Centre for Statistics in the Biomedical Sciences, Università Vita-Salute San Raffaele, Milan, Italy
|
19
|
Schulz MH, Zerbino DR, Vingron M, Birney E. Oases: robust de novo RNA-seq assembly across the dynamic range of expression levels. Bioinformatics 2012; 28:1086-92. [PMID: 22368243 PMCID: PMC3324515 DOI: 10.1093/bioinformatics/bts094] [Citation(s) in RCA: 1009] [Impact Index Per Article: 84.1] [Indexed: 12/22/2022] Open
Abstract
Motivation: High-throughput sequencing has made the analysis of new model organisms more affordable. Although assembling a new genome can still be costly and difficult, it is possible to use RNA-seq to sequence mRNA. In the absence of a known genome, it is necessary to assemble these sequences de novo, taking into account possible alternative isoforms and the dynamic range of expression values. Results: We present a software package named Oases designed to heuristically assemble RNA-seq reads in the absence of a reference genome, across a broad spectrum of expression values and in presence of alternative isoforms. It achieves this by using an array of hash lengths, a dynamic filtering of noise, a robust resolution of alternative splicing events and the efficient merging of multiple assemblies. It was tested on human and mouse RNA-seq data and is shown to improve significantly on the transABySS and Trinity de novo transcriptome assemblers. Availability and implementation: Oases is freely available under the GPL license at www.ebi.ac.uk/~zerbino/oases/ Contact: dzerbino@ucsc.edu Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Marcel H Schulz
- Department of Computational Molecular Biology, Max Planck Institute for Molecular Genetics, Berlin, Germany
|
20
|
Lassen KS, Schultz H, Heegaard NHH, He M. A novel DNAseq program for enhanced analysis of Illumina GAII data: a case study on antibody complementarity-determining regions. N Biotechnol 2012; 29:271-8. [PMID: 22155428 DOI: 10.1016/j.nbt.2011.11.014] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 08/24/2011] [Revised: 11/09/2011] [Accepted: 11/25/2011] [Indexed: 11/16/2022]
Abstract
High-throughput DNA sequencing technologies are increasingly becoming powerful systems for the comprehensive analysis of variations in whole genomes or various DNA libraries. As they are capable of producing massive collections of short sequences with varying lengths, a major challenge is how to turn these reads into biologically meaningful information. The first stage is to assemble the short reads into longer sequences through an in silico process. However, currently available programs allow only the assembly of abundant sequences, which results in the loss of highly variable (or rare) sequences or creates artefact assemblies. In this paper, we describe a novel program (DNAseq) that is capable of assembling highly variable sequences and displaying them directly for phylogenetic analysis. In addition, this program is Microsoft Windows-based and runs on a standard PC with 700 MB of RAM for general use. We have applied it to analyse a human naive single-chain antibody (scFv) library, comprehensively revealing the diversity of antibody variable complementarity-determining regions (CDRs) and their families. Although only a scFv library was exemplified here, we envisage that this program could be applicable to other genome libraries.
Affiliation(s)
- Klaus S Lassen
- Department of Clinical Biochemistry and Immunology, Statens Serum Institut, Artillerivej 5, 2300 Copenhagen S, Denmark.
|
21
|
Derrien T, Estellé J, Marco Sola S, Knowles DG, Raineri E, Guigó R, Ribeca P. Fast computation and applications of genome mappability. PLoS One 2012; 7:e30377. [PMID: 22276185 PMCID: PMC3261895 DOI: 10.1371/journal.pone.0030377] [Citation(s) in RCA: 327] [Impact Index Per Article: 27.3] [Received: 01/28/2011] [Accepted: 12/19/2011] [Indexed: 01/17/2023] Open
Abstract
We present a fast mapping-based algorithm to compute the mappability of each region of a reference genome up to a specified number of mismatches. Knowing the mappability of a genome is crucial for the interpretation of massively parallel sequencing experiments. We investigate the properties of the mappability of eukaryotic DNA/RNA both as a whole and at the level of the gene family, providing for various organisms tracks which allow the mappability information to be visually explored. In addition, we show that mappability varies greatly between species and gene classes. Finally, we suggest several practical applications where mappability can be used to refine the analysis of high-throughput sequencing data (SNP calling, gene expression quantification and paired-end experiments). This work highlights mappability as an important concept which deserves to be taken into full account, in particular when massively parallel sequencing technologies are employed. The GEM mappability program belongs to the GEM (GEnome Multitool) suite of programs, which can be freely downloaded for any use from its website (http://gemlibrary.sourceforge.net).
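As a rough illustration of the mappability concept described in this abstract, the following Python sketch scores each k-mer of a toy sequence as the reciprocal of its number of occurrences. It rests on simplifying assumptions: exact matches only (the published algorithm allows a specified number of mismatches), forward strand only, and a hypothetical function name `kmer_mappability` and toy sequence. It is not the GEM implementation.

```python
from collections import Counter


def kmer_mappability(genome: str, k: int) -> list[float]:
    """Score each k-mer start position as 1 / (number of exact occurrences).

    Assumption for illustration: exact matching on the forward strand only.
    A score of 1.0 means the k-mer is unique in the sequence.
    """
    kmers = [genome[i:i + k] for i in range(len(genome) - k + 1)]
    counts = Counter(kmers)
    return [1.0 / counts[kmer] for kmer in kmers]


# Hypothetical toy sequence: the 4-mer "ACGT" occurs twice (positions 0 and 4).
toy = "ACGTACGTTTGA"
scores = kmer_mappability(toy, 4)
print(scores)
```

Positions whose score falls below 1.0 mark k-mers that a read of that length could not be placed uniquely, which is the kind of region the abstract suggests accounting for in downstream analyses.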
Affiliation(s)
- Thomas Derrien
- Institut de Génétique et Développement (IGDR), Université Rennes 1, Rennes, France
| | - Jordi Estellé
- Centro Nacional de Análisis Genómico (CNAG), Barcelona, Spain
| | | | - David G. Knowles
- Centre for Genomic Regulation (CRG), Universitat Pompeu Fabra, Barcelona, Spain
| | | | - Roderic Guigó
- Centre for Genomic Regulation (CRG), Universitat Pompeu Fabra, Barcelona, Spain
| | - Paolo Ribeca
- Centro Nacional de Análisis Genómico (CNAG), Barcelona, Spain
|
22
|
Gene fragmentation in bacterial draft genomes: extent, consequences and mitigation. BMC Genomics 2012; 13:14. [PMID: 22233127 PMCID: PMC3322347 DOI: 10.1186/1471-2164-13-14] [Citation(s) in RCA: 51] [Impact Index Per Article: 4.3] [Received: 10/03/2011] [Accepted: 01/10/2012] [Indexed: 12/11/2022] Open
Abstract
BACKGROUND Ongoing technological advances in genome sequencing are allowing bacterial genomes to be sequenced at ever-lower cost. However, nearly all of these new techniques concomitantly decrease genome quality, primarily due to the inability of their relatively short read lengths to bridge certain genomic regions, e.g., those containing repeats. Fragmentation of predicted open reading frames (ORFs) is one possible consequence of this decreased quality. In this study we quantify ORF fragmentation in draft microbial genomes and its effect on annotation efficacy, and we propose a solution to ameliorate this problem. RESULTS A survey of draft-quality genomes in GenBank revealed that fragmented ORFs comprised > 80% of the predicted ORFs in some genomes, and that increased fragmentation correlated with decreased genome assembly quality. In a more thorough analysis of 25 Streptomyces genomes, fragmentation was especially enriched in some protein classes with repeating, multi-modular structures such as polyketide synthases, non-ribosomal peptide synthetases and serine/threonine kinases. Overall, increased genome fragmentation correlated with increased false-negative Pfam and COG annotation rates and increased false-positive KEGG annotation rates. The false-positive KEGG annotation rate could be ameliorated by linking fragmented ORFs using their orthologs in related genomes. Whereas this strategy successfully linked up to 46% of the total ORF fragments in some genomes, its sensitivity appeared to depend heavily on the depth of sampling of a particular taxon's variable genome. CONCLUSIONS Draft microbial genomes contain many ORF fragments. Where these correspond to the same gene they have particular potential to confound comparative gene content analyses. Given our findings, and the rapid increase in the number of microbial draft quality genomes, we suggest that accounting for gene fragmentation and its associated biases is important when designing comparative genomic projects.
|
23
|
Ji Y, Shi Y, Ding G, Li Y. A new strategy for better genome assembly from very short reads. BMC Bioinformatics 2011; 12:493. [PMID: 22208765 PMCID: PMC3268122 DOI: 10.1186/1471-2105-12-493] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Received: 08/25/2011] [Accepted: 12/30/2011] [Indexed: 11/29/2022] Open
Abstract
Background With the rapid development of the next generation sequencing (NGS) technology, large quantities of genome sequencing data have been generated. Because of repetitive regions of genomes and some other factors, assembly of very short reads is still a challenging issue. Results A novel strategy for improving genome assembly from very short reads is proposed. It can increase accuracies of assemblies by integrating de novo contigs, and produce comparative contigs by allowing multiple references without limiting to genomes of closely related strains. Comparative contigs are used to scaffold de novo contigs. Using simulated and real datasets, it is shown that our strategy can effectively improve qualities of assemblies of isolated microbial genomes and metagenomes. Conclusions With more and more reference genomes available, our strategy will be useful to improve qualities of genome assemblies from very short reads. Some scripts are provided to make our strategy applicable at http://code.google.com/p/cd-hybrid/.
Affiliation(s)
- Yan Ji
- Bioinformatics Center, Key Laboratory of Systems Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai 200031, PR China
|
24
|
Nunes MCS, Wanner EF, Weber G. Origin of multiple periodicities in the Fourier power spectra of the Plasmodium falciparum genome. BMC Genomics 2011; 12 Suppl 4:S4. [PMID: 22369134 PMCID: PMC3287587 DOI: 10.1186/1471-2164-12-s4-s4] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 01/17/2023] Open
Abstract
Background Fourier transforms and their associated power spectra are used for detecting periodicities and protein-coding genes and are generally regarded as a well-established technique. Many of the periodicities which have been found with this method are quite well understood, such as the periodicity of 3 nt, which is associated with codon usage. But what is the origin of the peculiar frequency multiples k/21 which were reported for a tiny section of chromosome 2 in P. falciparum? Are these present in other chromosomes and perhaps in related organisms? And how should we interpret fractional periodicities in genomes? Results We applied the binary indicator power spectrum to all chromosomes of P. falciparum, and found that the frequency overtones k/21 are present only in non-coding sections. We did not find such frequency overtones in any other related genomes. Furthermore, the frequency overtones were identified as artifacts of the way the genome is encoded into a numerical sequence, that is, they are frequency aliases. With a different encoding of the sequence, the overtones do not appear. In view of these results, we revisited early applications of this technique to proteins where frequency overtones were reported. Conclusions Some authors hinted recently at the possibility of mapping artifacts and frequency aliases in power spectra. However, in the case of P. falciparum the frequency aliases are particularly strong and can mask the 1/3 frequency which is used for gene detection. This shows that, despite this being a well-known technique with a long history of application in proteins, few researchers seem to be aware of the problems represented by frequency aliases.
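To make the binary indicator power spectrum mentioned in this abstract more concrete, here is a minimal Python sketch. It assumes the common textbook formulation, in which each of the four bases is turned into a 0/1 indicator sequence and the squared magnitudes of their discrete Fourier coefficients are summed. The function name `indicator_power` and the toy sequence are hypothetical, and the code does not reproduce the authors' analysis or their alternative encoding.

```python
import cmath


def indicator_power(seq: str, k: int) -> float:
    """Total power at DFT index k, summed over the four base indicators.

    For each base, the indicator is 1 where the base occurs and 0 elsewhere,
    so only positions holding that base contribute to the Fourier sum.
    """
    n = len(seq)
    total = 0.0
    for base in "ACGT":
        coeff = sum(
            cmath.exp(-2j * cmath.pi * k * pos / n)
            for pos, symbol in enumerate(seq)
            if symbol == base
        )
        total += abs(coeff) ** 2
    return total


# Hypothetical toy input: a sequence with an exact period of 3 nt.
seq = "ATG" * 20
n = len(seq)
peak = indicator_power(seq, n // 3)      # DFT index for frequency 1/3
off_peak = indicator_power(seq, n // 4)  # DFT index for frequency 1/4
print(peak, off_peak)
```

On this idealized input all of the non-constant power sits at frequency 1/3 and essentially none at 1/4. Real genomic sequence is far noisier, and, as the abstract cautions, the choice of numerical encoding can itself introduce spurious peaks.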
Affiliation(s)
- Miriam C S Nunes
- Department of Biological Sciences, Federal University of Ouro Preto, 35400-000 Ouro Preto, MG, Brazil
|
25
|
Hampton M, Melvin RG, Kendall AH, Kirkpatrick BR, Peterson N, Andrews MT. Deep sequencing the transcriptome reveals seasonal adaptive mechanisms in a hibernating mammal. PLoS One 2011; 6:e27021. [PMID: 22046435 PMCID: PMC3203946 DOI: 10.1371/journal.pone.0027021] [Citation(s) in RCA: 74] [Impact Index Per Article: 5.7] [Received: 07/24/2011] [Accepted: 10/07/2011] [Indexed: 11/19/2022] Open
Abstract
Mammalian hibernation is a complex phenotype involving metabolic rate reduction, bradycardia, profound hypothermia, and a reliance on stored fat that allows the animal to survive for months without food in a state of suspended animation. To determine the genes responsible for this phenotype in the thirteen-lined ground squirrel (Ictidomys tridecemlineatus) we used the Roche 454 platform to sequence mRNA isolated at six points throughout the year from three key tissues: heart, skeletal muscle, and white adipose tissue (WAT). Deep sequencing generated approximately 3.7 million cDNA reads from 18 samples (6 time points × 3 tissues) with a mean read length of 335 bases. Of these, 3,125,337 reads were assembled into 140,703 contigs. Approximately 90% of all sequences were matched to proteins in the human UniProt database. The total number of distinct human proteins matched by ground squirrel transcripts was 13,637 for heart, 12,496 for skeletal muscle, and 14,351 for WAT. Extensive mitochondrial RNA sequences enabled a novel approach of using the transcriptome to construct the complete mitochondrial genome for I. tridecemlineatus. Seasonal and activity-specific changes in mRNA levels that met our stringent false discovery rate cutoff (1.0 × 10^-11) were used to identify patterns of gene expression involving various aspects of the hibernation phenotype. Among these patterns are differentially expressed genes encoding heart proteins AT1A1, NAC1 and RYR2 controlling ion transport required for contraction and relaxation at low body temperatures. Abundant RNAs in skeletal muscle coding ubiquitin pathway proteins ASB2, UBC and DDB1 peak in October, suggesting an increase in muscle proteolysis. Finally, genes in WAT that encode proteins involved in lipogenesis (ACOD, FABP4) are highly expressed in August, but gradually decline in expression during the seasonal transition to lipolysis.
Affiliation(s)
- Marshall Hampton
- Department of Mathematics and Statistics, University of Minnesota Duluth, Duluth, Minnesota, United States of America
| | - Richard G. Melvin
- Department of Biology, University of Minnesota Duluth, Duluth, Minnesota, United States of America
| | - Anne H. Kendall
- Department of Biology, University of Minnesota Duluth, Duluth, Minnesota, United States of America
| | - Brian R. Kirkpatrick
- Department of Biology, University of Minnesota Duluth, Duluth, Minnesota, United States of America
| | - Nichole Peterson
- BioMedical Genomics Center, University of Minnesota, Saint Paul, Minnesota, United States of America
| | - Matthew T. Andrews
- Department of Biology, University of Minnesota Duluth, Duluth, Minnesota, United States of America
|
26
|
Abstract
Background Metagenomic assembly is a challenging problem due to the presence of genetic material from multiple organisms. The problem becomes even more difficult when short reads produced by next generation sequencing technologies are used. Although whole genome assemblers are not designed to assemble metagenomic samples, they are being used for metagenomics due to the lack of assemblers capable of dealing with metagenomic samples. We present an evaluation of assembly of simulated short-read metagenomic samples using a state-of-the-art de Bruijn graph based assembler. Results We assembled simulated metagenomic reads from datasets of various complexities using a state-of-the-art de Bruijn graph based parallel assembler. We have also studied the effect of k-mer size used in de Bruijn graph on metagenomic assembly and developed a clustering solution to pool the contigs obtained from different assembly runs, which allowed us to obtain longer contigs. We have also assessed the degree of chimericity of the assembled contigs using an entropy/impurity metric and compared the metagenomic assemblies to assemblies of isolated individual source genomes. Conclusions Our results show that accuracy of the assembled contigs was better than expected for the metagenomic samples with a few dominant organisms and was especially poor in samples containing many closely related strains. Clustering contigs from different k-mer parameters of the de Bruijn graph allowed us to obtain longer contigs; however, the clustering resulted in accumulation of erroneous contigs, thus increasing the error rate in clustered contigs.
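The abstract does not spell out its entropy/impurity metric, so the following Python sketch only illustrates one plausible reading: the Shannon entropy of the source-genome labels of the reads that make up a contig. The function name `contig_entropy`, the label format, and the example contigs are assumptions made for illustration, not the authors' definition.

```python
import math
from collections import Counter


def contig_entropy(read_sources: list[str]) -> float:
    """Shannon entropy (bits) of the source-genome labels within one contig.

    Assumed interpretation: 0.0 means every read comes from a single genome
    (a pure contig); larger values indicate a more chimeric contig.
    """
    counts = Counter(read_sources)
    total = len(read_sources)
    return sum(
        (count / total) * math.log2(total / count)
        for count in counts.values()
    )


# Hypothetical contigs built from simulated reads with known origins.
pure = contig_entropy(["genome_A"] * 4)
mixed = contig_entropy(["genome_A", "genome_A", "genome_B", "genome_B"])
print(pure, mixed)
```

A metric of this kind is only computable on simulated data such as that described in the abstract, because it requires knowing the true source genome of every read.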
Affiliation(s)
- Anveshi Charuvaka
- Computer Science Department, George Mason University, Fairfax, Virginia, USA
|
27
|
Straub SCK, Fishbein M, Livshultz T, Foster Z, Parks M, Weitemier K, Cronn RC, Liston A. Building a model: developing genomic resources for common milkweed (Asclepias syriaca) with low coverage genome sequencing. BMC Genomics 2011; 12:211. [PMID: 21542930 PMCID: PMC3116503 DOI: 10.1186/1471-2164-12-211] [Citation(s) in RCA: 92] [Impact Index Per Article: 7.1] [Received: 02/26/2011] [Accepted: 05/04/2011] [Indexed: 01/05/2023] Open
Abstract
Background Milkweeds (Asclepias L.) have been extensively investigated in diverse areas of evolutionary biology and ecology; however, there are few genetic resources available to facilitate and complement these studies. This study explored how low coverage genome sequencing of the common milkweed (Asclepias syriaca L.) could be useful in characterizing the genome of a plant without prior genomic information and for development of genomic resources as a step toward further developing A. syriaca as a model in ecology and evolution. Results A 0.5× genome of A. syriaca was produced using Illumina sequencing. A virtually complete chloroplast genome of 158,598 bp was assembled, revealing few repeats and loss of three genes: accD, clpP, and ycf1. A nearly complete rDNA cistron (18S-5.8S-26S; 7,541 bp) and 5S rDNA (120 bp) sequence were obtained. Assessment of polymorphism revealed that the rDNA cistron and 5S rDNA had 0.3% and 26.7% polymorphic sites, respectively. A partial mitochondrial genome sequence (130,764 bp), with identical gene content to tobacco, was also assembled. An initial characterization of repeat content indicated that Ty1/copia-like retroelements are the most common repeat type in the milkweed genome. At least one A. syriaca microread hit 88% of Catharanthus roseus (Apocynaceae) unigenes (median coverage of 0.29×) and 66% of single copy orthologs (COSII) in asterids (median coverage of 0.14×). From this partial characterization of the A. syriaca genome, markers for population genetics (microsatellites) and phylogenetics (low-copy nuclear genes) studies were developed. Conclusions The results highlight the promise of next generation sequencing for development of genomic resources for any organism. Low coverage genome sequencing allows characterization of the high copy fraction of the genome and exploration of the low copy fraction of the genome, which facilitate the development of molecular tools for further study of a target species and its relatives. This study represents a first step in the development of a community resource for further study of plant-insect co-evolution, anti-herbivore defense, floral developmental genetics, reproductive biology, chemical evolution, population genetics, and comparative genomics using milkweeds, and A. syriaca in particular, as ecological and evolutionary models.
Affiliation(s)
- Shannon C K Straub
- Department of Botany and Plant Pathology, Oregon State University, Corvallis, Oregon 97331, USA.
|
28
|
van Oeveren J, de Ruiter M, Jesse T, van der Poel H, Tang J, Yalcin F, Janssen A, Volpin H, Stormo KE, Bogden R, van Eijk MJT, Prins M. Sequence-based physical mapping of complex genomes by whole genome profiling. Genome Res 2011; 21:618-25. [PMID: 21324881 DOI: 10.1101/gr.112094.110] [Citation(s) in RCA: 72] [Impact Index Per Article: 5.5] [Indexed: 12/14/2022]
Abstract
We present whole genome profiling (WGP), a novel next-generation sequencing-based physical mapping technology for construction of bacterial artificial chromosome (BAC) contigs of complex genomes, using Arabidopsis thaliana as an example. WGP leverages short read sequences derived from restriction fragments of two-dimensionally pooled BAC clones to generate sequence tags. These sequence tags are assigned to individual BAC clones, followed by assembly of BAC contigs based on shared regions containing identical sequence tags. Following in silico analysis of WGP sequence tags and simulation of a map of Arabidopsis chromosome 4 and maize, a WGP map of Arabidopsis thaliana ecotype Columbia was constructed de novo using a six-genome equivalent BAC library. Validation of the WGP map using the Columbia reference sequence confirmed that 350 BAC contigs (98%) were assembled correctly, spanning 97% of the 102-Mb calculated genome coverage. We demonstrate that WGP maps can also be generated for more complex plant genomes and will serve as excellent scaffolds to anchor genetic linkage maps and integrate whole genome sequence data.
|
29
|
Koehler R, Issac H, Cloonan N, Grimmond SM. The uniqueome: a mappability resource for short-tag sequencing. Bioinformatics 2010; 27:272-4. [PMID: 21075741 PMCID: PMC3018812 DOI: 10.1093/bioinformatics/btq640] [Citation(s) in RCA: 56] [Impact Index Per Article: 4.0] [Indexed: 01/22/2023]
Abstract
Summary: Quantification applications of short-tag sequencing data (such as CNVseq and RNAseq) depend on knowing the uniqueness of specific genomic regions at a given threshold of error. Here, we present the ‘uniqueome’, a genomic resource for understanding the uniquely mappable proportion of genomic sequences. Pre-computed data are available for human, mouse, fly and worm genomes in both color-space and nucleotide-space, and we demonstrate the utility of this resource as applied to the quantification of RNAseq data. Availability: Files, scripts and supplementary data are available from http://grimmond.imb.uq.edu.au/uniqueome/; the ISAS uniqueome aligner is freely available from http://www.imagenix.com/. Contact: n.cloonan@uq.edu.au Supplementary information: Supplementary data are available at Bioinformatics online.
|
30
|
Paszkiewicz K, Studholme DJ. De novo assembly of short sequence reads. Brief Bioinform 2010; 11:457-72. [DOI: 10.1093/bib/bbq020] [Citation(s) in RCA: 134] [Impact Index Per Article: 9.6] [Indexed: 11/14/2022] Open
|
31
|
Simulation of ChIP-Seq based on extra-sonication of IPed DNA fragments. CHINESE SCIENCE BULLETIN-CHINESE 2010. [DOI: 10.1007/s11434-010-3013-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/19/2022]
|
32
|
Read length and repeat resolution: exploring prokaryote genomes using next-generation sequencing technologies. PLoS One 2010; 5:e11518. [PMID: 20634954 PMCID: PMC2902515 DOI: 10.1371/journal.pone.0011518] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.8] [Received: 04/21/2010] [Accepted: 05/31/2010] [Indexed: 11/19/2022] Open
Abstract
Background There are a growing number of next-generation sequencing technologies. At present, the most cost-effective options also produce the shortest reads. However, even for prokaryotes, there is uncertainty concerning the utility of these technologies for the de novo assembly of complete genomes. This reflects an expectation that short reads will be unable to resolve small, but presumably abundant, repeats. Methodology/Principal Findings Using a simple model of repeat assembly, we develop and test a technique that, for any read length, can estimate the occurrence of unresolvable repeats in a genome, and thus predict the number of gaps that would need to be closed to produce a complete sequence. We apply this technique to 818 prokaryote genome sequences. This provides a quantitative assessment of the relative performance of various lengths. Notably, unpaired reads of only 150nt can reconstruct approximately 50% of the analysed genomes with fewer than 96 repeat-induced gaps. Nonetheless, there is considerable variation amongst prokaryotes. Some genomes can be assembled to near contiguity using very short reads while others require much longer reads. Conclusions Given the diversity of prokaryote genomes, a sequencing strategy should be tailored to the organism under study. Our results will provide researchers with a practical resource to guide the selection of the appropriate read length.
|
33
|
De novo assembly of a 40 Mb eukaryotic genome from short sequence reads: Sordaria macrospora, a model organism for fungal morphogenesis. PLoS Genet 2010; 6:e1000891. [PMID: 20386741 PMCID: PMC2851567 DOI: 10.1371/journal.pgen.1000891] [Citation(s) in RCA: 140] [Impact Index Per Article: 10.0] [Received: 12/24/2009] [Accepted: 03/02/2010] [Indexed: 01/09/2023] Open
Abstract
Filamentous fungi are of great importance in ecology, agriculture, medicine, and biotechnology. Thus, it is not surprising that genomes for more than 100 filamentous fungi have been sequenced, most of them by Sanger sequencing. While next-generation sequencing techniques have revolutionized genome resequencing, e.g. for strain comparisons, genetic mapping, or transcriptome and ChIP analyses, de novo assembly of eukaryotic genomes still presents significant hurdles, because of their large size and stretches of repetitive sequences. Filamentous fungi contain few repetitive regions in their 30-90 Mb genomes and thus are suitable candidates to test de novo genome assembly from short sequence reads. Here, we present a high-quality draft sequence of the Sordaria macrospora genome that was obtained by a combination of Illumina/Solexa and Roche/454 sequencing. Paired-end Solexa sequencing of genomic DNA to 85-fold coverage and an additional 10-fold coverage by single-end 454 sequencing resulted in approximately 4 Gb of DNA sequence. Reads were assembled to a 40 Mb draft version (N50 of 117 kb) with the Velvet assembler. Comparative analysis with Neurospora genomes increased the N50 to 498 kb. The S. macrospora genome contains even fewer repeat regions than its closest sequenced relative, Neurospora crassa. Comparison with genomes of other fungi showed that S. macrospora, a model organism for morphogenesis and meiosis, harbors duplications of several genes involved in self/nonself-recognition. Furthermore, S. macrospora contains more polyketide biosynthesis genes than N. crassa. Phylogenetic analyses suggest that some of these genes may have been acquired by horizontal gene transfer from a distantly related ascomycete group. 
Our study shows that, for typical filamentous fungi, de novo assembly of genomes from short sequence reads alone is feasible, that a mixture of Solexa and 454 sequencing substantially improves the assembly, and that the resulting data can be used for comparative studies to address basic questions of fungal biology.
|
34
|
Miller JR, Koren S, Sutton G. Assembly algorithms for next-generation sequencing data. Genomics 2010; 95:315-27. [PMID: 20211242 DOI: 10.1016/j.ygeno.2010.03.001] [Citation(s) in RCA: 621] [Impact Index Per Article: 44.4] [Received: 08/04/2009] [Revised: 02/26/2010] [Accepted: 03/02/2010] [Indexed: 01/08/2023]
Abstract
The emergence of next-generation sequencing platforms led to resurgence of research in whole-genome shotgun assembly algorithms and software. DNA sequencing data from the Roche 454, Illumina/Solexa, and ABI SOLiD platforms typically present shorter read lengths, higher coverage, and different error profiles compared with Sanger sequencing data. Since 2005, several assembly software packages have been created or revised specifically for de novo assembly of next-generation sequencing data. This review summarizes and compares the published descriptions of packages named SSAKE, SHARCGS, VCAKE, Newbler, Celera Assembler, Euler, Velvet, ABySS, AllPaths, and SOAPdenovo. More generally, it compares the two standard methods known as the de Bruijn graph approach and the overlap/layout/consensus approach to assembly.
Affiliation(s)
- Jason R Miller
- J. Craig Venter Institute, Rockville, MD 20850-3343, USA.
|
35
|
Webb KM, Rosenthal BM. Deep resequencing of Trichinella spiralis reveals previously un-described single nucleotide polymorphisms and intra-isolate variation within the mitochondrial genome. INFECTION GENETICS AND EVOLUTION 2010; 10:304-10. [PMID: 20083232 DOI: 10.1016/j.meegid.2010.01.003] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.1] [Received: 10/28/2009] [Revised: 12/23/2009] [Accepted: 01/11/2010] [Indexed: 11/25/2022]
Abstract
The phylogeny and historical dispersal of Trichinella spp. have been studied, in part, by sequencing portions of the mitochondrial genome. Such studies rely on two untested beliefs: that variation in a portion is representative of the entire mitochondrial genome, and that each isolate is characterized by only one mitochondrial haplotype. We have used next generation DNA sequencing technology to obtain the complete mitochondrial genome sequence from a second isolate of T. spiralis. By aligning it to the only previously sequenced genome, we sought to establish whether the exceptionally deep sequencing coverage provided by such an approach could detect regions of the genome which had been misassembled, or nucleotide positions which may vary within an isolate. The new data broadly confirm the gene order and sequence assembly for protein-coding regions. However, in the repetitive non-coding region, alignment to the previously published genome sequence proved difficult. Such discrepancies may represent true biological variation, but may rather result from methodological or algorithmic sources. Within the 13,902bp protein-coding region, 7 polymorphisms were identified. Six of these polymorphisms occurred within protein-coding genes and three alter an amino acid sequence, one occurred in a tRNA-Ile sequence, and four were found to vary within our isolate. Thus, comparing only two isolates of T. spiralis has enabled the discovery of previously unrecognized variation within the species. Characterizing diversity within and among the mitochondrial genomes of additional species of Trichinella would undoubtedly yield further insights into the diversification history of the genus. Our study affirms that next generation DNA sequencing technology can reliably characterize a complete mitochondrial genome.
Affiliation(s)
- Kristen M Webb
- Animal Parasitic Diseases Laboratory, Agricultural Research Service, United States Department of Agriculture, Building 1180, BARC-East, Beltsville, MD 20705, USA
|
36
|
Kingsford C, Schatz MC, Pop M. Assembly complexity of prokaryotic genomes using short reads. BMC Bioinformatics 2010; 11:21. [PMID: 20064276 PMCID: PMC2821320 DOI: 10.1186/1471-2105-11-21] [Citation(s) in RCA: 85] [Impact Index Per Article: 6.1] [Received: 09/03/2009] [Accepted: 01/12/2010] [Indexed: 01/08/2023] Open
Abstract
Background De Bruijn graphs are a theoretical framework underlying several modern genome assembly programs, especially those that deal with very short reads. We describe an application of de Bruijn graphs to analyze the global repeat structure of prokaryotic genomes. Results We provide the first survey of the repeat structure of a large number of genomes. The analysis gives an upper-bound on the performance of genome assemblers for de novo reconstruction of genomes across a wide range of read lengths. Further, we demonstrate that the majority of genes in prokaryotic genomes can be reconstructed uniquely using very short reads even if the genomes themselves cannot. The non-reconstructible genes are overwhelmingly related to mobile elements (transposons, IS elements, and prophages). Conclusions Our results improve upon previous studies on the feasibility of assembly with short reads and provide a comprehensive benchmark against which to compare the performance of the short-read assemblers currently being developed.
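As a rough illustration of the de Bruijn graph idea this abstract relies on, the following minimal Python sketch builds a graph whose nodes are (k-1)-mers and whose edges are the k-mers observed in the reads, then lists nodes with more than one successor. Such branching nodes are one simple signal of repeats that reads of that length cannot resolve. The function names `de_bruijn` and `branching_nodes` are hypothetical, and this is not the authors' analysis pipeline.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Build a de Bruijn graph: nodes are (k-1)-mers, edges are observed k-mers."""
    graph = defaultdict(set)
    for read in reads:
        # Slide a window of width k along the read.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # The k-mer links its prefix (k-1)-mer to its suffix (k-1)-mer.
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def branching_nodes(graph):
    """Return nodes with more than one distinct successor, in sorted order."""
    return sorted(node for node, successors in graph.items() if len(successors) > 1)


if __name__ == "__main__":
    g = de_bruijn(["ACGTACGA"], k=3)
    print(branching_nodes(g))
```

In this toy input the 2-mer `CG` is followed by both `T` and `A`, so it is reported as a branching node; a larger `k` would remove the ambiguity, which mirrors the read-length dependence discussed in the abstract.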
Affiliation(s)
- Carl Kingsford
- Department of Computer Science, Institute for Advanced Computer Studies, University of Maryland, College Park, MD, USA.
|
37
|
Zerbino DR, McEwen GK, Margulies EH, Birney E. Pebble and rock band: heuristic resolution of repeats and scaffolding in the velvet short-read de novo assembler. PLoS One 2009; 4:e8407. [PMID: 20027311 PMCID: PMC2793427 DOI: 10.1371/journal.pone.0008407] [Citation(s) in RCA: 150] [Impact Index Per Article: 10.0] [Received: 09/04/2009] [Accepted: 10/21/2009] [Indexed: 11/22/2022] Open
Abstract
Background Despite the short length of their reads, micro-read sequencing technologies have shown their usefulness for de novo sequencing. However, especially in eukaryotic genomes, complex repeat patterns are an obstacle to large assemblies. Principal Findings We present a novel heuristic algorithm, Pebble, which uses paired-end read information to resolve repeats and scaffold contigs to produce large-scale assemblies. In simulations, we can achieve weighted median scaffold lengths (N50) of above 1 Mbp in Bacteria and above 100 kbp in more complex organisms. Using real datasets we obtained a 96 kbp N50 in Pseudomonas syringae and a unique 147 kbp scaffold of a ferret BAC clone. We also present an efficient algorithm called Rock Band for the resolution of repeats in the case of mixed length assemblies, where different sequencing platforms are combined to obtain a cost-effective assembly. Conclusions These algorithms extend the utility of short read only assemblies into large complex genomes. They have been implemented and made available within the open-source Velvet short-read de novo assembler.
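The N50 statistic quoted in this abstract (a weighted median of scaffold lengths) can be computed as in the minimal Python sketch below. It uses the common definition: the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembled length. The function name `n50` is illustrative and is not taken from Velvet.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("lengths must be non-empty")
    # Walk from the longest sequence downward, accumulating length.
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first length at which the running sum reaches half the total is the N50.
        if running >= half_total:
            return length


if __name__ == "__main__":
    print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))
```

Because the statistic is weighted by length, a few long scaffolds dominate it; many short contigs barely move the value.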
Affiliation(s)
- Daniel R Zerbino
- European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, UK.
|
38
|
Wendl MC, Wilson RK. The theory of discovering rare variants via DNA sequencing. BMC Genomics 2009; 10:485. [PMID: 19843339 PMCID: PMC2778663 DOI: 10.1186/1471-2164-10-485] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.7] [Received: 03/20/2009] [Accepted: 10/20/2009] [Indexed: 11/18/2022] Open
Abstract
BACKGROUND Rare population variants are known to have important biomedical implications, but their systematic discovery has only recently been enabled by advances in DNA sequencing. The design process of a discovery project remains formidable, being limited to ad hoc mixtures of extensive computer simulation and pilot sequencing. Here, the task is examined from a general mathematical perspective. RESULTS We pose and solve the population sequencing design problem and subsequently apply standard optimization techniques that maximize the discovery probability. Emphasis is placed on cases whose discovery thresholds place them within reach of current technologies. We find that parameter values characteristic of rare-variant projects lead to a general, yet remarkably simple set of optimization rules. Specifically, optimal processing occurs at constant values of the per-sample redundancy, refuting current notions that sample size should be selected outright. Optimal project-wide redundancy and sample size are then shown to be inversely proportional to the desired variant frequency. A second family of constants governs these relationships, permitting one to immediately establish the most efficient settings for a given set of discovery conditions. Our results largely concur with the empirical design of the Thousand Genomes Project, though they furnish some additional refinement. CONCLUSION The optimization principles reported here dramatically simplify the design process and should be broadly useful as rare-variant projects become both more important and routine in the future.
Affiliation(s)
- Michael C Wendl
- The Genome Center and Department of Genetics, Washington University, St. Louis MO 63108, USA
- Richard K Wilson
- The Genome Center and Department of Genetics, Washington University, St. Louis MO 63108, USA
|
39
|
|
40
|
ChIP-seq: advantages and challenges of a maturing technology. Nat Rev Genet 2009. [PMID: 19736561 DOI: 10.1038/nrg2641] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/08/2022]
Abstract
Chromatin immunoprecipitation followed by sequencing (ChIP-seq) is a technique for genome-wide profiling of DNA-binding proteins, histone modifications or nucleosomes. Owing to the tremendous progress in next-generation sequencing technology, ChIP-seq offers higher resolution, less noise and greater coverage than its array-based predecessor ChIP-chip. With the decreasing cost of sequencing, ChIP-seq has become an indispensable tool for studying gene regulation and epigenetic mechanisms. In this Review, I describe the benefits and challenges in harnessing this technique with an emphasis on issues related to experimental design and data analysis. ChIP-seq experiments generate large quantities of data, and effective computational analysis will be crucial for uncovering biological mechanisms.
|
41
|
|
42
|
|
43
|
Gibbons JG, Janson EM, Hittinger CT, Johnston M, Abbot P, Rokas A. Benchmarking next-generation transcriptome sequencing for functional and evolutionary genomics. Mol Biol Evol 2009; 26:2731-44. [PMID: 19706727 DOI: 10.1093/molbev/msp188] [Citation(s) in RCA: 129] [Impact Index Per Article: 8.6] [Indexed: 12/25/2022] Open
Abstract
Next-generation sequencing has opened the door to genomic analysis of nonmodel organisms. Technologies generating long-sequence reads (200-400 bp) are increasingly used in evolutionary studies of nonmodel organisms, but the short-sequence reads (30-50 bp) that can be produced at lower cost are thought to be of limited utility for de novo sequencing applications. Here, we tested this assumption by short-read sequencing the transcriptomes of the tropical disease vectors Aedes aegypti and Anopheles gambiae, for which complete genome sequences are available. Comparison of our results to the reference genomes allowed us to accurately evaluate the quantity, quality, and functional and evolutionary information content of our "test" data. We produced more than 0.7 billion nucleotides of sequenced data per species that assembled into more than 21,000 test contigs larger than 100 bp per species and covered approximately 27% of the Aedes reference transcriptome. Remarkably, the substitution error rate in the test contigs was approximately 0.25% per site, with very few indels or assembly errors. Test contigs of both species were enriched for genes involved in energy production and protein synthesis and underrepresented in genes involved in transcription and differentiation. Ortholog prediction using the test contigs was accurate across hundreds of millions of years of evolution. Our results demonstrate the considerable utility of short-read transcriptome sequencing for genomic studies of nonmodel organisms and suggest an approach for assessing the information content of next-generation data for evolutionary studies.
Affiliation(s)
- John G Gibbons
- Department of Biological Sciences, Vanderbilt University, Nashville, TN, USA
|
44
|
Amaral AJ, Megens HJ, Kerstens HHD, Heuven HCM, Dibbits B, Crooijmans RPMA, den Dunnen JT, Groenen MAM. Application of massive parallel sequencing to whole genome SNP discovery in the porcine genome. BMC Genomics 2009; 10:374. [PMID: 19674453 PMCID: PMC2739861 DOI: 10.1186/1471-2164-10-374] [Citation(s) in RCA: 37] [Impact Index Per Article: 2.5] [Received: 02/23/2009] [Accepted: 08/12/2009] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Although the Illumina 1 G Genome Analyzer generates billions of base pairs of sequence data, challenges arise in sequence selection due to the varying sequence quality. Therefore, in the framework of the International Porcine SNP Chip Consortium, this pilot study aimed to evaluate the impact of the quality level of the sequenced bases on mapping quality and identification of true SNPs on a large scale. RESULTS DNA pooled from five animals from a commercial boar line was digested with DraI; 150-250-bp fragments were isolated and end-sequenced using the Illumina 1 G Genome Analyzer, yielding 70,348,064 sequences 36-bp long. Rules were developed to select sequences, which were then aligned to unique positions in a reference genome. Sequences were selected based on quality, and three thresholds of sequence quality (SQ) were compared. The highest threshold of SQ allowed identification of a larger number of SNPs (17,489), distributed widely across the pig genome. In total, 3,142 SNPs were validated with a success rate of 96%. The correlation between estimated minor allele frequency (MAF) and genotyped MAF was moderate, and SNPs were highly polymorphic in other pig breeds. Lowering the SQ threshold and maintaining the same criteria for SNP identification resulted in the discovery of fewer SNPs (16,768), of which 259 were not identified using higher SQ levels. Validation of SNPs found exclusively in the lower SQ threshold had a success rate of 94% and a low correlation between estimated MAF and genotyped MAF. Base change analysis suggested that the rate of transitions in the pig genome is likely to be similar to that observed in humans. Chromosome X showed reduced nucleotide diversity relative to autosomes, as observed for other species. CONCLUSION Large numbers of SNPs can be identified reliably by creating strict rules for sequence selection, which simultaneously decreases sequence ambiguity. 
Selection of sequences using a higher SQ threshold leads to more reliable identification of SNPs. Lower SQ thresholds can be used to guarantee sufficient sequence coverage, resulting in high success rate but less reliable MAF estimation. Nucleotide diversity varies between porcine chromosomes, with the X chromosome showing less variation as observed in other species.
Affiliation(s)
- Andreia J Amaral
- Animal Breeding and Genomics Centre, Wageningen University, Wageningen 6700 AH, The Netherlands.
|
45
|
Su Y, Lin L, Tian G, Chen C, Liu T, Xu X, Qi X, Zhang X, Yang H. Preparing a re-sequencing DNA library of 2 cancer candidate genes using the ligation-by-amplification protocol by two PCR reactions. Sci China C Life Sci 2009; 52:483-91. [PMID: 19471873 DOI: 10.1007/s11427-009-0066-8] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Received: 11/04/2008] [Accepted: 11/18/2008] [Indexed: 01/03/2023]
Abstract
To meet the needs of large-scale genomic/genetic studies, the next-generation massively parallelized sequencing technologies provide high throughput, low cost and low labor-intensive sequencing service, with subsequent bioinformatic software and laboratory methods developed to expand their applications in various types of research. PCR-based genomic/genetic studies, which have significant usage in association studies like cancer research, have not benefited much from those next-generation sequencing technologies, because the shotgun re-sequencing strategy used by such sequencing machines as the Illumina/Solexa Genome Analyzer may not be applied to direct re-sequencing of short-length target regions like those in PCR-based genomic/genetic studies. Although several methods have been proposed to solve this problem, including microarray-based genomic selections and selector-based technologies, they require advanced equipment and procedures which limit their applications in many laboratories. By contrast, we overcame such potential drawbacks by utilizing a ligation by amplification (LBA) protocol, a method using a pair of Universal Adapters to randomly ligate target regions in a two-step-PCR procedure, whose Long LBA products were easily fragmented and sequenced on the next-generation sequencing machine. In this proof-of-concept study, we chose the consensus coding sequences of two human cancer genes, BRCA1 and BRCA2, as target regions, and specifically designed LBA primer pairs to amplify and randomly ligate them. 70 target sequences were successfully amplified and ligated into Long LBA products, which were then fragmented to construct DNA libraries for sequencing on both a conventional Sanger sequencer ABI 3730xl DNA Analyzer and the next-generation 'sequencing by synthesis' Illumina/Solexa Genome Analyzer. 
Bioinformatic analysis demonstrated the utility and efficiency (including the coverage and depth of each target sequence and the SNPs detection effectiveness) of using the LBA protocol in facilitating PCR-based re-sequencing and genetic-variant-detection studies on the next-generation sequencing machine, raising the prospect of various PCR-based genomic/genetic studies using this strategy.
Affiliation(s)
- Yeyang Su
- Graduate School of Chinese Academy of Sciences, Beijing, 100049, China
|
46
|
Identification of EMS-induced mutations in Drosophila melanogaster by whole-genome sequencing. Genetics 2009; 182:25-32. [PMID: 19307605 DOI: 10.1534/genetics.109.101998] [Citation(s) in RCA: 105] [Impact Index Per Article: 7.0] [Indexed: 11/18/2022] Open
Abstract
Next-generation methods for rapid whole-genome sequencing enable the identification of single-base-pair mutations in Drosophila by comparing a chromosome bearing a new mutation to the unmutagenized sequence. To validate this approach, we sought to identify the molecular lesion responsible for a recessive EMS-induced mutation affecting egg shell morphology by using Illumina next-generation sequencing. After obtaining sufficient sequence from larvae that were homozygous for either wild-type or mutant chromosomes, we obtained high-quality reads for base pairs composing approximately 70% of the third chromosome of both DNA samples. We verified 103 single-base-pair changes between the two chromosomes. Nine changes were nonsynonymous mutations and two were nonsense mutations. One nonsense mutation was in a gene, encore, whose mutations produce an egg shell phenotype also observed in progeny of homozygous mutant mothers. Complementation analysis revealed that the chromosome carried a new functional allele of encore, demonstrating that one round of next-generation sequencing can identify the causative lesion for a phenotype of interest. This new method of whole-genome sequencing represents great promise for mutant mapping in flies, potentially replacing conventional methods.
|
47
|
Voelkerding KV, Dames SA, Durtschi JD. Next-generation sequencing: from basic research to diagnostics. Clin Chem 2009; 55:641-58. [PMID: 19246620 DOI: 10.1373/clinchem.2008.112789] [Citation(s) in RCA: 544] [Impact Index Per Article: 36.3] [Indexed: 12/13/2022]
Abstract
BACKGROUND For the past 30 years, the Sanger method has been the dominant approach and gold standard for DNA sequencing. The commercial launch of the first massively parallel pyrosequencing platform in 2005 ushered in the new era of high-throughput genomic analysis now referred to as next-generation sequencing (NGS). CONTENT This review describes fundamental principles of commercially available NGS platforms. Although the platforms differ in their engineering configurations and sequencing chemistries, they share a technical paradigm in that sequencing of spatially separated, clonally amplified DNA templates or single DNA molecules is performed in a flow cell in a massively parallel manner. Through iterative cycles of polymerase-mediated nucleotide extensions or, in one approach, through successive oligonucleotide ligations, sequence outputs in the range of hundreds of megabases to gigabases are now obtained routinely. Highlighted in this review are the impact of NGS on basic research, bioinformatics considerations, and translation of this technology into clinical diagnostics. Also presented is a view into future technologies, including real-time single-molecule DNA sequencing and nanopore-based sequencing. SUMMARY In the relatively short time frame since 2005, NGS has fundamentally altered genomics research and allowed investigators to conduct experiments that were previously not technically feasible or affordable. The various technologies that constitute this new paradigm continue to evolve, and further improvements in technology robustness and process streamlining will pave the path for translation into clinical diagnostics.
Affiliation(s)
- Karl V Voelkerding
- ARUP Institute for Experimental and Clinical Pathology, Salt Lake City, Utah 84108, USA.
|
48
|
Rokas A, Abbot P. Harnessing genomics for evolutionary insights. Trends Ecol Evol 2009; 24:192-200. [PMID: 19201503 DOI: 10.1016/j.tree.2008.11.004] [Citation(s) in RCA: 116] [Impact Index Per Article: 7.7] [Received: 05/29/2008] [Revised: 11/07/2008] [Accepted: 11/10/2008] [Indexed: 11/25/2022]
Abstract
Next-generation DNA sequencing technologies can generate unprecedented amounts of genomic data, even for non-model organisms. Here we describe how these new technologies have facilitated recent key advances in ecology and evolutionary biology, and highlight several outstanding ecological and evolutionary questions that are distinctly suited to the innovations they provide. Importantly, using these technologies to their full potential requires careful experimental design and critical consideration of several caveats associated with them. Although several significant challenges remain to be resolved before the integration of next-generation sequencing technologies into single-investigator research programs, we argue that they will soon transform ecology and evolution by fundamentally changing the ranges and types of questions that can be addressed.
Affiliation(s)
- Antonis Rokas
- Department of Biological Sciences, Vanderbilt University, VU Station B 35-1634, Nashville, TN 37235, USA.
|
49
|
Rozowsky J, Euskirchen G, Auerbach RK, Zhang ZD, Gibson T, Bjornson R, Carriero N, Snyder M, Gerstein MB. PeakSeq enables systematic scoring of ChIP-seq experiments relative to controls. Nat Biotechnol 2009; 27:66-75. [PMID: 19122651 DOI: 10.1038/nbt.1518] [Citation(s) in RCA: 466] [Impact Index Per Article: 31.1] [Received: 08/13/2008] [Accepted: 12/03/2008] [Indexed: 01/23/2023]
Abstract
Chromatin immunoprecipitation (ChIP) followed by tag sequencing (ChIP-seq) using high-throughput next-generation instrumentation is fast replacing chromatin immunoprecipitation followed by genome tiling array analysis (ChIP-chip) as the preferred approach for mapping of sites of transcription-factor binding and chromatin modification. Using two deeply sequenced data sets for human RNA polymerase II and STAT1, each with matching input-DNA controls, we describe a general scoring approach to address unique challenges in ChIP-seq data analysis. Our approach is based on the observation that sites of potential binding are strongly correlated with signal peaks in the control, likely revealing features of open chromatin. We develop a two-pass strategy called PeakSeq to compensate for this. A two-pass strategy compensates for signal caused by open chromatin, as revealed by inclusion of the controls. The first pass identifies putative binding sites and compensates for genomic variation in the 'mappability' of sequences. The second pass filters out sites not significantly enriched compared to the normalized control, computing precise enrichments and significances. Our scoring procedure enables us to optimize experimental design by estimating the depth of sequencing required for a desired level of coverage and demonstrating that more than two replicates provides only a marginal gain in information.
Affiliation(s)
- Joel Rozowsky
- Molecular Biophysics & Biochemistry Dept., Yale University, PO Box 208114, New Haven, Connecticut 06520-8114, USA.
|
50
|
|