1
|
Putative extremely high rate of proteome innovation in lancelets might be explained by high rate of gene prediction errors. Sci Rep 2016; 6:30700. [PMID: 27476717 PMCID: PMC4967905 DOI: 10.1038/srep30700] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.1] [Received: 04/26/2016] [Accepted: 07/06/2016] [Indexed: 01/17/2023]
Abstract
A recent analysis of the genomes of Chinese and Florida lancelets has concluded that the rate of creation of novel protein domain combinations is orders of magnitude greater in lancelets than in other metazoa and it was suggested that continuous activity of transposable elements in lancelets is responsible for this increased rate of protein innovation. Since morphologically Chinese and Florida lancelets are highly conserved, this finding would contradict the observation that high rates of protein innovation are usually associated with major evolutionary innovations. Here we show that the conclusion that the rate of proteome innovation is exceptionally high in lancelets may be unjustified: the differences observed in domain architectures of orthologous proteins of different amphioxus species probably reflect high rates of gene prediction errors rather than true innovation.
|
2
|
Schurch NJ, Cole C, Sherstnev A, Song J, Duc C, Storey KG, McLean WHI, Brown SJ, Simpson GG, Barton GJ. Improved annotation of 3' untranslated regions and complex loci by combination of strand-specific direct RNA sequencing, RNA-Seq and ESTs. PLoS One 2014; 9:e94270. [PMID: 24722185 PMCID: PMC3983147 DOI: 10.1371/journal.pone.0094270] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.2] [Received: 10/30/2013] [Accepted: 03/13/2014] [Indexed: 11/23/2022]
Abstract
The reference annotations made for a genome sequence provide the framework for all subsequent analyses of the genome. Correct and complete annotation in addition to the underlying genomic sequence is particularly important when interpreting the results of RNA-seq experiments where short sequence reads are mapped against the genome and assigned to genes according to the annotation. Inconsistencies in annotations between the reference and the experimental system can lead to incorrect interpretation of the effect on RNA expression of an experimental treatment or mutation in the system under study. Until recently, the genome-wide annotation of 3′ untranslated regions received less attention than coding regions and the delineation of intron/exon boundaries. In this paper, data produced for samples in Human, Chicken and A. thaliana by the novel single-molecule, strand-specific, Direct RNA Sequencing technology from Helicos Biosciences, which locates 3′ polyadenylation sites to within +/− 2 nt, were combined with archival EST and RNA-Seq data. Nine examples are illustrated where this combination of data allowed: (1) gene and 3′ UTR re-annotation (including extension of one 3′ UTR by 5.9 kb); (2) disentangling of gene expression in complex regions; (3) clearer interpretation of small RNA expression and (4) identification of novel genes. While the specific examples displayed here may become obsolete as genome sequences and their annotations are refined, the principles laid out in this paper will be of general use both to those annotating genomes and those seeking to interpret existing publicly available annotations in the context of their own experimental data.
Affiliation(s)
- Nicholas J. Schurch
- Division of Computational Biology, University of Dundee, Dundee, United Kingdom
- Division of Biological Chemistry and Drug Discovery, University of Dundee, Dundee, United Kingdom
- Centre for Gene Regulation and Expression, University of Dundee, Dundee, United Kingdom
- Christian Cole
- Division of Computational Biology, University of Dundee, Dundee, United Kingdom
- Division of Biological Chemistry and Drug Discovery, University of Dundee, Dundee, United Kingdom
- Centre for Gene Regulation and Expression, University of Dundee, Dundee, United Kingdom
- Alexander Sherstnev
- Division of Computational Biology, University of Dundee, Dundee, United Kingdom
- Junfang Song
- Division of Cell and Developmental Biology, University of Dundee, Dundee, United Kingdom
- Céline Duc
- Division of Plant Sciences, University of Dundee, Dundee, United Kingdom
- Kate G. Storey
- Division of Cell and Developmental Biology, University of Dundee, Dundee, United Kingdom
- W. H. Irwin McLean
- Centre for Dermatology and Genetic Medicine, University of Dundee, Dundee, United Kingdom
- Sara J. Brown
- Centre for Dermatology and Genetic Medicine, University of Dundee, Dundee, United Kingdom
- Gordon G. Simpson
- Division of Plant Sciences, University of Dundee, Dundee, United Kingdom
- Cell and Molecular Sciences, The James Hutton Institute, Dundee, United Kingdom
- Geoffrey J. Barton
- Division of Computational Biology, University of Dundee, Dundee, United Kingdom
- Division of Biological Chemistry and Drug Discovery, University of Dundee, Dundee, United Kingdom
- Centre for Gene Regulation and Expression, University of Dundee, Dundee, United Kingdom
|
3
|
Martinez M. From plant genomes to protein families: computational tools. Comput Struct Biotechnol J 2013; 8:e201307001. [PMID: 24688740 PMCID: PMC3962197 DOI: 10.5936/csbj.201307001] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Received: 05/31/2013] [Revised: 07/05/2013] [Accepted: 07/10/2013] [Indexed: 01/28/2023]
Abstract
The development of new high-throughput sequencing technologies has dramatically increased the number of successful genomic projects. As a result, draft genomic sequences of more than 60 plant species are currently available. Suitable bioinformatics tools are being developed to assemble, annotate and analyze the enormous number of sequences produced. In this context, plant-specific comparative genomic databases are becoming powerful tools for gene family annotation in plant clades. This mini-review summarizes the current state of the art of genomic projects and compiles the computational tools developed to compare genomic data.
Affiliation(s)
- Manuel Martinez
- Centro de Biotecnología y Genómica de Plantas (UPM-INIA), Campus Montegancedo, Universidad Politécnica de Madrid, Autovía M40 (Km 38), 28223-Pozuelo de Alarcón, Madrid, Spain
|
4
|
Ehrenkaufer GM, Weedall GD, Williams D, Lorenzi HA, Caler E, Hall N, Singh U. The genome and transcriptome of the enteric parasite Entamoeba invadens, a model for encystation. Genome Biol 2013; 14:R77. [PMID: 23889909 PMCID: PMC4053983 DOI: 10.1186/gb-2013-14-7-r77] [Citation(s) in RCA: 75] [Impact Index Per Article: 6.8] [Received: 04/05/2013] [Accepted: 07/26/2013] [Indexed: 12/27/2022]
Abstract
Background Several eukaryotic parasites form cysts that transmit infection. The process is found in diverse organisms such as Toxoplasma, Giardia, and nematodes. In Entamoeba histolytica this process cannot be induced in vitro, making it difficult to study. In Entamoeba invadens, stage conversion can be induced, but its utility as a model system to study developmental biology has been limited by a lack of genomic resources. We carried out genome and transcriptome sequencing of E. invadens to identify molecular processes involved in stage conversion. Results We report the sequencing and assembly of the E. invadens genome and use whole transcriptome sequencing to characterize changes in gene expression during encystation and excystation. The E. invadens genome is larger than that of E. histolytica, apparently largely due to expansion of intergenic regions; overall gene number and the machinery for gene regulation are conserved between the species. Over half the genes are regulated during the switch between morphological forms and a key signaling molecule, phospholipase D, appears to regulate encystation. We provide evidence for the occurrence of meiosis during encystation, suggesting that stage conversion may play a key role in recombination between strains. Conclusions Our analysis demonstrates that a number of core processes are common to encystation between distantly related parasites, including meiosis, lipid signaling and RNA modification. These data provide a foundation for understanding the developmental cascade in the important human pathogen E. histolytica and highlight conserved processes more widely relevant in enteric pathogens.
|
5
|
|
6
|
Reese MG, Moore B, Batchelor C, Salas F, Cunningham F, Marth GT, Stein L, Flicek P, Yandell M, Eilbeck K. A standard variation file format for human genome sequences. Genome Biol 2010; 11:R88. [PMID: 20796305 PMCID: PMC2945790 DOI: 10.1186/gb-2010-11-8-r88] [Citation(s) in RCA: 73] [Impact Index Per Article: 5.2] [Received: 04/29/2010] [Revised: 07/26/2010] [Accepted: 08/26/2010] [Indexed: 12/03/2022]
Abstract
Here we describe the Genome Variation Format (GVF) and the 10Gen dataset. GVF, an extension of Generic Feature Format version 3 (GFF3), is a simple tab-delimited format for DNA variant files, which uses Sequence Ontology to describe genome variation data. The 10Gen dataset, ten human genomes in GVF format, is freely available for community analysis from the Sequence Ontology website and from an Amazon elastic block storage (EBS) snapshot for use in Amazon's EC2 cloud computing environment.
Affiliation(s)
- Martin G Reese
- Omicia, 2200 Powell Street, Suite 525, Emeryville, CA 94608, USA.
|
7
|
Quantitative measures for the management and comparison of annotated genomes. BMC Bioinformatics 2009; 10:67. [PMID: 19236712 PMCID: PMC2653490 DOI: 10.1186/1471-2105-10-67] [Citation(s) in RCA: 91] [Impact Index Per Article: 6.1] [Received: 10/20/2008] [Accepted: 02/23/2009] [Indexed: 11/22/2022]
Abstract
Background The ever-increasing number of sequenced and annotated genomes has made management of their annotations a significant undertaking, especially for large eukaryotic genomes containing many thousands of genes. Typically, changes in gene and transcript numbers are used to summarize changes from release to release, but these measures say nothing about changes to individual annotations, nor do they provide any means to identify annotations in need of manual review. Results In response, we have developed a suite of quantitative measures to better characterize changes to a genome's annotations between releases, and to prioritize problematic annotations for manual review. We have applied these measures to the annotations of five eukaryotic genomes over multiple releases – H. sapiens, M. musculus, D. melanogaster, A. gambiae, and C. elegans. Conclusion Our results provide the first detailed, historical overview of how these genomes' annotations have changed over the years, and demonstrate the usefulness of these measures for genome annotation management.
|
8
|
Wilming L, Harrow J. Gene Annotation Methods. Bioinformatics 2009. [DOI: 10.1007/978-0-387-92738-1_6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/20/2022]
|
9
|
Clark TG, Andrew T, Cooper GM, Margulies EH, Mullikin JC, Balding DJ. Functional constraint and small insertions and deletions in the ENCODE regions of the human genome. Genome Biol 2008; 8:R180. [PMID: 17784950 PMCID: PMC2375018 DOI: 10.1186/gb-2007-8-9-r180] [Citation(s) in RCA: 31] [Impact Index Per Article: 1.9] [Received: 11/15/2006] [Revised: 09/04/2007] [Accepted: 09/04/2007] [Indexed: 11/10/2022]
Abstract
BACKGROUND We describe the distribution of indels in the 44 Encyclopedia of DNA Elements (ENCODE) regions (about 1% of the human genome) and evaluate the potential contributions of small insertion and deletion polymorphisms (indels) to human genetic variation. We relate indels to known genomic annotation features and measures of evolutionary constraint. RESULTS Indel rates are observed to be reduced approximately 20-fold to 60-fold in exonic regions, 5-fold to 10-fold in sequence that exhibits high evolutionary constraint in mammals, and up to 2-fold in some classes of regulatory elements (for instance, formaldehyde assisted isolation of regulatory elements [FAIRE] and hypersensitive sites). In addition, some noncoding transcription and other chromatin mediated regulatory sites also have reduced indel rates. Overall indel rates for these data are estimated to be smaller than single nucleotide polymorphism (SNP) rates by a factor of approximately 2, with both rates measured as base pairs per 100 kilobases to facilitate comparison. CONCLUSION Indel rates exhibit a broadly similar distribution across genomic features compared with SNP density rates, with a reduction in rates in coding transcription and evolutionarily constrained sequence. However, unlike indels, SNP rates do not appear to be reduced in some noncoding functional sequences, such as pseudo-exons, and FAIRE and hypersensitive sites. We conclude that indel rates are greatly reduced in transcribed and evolutionarily constrained DNA, and discuss why indel (but not SNP) rates appear to be constrained at some regulatory sites.
Affiliation(s)
- Taane G Clark
- Department of Epidemiology and Public Health, Imperial College, Norfolk Place, London, W2 1PG, UK
- Toby Andrew
- Department of Epidemiology and Public Health, Imperial College, Norfolk Place, London, W2 1PG, UK
- Gregory M Cooper
- Department of Genetics, Stanford University, Stanford, California 94305, USA
- Elliott H Margulies
- National Human Genome Research Institute, National Institutes of Health, 9000 Rockville Pike, Bethesda, Maryland 20892, USA
- James C Mullikin
- National Human Genome Research Institute, National Institutes of Health, 9000 Rockville Pike, Bethesda, Maryland 20892, USA
- David J Balding
- Department of Epidemiology and Public Health, Imperial College, Norfolk Place, London, W2 1PG, UK
|
10
|
Goñi JR, Pérez A, Torrents D, Orozco M. Determining promoter location based on DNA structure first-principles calculations. Genome Biol 2008; 8:R263. [PMID: 18072969 PMCID: PMC2246265 DOI: 10.1186/gb-2007-8-12-r263] [Citation(s) in RCA: 105] [Impact Index Per Article: 6.6] [Received: 09/12/2007] [Revised: 11/24/2007] [Accepted: 12/11/2007] [Indexed: 11/25/2022]
Abstract
A new method is presented which predicts promoter regions based on atomistic molecular dynamics simulations of small oligonucleotides, without requiring information on sequence conservation or features. The method works independently of gene structure conservation and orthology and of the presence of detectable sequence features. Results obtained with our method confirm the existence of a hidden physical code that modulates genome expression.
Affiliation(s)
- J Ramon Goñi
- Institute for Research in Biomedicine, Parc Científic de Barcelona, Josep Samitier, Barcelona 08028, Spain
|
11
|
Haas BJ, Salzberg SL, Zhu W, Pertea M, Allen JE, Orvis J, White O, Buell CR, Wortman JR. Automated eukaryotic gene structure annotation using EVidenceModeler and the Program to Assemble Spliced Alignments. Genome Biol 2008; 9:R7. [PMID: 18190707 PMCID: PMC2395244 DOI: 10.1186/gb-2008-9-1-r7] [Citation(s) in RCA: 1858] [Impact Index Per Article: 116.1] [Received: 09/26/2007] [Revised: 12/17/2007] [Accepted: 01/11/2008] [Indexed: 01/16/2023]
Abstract
EVidenceModeler (EVM) is presented as an automated eukaryotic gene structure annotation tool that reports eukaryotic gene structures as a weighted consensus of all available evidence. EVM, when combined with the Program to Assemble Spliced Alignments (PASA), yields a comprehensive, configurable annotation system that predicts protein-coding genes and alternatively spliced isoforms. Our experiments on both rice and human genome sequences demonstrate that EVM produces automated gene structure annotation approaching the quality of manual curation.
Affiliation(s)
- Brian J Haas
- J Craig Venter Institute, The Institute for Genomic Research, Rockville, Maryland 20850, USA.
|