1
|
Li H. Protein-to-genome alignment with miniprot. Bioinformatics 2023; 39:btad014. [PMID: 36648328 PMCID: PMC9869432 DOI: 10.1093/bioinformatics/btad014] [Citation(s) in RCA: 136] [Impact Index Per Article: 68.0] [Received: 10/13/2022] [Revised: 12/25/2022] [Accepted: 01/16/2023] [Indexed: 01/18/2023]
Abstract
MOTIVATION: Protein-to-genome alignment is critical to annotating genes in non-model organisms. While there are a few tools for this purpose, all of them were developed over 10 years ago and did not incorporate the latest advances in alignment algorithms. They are inefficient and could not keep up with the rapid production of new genomes and quickly growing protein databases. RESULTS: Here, we describe miniprot, a new aligner for mapping protein sequences to a complete genome. Miniprot integrates recent techniques such as k-mer sketch and vectorized dynamic programming. It is tens of times faster than existing tools while achieving comparable accuracy on real data. AVAILABILITY AND IMPLEMENTATION: https://github.com/lh3/miniprot.
Affiliation(s)
- Heng Li
- Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA 02215, USA
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA 02115, USA
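The abstract above names "k-mer sketch" as one of the techniques miniprot integrates. One common form of k-mer sketching is minimizer sampling: from every window of consecutive k-mers, keep only the smallest one. The sketch below is a generic illustration under that assumption, not miniprot's implementation; the function name, the lexicographic ordering (real tools usually order k-mers by a hash), and the values of `k` and `w` are all illustrative choices.

```python
def minimizers(seq, k=5, w=4):
    """Return sorted (position, k-mer) pairs selected by minimizer sampling.

    For every window of w consecutive k-mers, the lexicographically smallest
    k-mer is kept (the leftmost one on ties). Illustrative only: real aligners
    typically rank k-mers by a hash value rather than alphabetically.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = set()
    for start in range(len(kmers) - w + 1):
        window = range(start, start + w)
        # Break ties by position so the leftmost smallest k-mer wins.
        best = min(window, key=lambda i: (kmers[i], i))
        picked.add((best, kmers[best]))
    return sorted(picked)


if __name__ == "__main__":
    for pos, kmer in minimizers("ACGTACGTAC", k=3, w=3):
        print(pos, kmer)
```

Because adjacent windows often share their smallest k-mer, the sketch is smaller than the full k-mer list, which is what makes indexing a whole genome cheaper.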
|
2
|
Abstract
Newly sequenced genomes are being added to the tree of life at an unprecedentedly fast pace. Increasingly, such new genomes are phylogenetically close to previously sequenced and annotated genomes. In other cases, whole clades of closely related species or strains ought to be annotated simultaneously. Often, subsequent studies focus on the differences between the closely related species or strains, although the shared gene structures prevail. We here review methods for comparative structural genome annotation. The reviewed methods include classical approaches such as the alignment of protein sequences or protein profiles against the genome and comparative gene prediction methods that exploit a genome alignment to annotate a target genome. Newer approaches such as the simultaneous annotation of multiple genomes are also reviewed. We discuss how the methods depend on the phylogenetic placement of genomes, give advice on the choice of methods, and examine the consistency between gene structure annotations in an example. Further, we provide practical advice on genome annotation in general.
Affiliation(s)
- Stefanie König
- Institut für Mathematik und Informatik, Ernst Moritz Arndt Universität Greifswald, Greifswald, Germany
- Lars Romoth
- Institut für Mathematik und Informatik, Ernst Moritz Arndt Universität Greifswald, Greifswald, Germany
- Mario Stanke
- Institut für Mathematik und Informatik, Ernst Moritz Arndt Universität Greifswald, Greifswald, Germany
|
3
|
A Comprehensive Review of Emerging Computational Methods for Gene Identification. JOURNAL OF INFORMATION PROCESSING SYSTEMS 2016. [DOI: 10.3745/jips.04.0023] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 11/03/2022]
|
4
|
Differential pre-mRNA Splicing Alters the Transcript Diversity of Helitrons Between the Maize Inbred Lines. G3-GENES GENOMES GENETICS 2015; 5:1703-11. [PMID: 26070844 PMCID: PMC4528327 DOI: 10.1534/g3.115.018630] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/18/2022]
Abstract
The propensity to capture and mobilize gene fragments by the highly abundant Helitron family of transposable elements likely impacts the evolution of genes in Zea mays. These elements provide a substrate for natural selection by giving birth to chimeric transcripts by intertwining exons of disparate genes. They also capture flanking exons by read-through transcription. Here, we describe the expression of selected Helitrons in different maize inbred lines. We recently reported that these Helitrons produce multiple isoforms of transcripts in inbred B73 via alternative splicing. Despite sharing high degrees of sequence similarity, the splicing profile of Helitrons differed among various maize inbred lines. The comparison of Helitron sequences identified unique polymorphisms in inbred B73, which potentially give rise to the alternatively spliced sites utilized by transcript isoforms. Some alterations in splicing, however, do not have obvious explanations. These observations not only add another level to the creation of transcript diversity by Helitrons among inbred lines but also provide novel insights into the cis-acting elements governing splice-site selection during pre-mRNA processing.
|
5
|
Rauch HB, Patrick TL, Klusman KM, Battistuzzi FU, Mei W, Brendel VP, Lal SK. Discovery and expression analysis of alternative splicing events conserved among plant SR proteins. Mol Biol Evol 2013; 31:605-13. [PMID: 24356560 DOI: 10.1093/molbev/mst238] [Citation(s) in RCA: 30] [Impact Index Per Article: 2.5] [Indexed: 02/05/2023]
Abstract
The high frequency of alternative splicing among the serine/arginine-rich (SR) family of proteins in plants has been linked to important roles in gene regulation during development and in response to environmental stress. In this article, we have searched and manually annotated all the SR proteins in the genomes of maize and sorghum. The experimental validation of gene structure by reverse transcription-polymerase chain reaction (RT-PCR) analysis revealed, with few exceptions, that SR genes produced multiple isoforms of transcripts by alternative splicing. Despite sharing high structural similarity and conserved positions of the introns, the profile of alternative splicing diverged significantly between maize and sorghum for the vast majority of SR genes. These include many transcript isoforms discovered by RT-PCR and not represented in extant expressed sequence tag (EST) collection. However, we report the occurrence of various maize and sorghum SR mRNA isoforms that display evolutionary conservation of splicing events with their homologous SR genes in Arabidopsis and moss. Our data also indicate an important role of both 5' and 3' untranslated regions in the regulation of SR gene expression. These observations have potentially important implications for the processes of evolution and adaptation of plants to land.
|
6
|
|
7
|
Abstract
Helitrons are a family of mobile elements that were discovered in 2001 and are now known to exist in the entire eukaryotic kingdom. Helitrons, particularly those of maize, exhibit an intriguing property of capturing gene fragments and placing them into the mobile element. Helitron-captured genes are sometimes transcribed, giving birth to chimeric transcripts that intertwine coding regions of different captured genes. Here, we perused the B73 maize genome for high-quality, putative Helitrons that exhibit plus/minus polymorphisms and contain pieces of more than one captured gene. Selected Helitrons were monitored for expression via in silico EST analysis. Intriguingly, expression validation of selected elements by RT–PCR analysis revealed multiple transcripts not seen in the EST databases. The differing transcripts were generated by alternative selection of splice sites during pre-mRNA processing. Selection of splice sites was not random since different patterns of splicing were observed in the root and shoot tissues. In one case, an exon residing in close proximity but outside of the Helitron was found conjoined with Helitron-derived exons in the mature transcript. Hence, Helitrons have the ability to synthesize new genes not only by placing unrelated exons into common transcripts, but also by transcription readthrough and capture of nearby exons. Thus, Helitrons have a phenomenal ability to “display” new coding regions for possible selection in nature. A highly conservative, minimum estimate of the number of new transcripts expressed by Helitrons is ∼11,000 or ∼25% of the total number of genes in the maize genome.
|
8
|
Inagaki YS, Etherington G, Geisler K, Field B, Dokarry M, Ikeda K, Mutsukado Y, Dicks J, Osbourn A. Investigation of the potential for triterpene synthesis in rice through genome mining and metabolic engineering. THE NEW PHYTOLOGIST 2011; 191:432-448. [PMID: 21501172 DOI: 10.1111/j.1469-8137.2011.03712.x] [Citation(s) in RCA: 34] [Impact Index Per Article: 2.4] [Indexed: 05/30/2023]
Abstract
The first committed step in sterol biosynthesis in plants involves the cyclization of 2,3-oxidosqualene by the oxidosqualene cyclase (OSC) enzyme cycloartenol synthase. 2,3-Oxidosqualene is also a precursor for triterpene synthesis. Antimicrobial triterpenes are common in dicots, but seldom found in monocots, with the notable exception of oat. Here, through genome mining and metabolic engineering, we investigate the potential for triterpene synthesis in rice. The first two steps in the oat triterpene pathway are catalysed by a divergent OSC (AsbAS1) and a cytochrome P450 (CYP51). The genes for these enzymes form part of a metabolic gene cluster. To investigate the origins of triterpene synthesis in monocots, we analysed systematically the OSC and CYP51 gene families in rice. We also engineered rice for elevated triterpene content. We discovered a total of 12 OSC and 12 CYP51 genes in rice and uncovered key events in the evolution of triterpene synthesis. We further showed that the expression of AsbAS1 in rice leads to the accumulation of the simple triterpene, β-amyrin. These findings provide new insights into the evolution of triterpene synthesis in monocots and open up opportunities for metabolic engineering for disease resistance in rice and other cereals.
Affiliation(s)
- Yoshi-Shige Inagaki
- Department of Metabolic Biology, John Innes Centre, Norwich Research Park, Norwich NR4 7UH, UK
- Plant Pathology and Genetic Engineering Laboratory, Faculty of Agriculture, Tsushima-naka 1-1-1, Okayama University, Okayama 700-8530, Japan
- Graham Etherington
- Department of Computational and Systems Biology, John Innes Centre, Norwich Research Park, Norwich NR4 7UH, UK
- Katrin Geisler
- Department of Metabolic Biology, John Innes Centre, Norwich Research Park, Norwich NR4 7UH, UK
- Ben Field
- Department of Metabolic Biology, John Innes Centre, Norwich Research Park, Norwich NR4 7UH, UK
- Melissa Dokarry
- Department of Metabolic Biology, John Innes Centre, Norwich Research Park, Norwich NR4 7UH, UK
- Kousuke Ikeda
- Plant Pathology and Genetic Engineering Laboratory, Faculty of Agriculture, Tsushima-naka 1-1-1, Okayama University, Okayama 700-8530, Japan
- Yukako Mutsukado
- Plant Pathology and Genetic Engineering Laboratory, Faculty of Agriculture, Tsushima-naka 1-1-1, Okayama University, Okayama 700-8530, Japan
- Jo Dicks
- Department of Computational and Systems Biology, John Innes Centre, Norwich Research Park, Norwich NR4 7UH, UK
- Anne Osbourn
- Department of Metabolic Biology, John Innes Centre, Norwich Research Park, Norwich NR4 7UH, UK
|
9
|
Ahmed F, Benedito VA, Zhao PX. Mining Functional Elements in Messenger RNAs: Overview, Challenges, and Perspectives. FRONTIERS IN PLANT SCIENCE 2011; 2:84. [PMID: 22639614 PMCID: PMC3355573 DOI: 10.3389/fpls.2011.00084] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.7] [Received: 08/31/2011] [Accepted: 11/03/2011] [Indexed: 05/03/2023]
Abstract
Eukaryotic messenger RNA (mRNA) contains not only protein-coding regions but also a plethora of functional cis-elements that influence or coordinate a number of regulatory aspects of gene expression, such as mRNA stability, splicing forms, and translation rates. Understanding the rules that apply to each of these element types (e.g., whether the element is defined by primary or higher-order structure) allows for the discovery of novel mechanisms of gene expression as well as the design of transcripts with controlled expression. Bioinformatics plays a major role in creating databases and finding non-evident patterns governing each type of eukaryotic functional element. Much of what we currently know about mRNA regulatory elements in eukaryotes is derived from microorganism and animal systems, with the particularities of plant systems lagging behind. In this review, we provide a general introduction to the most well-known eukaryotic mRNA regulatory motifs (splicing regulatory elements, internal ribosome entry sites, iron-responsive elements, AU-rich elements, zipcodes, and polyadenylation signals) and describe available bioinformatics resources (databases and analysis tools) to analyze eukaryotic transcripts in search of functional elements, focusing on recent trends in bioinformatics methods and tool development. We also discuss future directions in the development of better computational tools based upon current knowledge of these functional elements. Improved computational tools would advance our understanding of the processes underlying gene regulations. We encourage plant bioinformaticians to turn their attention to this subject to help identify novel mechanisms of gene expression regulation using RNA motifs that have potentially evolved or diverged in plant species.
Affiliation(s)
- Firoz Ahmed
- Bioinformatics Laboratory, Plant Biology Division, Samuel Roberts Noble Foundation, Ardmore, OK, USA
- Vagner A. Benedito
- Genetics and Developmental Biology, Plant and Soil Sciences Division, West Virginia University, Morgantown, WV, USA
- Patrick Xuechun Zhao
- Bioinformatics Laboratory, Plant Biology Division, Samuel Roberts Noble Foundation, Ardmore, OK, USA
- *Correspondence: Patrick Xuechun Zhao, Bioinformatics Laboratory, Plant Biology Division, Samuel Roberts Noble Foundation, 2510 Sam Noble Parkway, Ardmore, OK 73401, USA e-mail:
|
10
|
Abstract
High-throughput DNA sequencing is increasing the number of publicly available complete genomes, even though a precise gene catalogue for each organism is not yet available. In this context, computational gene finders play a key role in producing a first and cost-effective annotation. A large collection of gene prediction tools is now available to the scientific community and, despite their number, they can be divided into two main categories: (1) ab initio and (2) evidence based. In the following, we provide an overview of the main methodologies in these categories for predicting correct exon-intron structures of eukaryotic genes. We also take into account new strategies that commonly refine ab initio predictions by employing comparative genomics or other evidence such as expression data. Finally, we briefly introduce metrics for the in-house evaluation of gene predictions in terms of sensitivity and specificity at the nucleotide, exon, and gene levels.
Affiliation(s)
- Ernesto Picardi
- Dipartimento di Biochimica e Biologia Molecolare E Quagliariello, University of Bari, Bari, Italy
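The abstract above mentions evaluating gene predictions by sensitivity and specificity at the nucleotide level. A minimal sketch of that calculation follows, assuming the convention common in the gene-finding literature where sensitivity is TP / (TP + FN) and "specificity" is TP / (TP + FP), that is, the fraction of predicted coding bases that are truly coding. The abstract does not give formulas, so this convention, the function name, and the representation of annotations as sets of coding positions are assumptions.

```python
def nucleotide_accuracy(true_coding, predicted_coding):
    """Return (sensitivity, specificity) at the nucleotide level.

    Both arguments are sets of genomic positions annotated as coding.
    Specificity here follows the gene-prediction convention TP / (TP + FP),
    which other fields call precision (an assumption, see lead-in).
    """
    tp = len(true_coding & predicted_coding)   # coding bases predicted as coding
    fn = len(true_coding - predicted_coding)   # coding bases that were missed
    fp = len(predicted_coding - true_coding)   # non-coding bases predicted as coding
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tp / (tp + fp) if (tp + fp) else 0.0
    return sensitivity, specificity


if __name__ == "__main__":
    # Hypothetical exon at positions 100-199, predicted as 150-249.
    truth = set(range(100, 200))
    prediction = set(range(150, 250))
    print(nucleotide_accuracy(truth, prediction))
```

The same counting can be repeated with whole exons or whole genes as the unit, where a prediction only counts as a true positive if its boundaries match exactly.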
|
11
|
Labate JA, Robertson LD, Wu F, Tanksley SD, Baldo AM. EST, COSII, and arbitrary gene markers give similar estimates of nucleotide diversity in cultivated tomato (Solanum lycopersicum L.). TAG. THEORETICAL AND APPLIED GENETICS. THEORETISCHE UND ANGEWANDTE GENETIK 2009; 118:1005-14. [PMID: 19153710 DOI: 10.1007/s00122-008-0957-2] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.6] [Received: 06/03/2008] [Accepted: 12/20/2008] [Indexed: 05/13/2023]
Abstract
Because cultivated tomato (Solanum lycopersicum L.) is low in genetic diversity, public, verified single nucleotide polymorphism (SNP) markers within the species are in demand. To promote marker development we resequenced approximately 23 kb in a diverse set of 31 tomato lines including TA496. Three classes of markers were sampled: (1) 26 expressed-sequence tag (EST), all of which were predicted to be polymorphic based on TA496, (2) 14 conserved ortholog set II (COSII) or unigene, and (3) ten published sequences, composed of nine fruit quality genes and one anonymous RFLP marker. The latter two types contained mostly noncoding DNA. In total, 154 SNPs and 34 indels were observed. The distributions of nucleotide diversity estimates among marker types were not significantly different from each other. Ascertainment bias of SNPs was evaluated for the EST markers. Despite the fact that the EST markers were developed using SNP prediction within a sample consisting of only one TA496 allele and one additional allele, the majority of polymorphisms in the 26 EST markers were represented among the other 30 tomato lines. Fifteen EST markers with published SNPs were more closely examined for bias. Mean SNP diversity observations were not significantly different between the original discovery sample of two lines (53 SNPs) and the 31 line diversity panel (56 SNPs). Furthermore, TA496 shared its haplotype with at least one other line at 11 of the 15 markers. These data demonstrate that public EST databases and noncoding regions are a valuable source of unbiased SNP markers in tomato.
Affiliation(s)
- Joanne A Labate
- USDA-ARS Plant Genetic Resources Unit, 630 W. North Street, Geneva, NY 14456, USA.
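The abstract above compares nucleotide diversity estimates across marker classes. As a rough illustration of what such an estimate measures, the sketch below computes the average number of pairwise differences per site over a set of aligned sequences. The abstract does not say which estimator or software the authors used, so the function name and the treatment of data (equal-length sequences, no handling of gaps or missing bases) are assumptions.

```python
from itertools import combinations


def nucleotide_diversity(seqs):
    """Average pairwise differences per site for equal-length aligned sequences.

    A simplified estimator: every column is compared as-is, so gaps and
    ambiguous bases are not given any special treatment.
    """
    n = len(seqs)
    if n < 2:
        raise ValueError("need at least two sequences")
    length = len(seqs[0])
    if length == 0 or any(len(s) != length for s in seqs):
        raise ValueError("sequences must be non-empty and of equal length")
    # Sum mismatching positions over all unordered pairs of sequences.
    total_diffs = sum(
        sum(a != b for a, b in zip(s1, s2))
        for s1, s2 in combinations(seqs, 2)
    )
    n_pairs = n * (n - 1) // 2
    return total_diffs / (n_pairs * length)


if __name__ == "__main__":
    # Three hypothetical 4-bp haplotypes, one carrying a single SNP.
    print(nucleotide_diversity(["ACGT", "ACGA", "ACGT"]))
```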
|
12
|
Abstract
MOTIVATION: Finding protein-coding genes in a newly determined genomic sequence is the first step toward understanding the content written in the genome. Sequences of transcripts of homologous genes, if available, can considerably improve accuracy of prediction of genes and their structures, compared with that without such knowledge. As protein sequences are generally better conserved than nucleotide sequences, remote homologs can be used as templates, extending the applicability of evidence-based gene recognition methods. However, no tool seems to have been developed so far to simultaneously map and align a number of protein sequences onto a mammalian-sized genomic sequence. RESULTS: We have extended our computer program Spaln to accept protein sequences, as well as cDNA sequences, as queries. When the query and the target sequences are reasonably similar, e.g. between mammalian orthologs, Spaln runs one to two orders of magnitude faster than conventional approaches that rely on Blast search followed by dynamic-programming-based spliced alignment. Exon-level and gene-level accuracies of Spaln are significantly higher than those obtained by the best available methods of the same type, particularly when the query and the target are distantly related. AVAILABILITY: Spaln is accessible online for a few species at http://www.genome.ist.i.kyoto-u.ac.jp/~aln_user. The source code is available for free for academic users from the same site.
Affiliation(s)
- Osamu Gotoh
- Department of Intelligence Science and Technology, Graduate School of Informatics, Kyoto University, Yoshida Honmachi, Sakyo-ku, Kyoto 606-8501, Japan.
|
13
|
Tasma IM, Brendel V, Whitham SA, Bhattacharyya MK. Expression and evolution of the phosphoinositide-specific phospholipase C gene family in Arabidopsis thaliana. PLANT PHYSIOLOGY AND BIOCHEMISTRY : PPB 2008; 46:627-637. [PMID: 18534862 DOI: 10.1016/j.plaphy.2008.04.015] [Citation(s) in RCA: 83] [Impact Index Per Article: 4.9] [Received: 10/13/2007] [Indexed: 05/04/2023]
Abstract
Phosphoinositide-specific phospholipase C cleaves the substrate phosphatidylinositol 4,5-bisphosphate and generates inositol 1,4,5-trisphosphate and 1,2-diacylglycerol, both of which are second messengers in the phosphoinositide signal transduction pathways operative in animal cells. Five PI-PLC isoforms, beta, gamma, delta, epsilon and zeta, have been identified in mammals. Plant PI-PLCs are structurally close to the mammalian PI-PLC-zeta isoform. The Arabidopsis genome contains nine AtPLC genes. Expression patterns of all nine genes in different organs and in response to various environmental stimuli were studied by applying a quantitative RT-PCR approach. Multiple members of the gene family were differentially expressed in Arabidopsis organs, suggesting putative roles for this enzyme in plant development, including tissue and organ differentiation. This study also shows that a majority of the AtPLC genes are induced in response to various environmental stimuli, including cold, salt, nutrients (Murashige-Skoog salts), dehydration, and the plant hormone abscisic acid. Results of this and previous studies strongly suggest that transcriptional activation of the PI-PLC gene family is important for adapting plants to stress environments. Expression patterns and phylogenetic relationships indicate that AtPLC gene members probably evolved through multiple rounds of gene duplication events, with the pairs AtPLC4/AtPLC5 and AtPLC8/AtPLC9 being duplicated in tandem in recent times.
Affiliation(s)
- I Made Tasma
- Department of Agronomy, Iowa State University, G303 Agronomy Hall, Ames, IA 50011, USA
- Volker Brendel
- Department of Genetics, Development and Cell Biology and Department of Statistics, Iowa State University, Ames, IA 50011, USA
- Steven A Whitham
- Department of Plant Pathology, Iowa State University, Ames, IA 50011, USA
- Madan K Bhattacharyya
- Department of Agronomy, Iowa State University, G303 Agronomy Hall, Ames, IA 50011, USA
|
14
|
Vergara IA, Norambuena T, Ferrada E, Slater AW, Melo F. StAR: a simple tool for the statistical comparison of ROC curves. BMC Bioinformatics 2008; 9:265. [PMID: 18534022 PMCID: PMC2435548 DOI: 10.1186/1471-2105-9-265] [Citation(s) in RCA: 135] [Impact Index Per Article: 7.9] [Received: 10/01/2007] [Accepted: 06/05/2008] [Indexed: 02/08/2023]
Abstract
BACKGROUND: As in many different areas of science and technology, most important problems in bioinformatics rely on the proper development and assessment of binary classifiers. A generalized assessment of the performance of binary classifiers is typically carried out through the analysis of their receiver operating characteristic (ROC) curves. The area under the ROC curve (AUC) constitutes a popular indicator of the performance of a binary classifier. However, the assessment of the statistical significance of the difference between any two classifiers based on this measure is not a straightforward task, since not many freely available tools exist. Most existing software is either not free, difficult to use or not easy to automate when a comparative assessment of the performance of many binary classifiers is intended. This constitutes the typical scenario for the optimization of parameters when developing new classifiers and also for their performance validation through the comparison to previous art. RESULTS: In this work we describe and release new software to assess the statistical significance of the observed difference between the AUCs of any two classifiers for a common task estimated from paired data or unpaired balanced data. The software is able to perform a pairwise comparison of many classifiers in a single run, without requiring any expert or advanced knowledge to use it. The software relies on a non-parametric test for the difference of the AUCs that accounts for the correlation of the ROC curves. The results are displayed graphically and can be easily customized by the user. A human-readable report is generated and the complete data resulting from the analysis are also available for download, which can be used for further analysis with other software. The software is released as a web server that can be used in any client platform and also as a standalone application for the Linux operating system.
CONCLUSION: New software for the statistical comparison of ROC curves is released here as a web server and also as standalone software for the Linux operating system.
Affiliation(s)
- Ismael A Vergara
- Departamento de Genética Molecular y Microbiología, Facultad de Ciencias Biológicas, Pontificia Universidad Católica de Chile, Alameda 340, Santiago, Chile
- Tomás Norambuena
- Departamento de Genética Molecular y Microbiología, Facultad de Ciencias Biológicas, Pontificia Universidad Católica de Chile, Alameda 340, Santiago, Chile
- Evandro Ferrada
- Departamento de Genética Molecular y Microbiología, Facultad de Ciencias Biológicas, Pontificia Universidad Católica de Chile, Alameda 340, Santiago, Chile
- Alex W Slater
- Departamento de Genética Molecular y Microbiología, Facultad de Ciencias Biológicas, Pontificia Universidad Católica de Chile, Alameda 340, Santiago, Chile
- Francisco Melo
- Departamento de Genética Molecular y Microbiología, Facultad de Ciencias Biológicas, Pontificia Universidad Católica de Chile, Alameda 340, Santiago, Chile
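The StAR abstract above treats the area under the ROC curve (AUC) as the quantity being compared between classifiers. A minimal sketch of how an AUC can be computed from raw scores is given below, using the equivalence between the AUC and the probability that a randomly chosen positive scores higher than a randomly chosen negative. This only illustrates the quantity itself; the non-parametric significance test that StAR applies to the difference of two AUCs is not reproduced here, and the function name and example scores are hypothetical.

```python
def auc_from_scores(pos_scores, neg_scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    Runs in O(len(pos) * len(neg)) time, which is fine for small examples.
    """
    if not pos_scores or not neg_scores:
        raise ValueError("need at least one positive and one negative score")
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


if __name__ == "__main__":
    # Two hypothetical classifiers scored on the same cases (paired data).
    auc_a = auc_from_scores([0.9, 0.8, 0.4], [0.7, 0.3, 0.2])
    auc_b = auc_from_scores([0.6, 0.5, 0.4], [0.7, 0.3, 0.2])
    print(auc_a, auc_b, auc_a - auc_b)
```

Whether an observed difference such as `auc_a - auc_b` is statistically significant is exactly the question the tool described above addresses.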
|
15
|
Jameson N, Georgelis N, Fouladbash E, Martens S, Hannah LC, Lal S. Helitron mediated amplification of cytochrome P450 monooxygenase gene in maize. PLANT MOLECULAR BIOLOGY 2008; 67:295-304. [PMID: 18327644 DOI: 10.1007/s11103-008-9318-4] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.4] [Received: 07/23/2007] [Accepted: 02/22/2008] [Indexed: 05/13/2023]
Abstract
The mass movement of gene sequences by Helitrons has significantly contributed to the lack of gene collinearity reported between different maize inbred lines. However, Helitron captured-genes reported to date represent truncated versions of their progenitor genes. In this report, we provide evidence that maize CYP72A27-Zm gene represents a cytochrome P450 monooxygenase (P450) gene recently captured by a Helitron and transposed into an Opie-2 retroposon. The four exons of the CYP72A27 gene contained within the element contain a putative open reading frame (ORF) for 428 amino acid residues. We provide evidence that Helitron captured CYP72A27-Zm is transcribed. To identify the progenitor gene and the evolutionary time of capture, we searched the plant genome database and discovered other closely related CYP72A27-Zm genes in maize and grasses. Our analysis indicates that CYP72A27-Zm represents an almost complete copy of maize CYP72A26-Zm gene captured by a Helitron about 3.1 million years ago (mya). The Helitron-captured gene then duplicated twice, approximately 1.5-1.6 mya giving rise to CYP72A36-Zm and CYP72A37-Zm. These data provide evidence that Helitrons can capture and mobilize intact genes that are transcribed and potentially encode biologically relevant proteins.
Affiliation(s)
- Natalie Jameson
- Department of Biological Sciences, Oakland University, Rochester, MI 48309-4401, USA
|
16
|
Gupta S, Ciungu A, Jameson N, Lal SK. Alternative splicing expression of U1 snRNP 70K gene is evolutionary conserved between different plant species. DNA SEQUENCE 2007; 17:254-61. [PMID: 17312944 DOI: 10.1080/10425170600856642] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Indexed: 10/23/2022]
Abstract
A U1-snRNP-specific 70K (U1-70K) protein is intricately involved in both constitutive and alternative splicing of pre-mRNAs. Here, we report cDNA and cognate genomic sequences of the U1-70K gene of maize and rice. The maize and rice U1-70K genes bear strong similarity to the Arabidopsis gene, and each encodes three transcripts in roots and shoots. Alternative splicing produces two transcripts from each gene in addition to the mRNA encoding the wild-type protein. In both cases, selective inclusion of intron 6 or utilization of a cryptic donor site within the intron 6 sequence generates the two alternatively spliced transcripts. This evolutionary conservation of splicing patterns between different plant species suggests an important biological function for alternative splicing in the expression of the U1-70K gene.
Affiliation(s)
- Smriti Gupta
- Department of Biological Sciences, Oakland University, Rochester, MI 48309-4401, USA
|
17
|
Dong Q, Lawrence CJ, Schlueter SD, Wilkerson MD, Kurtz S, Lushbough C, Brendel V. Comparative plant genomics resources at PlantGDB. PLANT PHYSIOLOGY 2005; 139:610-8. [PMID: 16219921 PMCID: PMC1255980 DOI: 10.1104/pp.104.059212] [Citation(s) in RCA: 43] [Impact Index Per Article: 2.2] [Received: 12/31/2004] [Revised: 05/04/2005] [Accepted: 05/12/2005] [Indexed: 05/04/2023]
Abstract
PlantGDB (http://www.plantgdb.org/) is a database of plant molecular sequences. Expressed sequence tag (EST) sequences are assembled into contigs that represent tentative unique genes. EST contigs are functionally annotated with information derived from known protein sequences that are highly similar to the putative translation products. Tentative Gene Ontology terms are assigned to match those of the similar sequences identified. Genome survey sequences are assembled similarly. The resulting genome survey sequence contigs are matched to ESTs and conserved protein homologs to identify putative full-length open reading frame-containing genes, which are subsequently provisionally classified according to established gene family designations. For Arabidopsis (Arabidopsis thaliana) and rice (Oryza sativa), the exon-intron boundaries for gene structures are annotated by spliced alignment of ESTs and full-length cDNAs to their respective complete genome sequences. Unique genome browsers have been developed to present all available EST and cDNA evidence for current transcript models (for Arabidopsis, see the AtGDB site at http://www.plantgdb.org/AtGDB/; for rice, see the OsGDB site at http://www.plantgdb.org/OsGDB/). In addition, a number of bioinformatic tools have been integrated at PlantGDB that enable researchers to carry out sequence analyses on-site using both their own data and data residing within the database.
Affiliation(s)
- Qunfeng Dong
- Department of Genetics, Development and Cell Biology, Iowa State University, Ames, 50011-3260, USA
|
18
|
Abstract
Expressed sequence tag (EST) data are a major contributor to the known plant sequence space. Organization of the data into non-redundant clusters representing tentative unique genes provides snapshots of the gene repertoires of a species. This chapter reviews availability of sequences and sequence analysis results and describes several resources and tools that should facilitate broad-based utilization of EST data for gene structure annotation, gene discovery, and comparative genomics.
Affiliation(s)
- Qunfeng Dong
- Department of Genetics, Development and Cell Biology, Iowa State University, Ames, Iowa 50011-3260, USA
|
19
|
Yao H, Guo L, Fu Y, Borsuk LA, Wen TJ, Skibbe DS, Cui X, Scheffler BE, Cao J, Emrich SJ, Ashlock DA, Schnable PS. Evaluation of five ab initio gene prediction programs for the discovery of maize genes. PLANT MOLECULAR BIOLOGY 2005; 57:445-60. [PMID: 15830133 DOI: 10.1007/s11103-005-0271-1] [Citation(s) in RCA: 26] [Impact Index Per Article: 1.3] [Received: 08/16/2004] [Accepted: 01/06/2005] [Indexed: 05/20/2023]
Abstract
Five ab initio programs (FGENESH, GeneMark.hmm, GENSCAN, GlimmerR and Grail) were evaluated for their accuracy in predicting maize genes. Two of these programs, GeneMark.hmm and GENSCAN, had been trained for maize; FGENESH had been trained for monocots (including maize), and the others had been trained for rice or Arabidopsis. Initial evaluations were conducted using eight maize genes (gl8a, pdc2, pdc3, rf2c, rf2d, rf2e1, rth1, and rth3) whose sequences had not been released to the public before this evaluation was conducted. The significant advantage of this data set is that these genes could not have been included in the training sets of the prediction programs. FGENESH yielded the most accurate and GeneMark.hmm the second most accurate predictions. The five programs were used in conjunction with RT-PCR to identify and establish the structures of two new genes in the a1-sh2 interval of the maize genome. FGENESH, GeneMark.hmm and GENSCAN were also tested on a larger data set consisting of maize assembled genomic islands (MAGIs) that had been aligned to ESTs. FGENESH, GeneMark.hmm and GENSCAN correctly predicted gene models in 773, 625, and 371 MAGIs, respectively, out of the 1353 MAGIs that comprise data set 2.
Affiliation(s)
- Hong Yao
- Department of Genetics, Development, and Cell Biology, Iowa State University, Ames, Iowa 50011-3650, USA
|
20
|
Gupta S, Gallavotti A, Stryker GA, Schmidt RJ, Lal SK. A novel class of Helitron-related transposable elements in maize contain portions of multiple pseudogenes. Plant Molecular Biology 2005; 57:115-27. [PMID: 15821872 DOI: 10.1007/s11103-004-6636-z] [Citation(s) in RCA: 69] [Impact Index Per Article: 3.5] [Received: 07/02/2004] [Revised: 11/27/2004] [Indexed: 05/08/2023]
Abstract
We recently described a maize mutant caused by an insertion of a Helitron-type transposable element (Lal, S.K., Giroux, M.J., Brendel, V., Vallejos, E. and Hannah, L.C., 2003, Plant Cell, 15: 381-391). Here we describe another Helitron insertion in the barren stalk1 gene of maize. The termini of a 6525 bp insertion in the proximal promoter region of the mutant reference allele of the maize barren stalk1 gene (ba1-ref) share striking similarity with the Helitron insertion we reported in the Shrunken-2 gene. This insertion is embedded with pseudogenes that differ from the pseudogenes discovered in the mutant Shrunken-2 insertion. Using the common terminal ends of the mutant insertions as a query, we discovered other Helitron insertions in maize BAC clones. Based on the comparison of the insertion site and PCR-amplified genomic sequences, these elements inserted between AT dinucleotides. These putative non-autonomous Helitron insertions completely lacked sequences similar to RPA (replication protein A) and DNA helicases reported in other species. A blastn analysis indicated that both the 5' and 3' termini of Helitrons are repeated in the maize genome. These data provide strong evidence that Helitron-type transposable elements are active and may have played an essential role in the evolution and expansion of the maize genome.
MESH Headings
- Base Sequence
- Binding Sites/genetics
- Chromosomes, Artificial, Bacterial/genetics
- Cloning, Molecular
- DNA Helicases/genetics
- DNA Transposable Elements/genetics
- DNA, Plant/chemistry
- DNA, Plant/genetics
- Genes, Plant/genetics
- Genome, Plant
- Molecular Sequence Data
- Multigene Family/genetics
- Mutagenesis, Insertional
- Mutation
- Plant Proteins/genetics
- Pseudogenes/genetics
- Repetitive Sequences, Nucleic Acid/genetics
- Sequence Alignment
- Sequence Analysis, DNA
- Sequence Homology, Nucleic Acid
- Zea mays/genetics
- Zein/genetics
Affiliation(s)
- Smriti Gupta
- Department of Biological Sciences, Oakland University, Rochester, MI 48309-4401, USA
|
21
|
Degroeve S, Saeys Y, De Baets B, Rouzé P, Van de Peer Y. SpliceMachine: predicting splice sites from high-dimensional local context representations. Bioinformatics 2004; 21:1332-8. [PMID: 15564294 DOI: 10.1093/bioinformatics/bti166] [Citation(s) in RCA: 75] [Impact Index Per Article: 3.6] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION In this age of complete genome sequencing, finding the location and structure of genes is crucial for further molecular research. The accurate prediction of intron boundaries largely facilitates the correct prediction of gene structure in nuclear genomes. Many tools for localizing these boundaries on DNA sequences have been developed and are available to researchers through the internet. Nevertheless, these tools still make many false positive predictions. RESULTS This manuscript presents a novel publicly available splice site prediction tool named SpliceMachine that (i) shows state-of-the-art prediction performance on Arabidopsis thaliana and human sequences, (ii) performs a computationally fast annotation and (iii) can be trained by users on their own data. AVAILABILITY Results, figures and software are available at http://www.bioinformatics.psb.ugent.be/supplementary_data/ CONTACT sven.degroeve@psb.ugent.be; yves.vandepeer@psb.ugent.be.
Affiliation(s)
- Sven Degroeve
- Department of Plant Systems Biology, Flanders Interuniversity Institute for Biotechnology (VIB), Technologiepark 927, Gent 9052, Belgium.
|
22
|
Li LH, Li JC, Lin YF, Lin CY, Chen CY, Tsai SF. Genomic shotgun array: a procedure linking large-scale DNA sequencing with regional transcript mapping. Nucleic Acids Res 2004; 32:e27. [PMID: 14960710 PMCID: PMC373421 DOI: 10.1093/nar/gnh025] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.1] [Indexed: 01/30/2023] Open
Abstract
To facilitate transcript mapping and to investigate alterations in genomic structure and gene expression in a defined genomic target, we developed a novel microarray-based method to detect transcriptional activity of the human chromosome 4q22-24 region. Loss of heterozygosity of human 4q22-24 is frequently observed in hepatocellular carcinoma (HCC). One hundred and eighteen well-characterized genes have been identified from this region. We took previously sequenced shotgun subclones as templates to amplify overlapping sequences for the genomic segment and constructed a chromosome-region-specific microarray. Using genomic DNA fragments as probes, we detected transcriptional activity from within this region among five different tissues. The hybridization results indicate that there are new transcripts that have not yet been identified by other methods. The existence of new transcripts encoded by genes in this region was confirmed by PCR cloning or cDNA library screening. The procedure reported here allows coupling of shotgun sequencing with transcript mapping and, potentially, detailed analysis of gene expression and chromosomal copy of the genomic sequence for the putative HCC tumor suppressor gene(s) in the 4q candidate region.
Affiliation(s)
- Ling-Hui Li
- Division of Molecular and Genomic Medicine, National Health Research Institutes, Taipei, Taiwan
|
23
|
Zhu W, Schlueter SD, Brendel V. Refined annotation of the Arabidopsis genome by complete expressed sequence tag mapping. Plant Physiology 2003; 132:469-84. [PMID: 12805580 PMCID: PMC166990 DOI: 10.1104/pp.102.018101] [Citation(s) in RCA: 76] [Impact Index Per Article: 3.5] [Received: 11/21/2002] [Revised: 01/06/2003] [Accepted: 02/20/2003] [Indexed: 05/18/2023]
Abstract
Expressed sequence tags (ESTs) currently encompass more entries in the public databases than any other form of sequence data. Thus, EST data sets provide a vast resource for gene identification and expression profiling. We have mapped the complete set of 176,915 publicly available Arabidopsis EST sequences onto the Arabidopsis genome using GeneSeqer, a spliced alignment program incorporating sequence similarity and splice site scoring. About 96% of the available ESTs could be properly aligned with a genomic locus, with the remaining ESTs deriving from organelle genomes and non-Arabidopsis sources or displaying insufficient sequence quality for alignment. The mapping provides verified sets of EST clusters for evaluation of EST clustering programs. Analysis of the spliced alignments suggests corrections to current gene structure annotation and provides examples of alternative and non-canonical pre-mRNA splicing. All results of this study were parsed into a database and are accessible via a flexible Web interface at http://www.plantgdb.org/AtGDB/.
Affiliation(s)
- Wei Zhu
- Department of Zoology and Genetics, Iowa State University, Ames, Iowa 50011-3260, USA
|
24
|
Schoof H, Karlowski WM. Comparison of rice and Arabidopsis annotation. Current Opinion in Plant Biology 2003; 6:106-112. [PMID: 12667865 DOI: 10.1016/s1369-5266(03)00003-7] [Citation(s) in RCA: 20] [Impact Index Per Article: 0.9] [Indexed: 05/24/2023]
Abstract
Several versions of the rice genome were published in 2002, providing a first overview of the genome content of this model monocot. At the same time, the genome of the model dicot, Arabidopsis thaliana, reached a new level of annotation as thousands of full-length cDNA sequences were integrated with the genome sequence.
Affiliation(s)
- Heiko Schoof
- Technical University of Munich, Genome Oriented Bioinformatics, Wissenschaftszentrum Weihenstephan, 85350 Freising, Germany.
|
25
|
Zhang W, Wang Y, Long J, Girton J, Johansen J, Johansen KM. A developmentally regulated splice variant from the complex lola locus encoding multiple different zinc finger domain proteins interacts with the chromosomal kinase JIL-1. J Biol Chem 2003; 278:11696-704. [PMID: 12538650 DOI: 10.1074/jbc.m213269200] [Citation(s) in RCA: 28] [Impact Index Per Article: 1.3] [Indexed: 11/06/2022] Open
Abstract
Using a yeast two-hybrid screen we have identified a novel isoform of the lola locus, Lola zf5, that interacts with the chromosomal kinase JIL-1. We characterized the lola locus and provide evidence that it is a complex locus from which at least 17 different splice variants are likely to be generated. Fifteen of these each have a different zinc finger domain, whereas two are without. This potential for expression of multiple gene products suggests that they serve diverse functional roles in different developmental contexts. By Northern and Western blot analyses we demonstrate that the expression of Lola zf5 is developmentally regulated and that it is restricted to early embryogenesis. Immunocytochemical labeling with a Lola zf5-specific antibody of Drosophila embryos indicates that Lola zf5 is localized to nuclei. Furthermore, by creating double-mutant flies we show that a reduction of Lola protein levels resulting from mutations in the lola locus acts as a dominant modifier of a hypomorphic JIL-1 allele leading to an increase in embryonic viability. Thus, genetic interaction assays provide direct evidence that gene products from the lola locus function within the same pathway as the chromosomal kinase JIL-1.
Affiliation(s)
- Weiguo Zhang
- Department of Zoology and Genetics, Iowa State University, Ames, Iowa 50011, USA
|
26
|
Pachter L, Alexandersson M, Cawley S. Applications of generalized pair hidden Markov models to alignment and gene finding problems. J Comput Biol 2002; 9:389-99. [PMID: 12015888 DOI: 10.1089/10665270252935520] [Citation(s) in RCA: 56] [Impact Index Per Article: 2.4] [Indexed: 11/12/2022] Open
Abstract
Hidden Markov models (HMMs) have been successfully applied to a variety of problems in molecular biology, ranging from alignment problems to gene finding and annotation. Alignment problems can be solved with pair HMMs, while gene finding programs rely on generalized HMMs in order to model exon lengths. In this paper, we introduce the generalized pair HMM (GPHMM), which is an extension of both pair and generalized HMMs. We show how GPHMMs, in conjunction with approximate alignments, can be used for cross-species gene finding and describe applications to DNA-cDNA and DNA-protein alignment. GPHMMs provide a unifying and probabilistically sound theory for modeling these problems.
Affiliation(s)
- Lior Pachter
- Department of Mathematics, University of California Berkeley, Berkeley, CA 94720, USA.
|
27
|
Mathé C, Sagot MF, Schiex T, Rouzé P. Current methods of gene prediction, their strengths and weaknesses. Nucleic Acids Res 2002; 30:4103-17. [PMID: 12364589 PMCID: PMC140543 DOI: 10.1093/nar/gkf543] [Citation(s) in RCA: 209] [Impact Index Per Article: 9.1] [Received: 05/28/2002] [Revised: 08/07/2002] [Accepted: 08/07/2002] [Indexed: 11/14/2022] Open
Abstract
While the genomes of many organisms have been sequenced over the last few years, transforming such raw sequence data into knowledge remains a hard task. A great number of prediction programs have been developed that try to address one part of this problem, which consists of locating the genes along a genome. This paper reviews the existing approaches to predicting genes in eukaryotic genomes and underlines their intrinsic advantages and limitations. The main mathematical models and computational algorithms adopted are also briefly described and the resulting software classified according to both the method and the type of evidence used. Finally, the several difficulties and pitfalls encountered by the programs are detailed, showing that improvements are needed and that new directions must be considered.
Affiliation(s)
- Catherine Mathé
- Institut de Pharmacologie et Biologie Structurale, UMR 5089, 205 route de Narbonne, F-31077 Toulouse Cedex, France.
|
28
|
Abstract
The completed Arabidopsis genome seems to be of limited value as a model for maize genomics. In addition to the expansion of repetitive sequences in maize and the lack of genomic micro-colinearity, maize-specific or highly-diverged proteins contribute to a predicted maize proteome of about 50,000 proteins, twice the size of that of Arabidopsis.
Affiliation(s)
- Volker Brendel
- Department of Zoology and Genetics and Department of Statistics, Iowa State University, Ames, IA 50010, USA.
|
29
|
Abstract
In the post-genomic era, the new discipline of functional genomics is now facing the challenge of associating a function (as well as estimating its relevance to industrial applications) with about 100,000 microbial, plant or animal genes of known sequence but unknown function. Besides the design of databases, computational methods are increasingly becoming intimately linked with the various experimental approaches. Consequently, bioinformatics is rapidly evolving into independent fields addressing the specific problems of interpreting i) genomic sequences, ii) protein sequences and 3D-structures, as well as iii) transcriptome and macromolecular interaction data. It is thus increasingly difficult for the biologist to choose the computational approaches that perform best in these various areas. This paper attempts to review the most useful developments of the last 2 years.
Affiliation(s)
- J M Claverie
- Structural and Genetic Information Laboratory, UMR 1889 CNRS-AVENTIS, 31 Chemin Joseph Aiguier, 13402 Marseille Cedex 20, France.
|
30
|
Abstract
Since the structure of the DNA molecule was identified half a century ago, the complete genome sequence has been determined for 37 prokaryotes and several eukaryotes. With the exponential growth of genetic information, bioinformatics has attempted to predict gene locations and functions in cyberspace prior to experimental confirmation at the bench.
Affiliation(s)
- Y Cho
- Stanford Genome Technology Center, 855 California Avenue, Palo Alto, CA 94304-1103, USA.
|
31
|
Stamm S, Zhu J, Nakai K, Stoilov P, Stoss O, Zhang MQ. An alternative-exon database and its statistical analysis. DNA Cell Biol 2000; 19:739-56. [PMID: 11177572 DOI: 10.1089/104454900750058107] [Citation(s) in RCA: 129] [Impact Index Per Article: 5.2] [Indexed: 11/12/2022] Open
Abstract
We compiled a comprehensive database of alternative exons from the literature and analyzed them statistically. Most alternative exons are cassette exons and are expressed in more than two tissues. Of all exons whose expression was reported to be specific for a certain tissue, the majority were expressed in the brain. Whereas the length of constitutive exons follows a normal distribution, the distribution of alternative exons is skewed toward smaller ones. Furthermore, alternative-exon splice sites deviate more from the consensus: their 3' splice sites are characterized by a higher purine content in the polypyrimidine stretch, and their 5' splice sites deviate from the consensus sequence mostly at the +4 and +5 positions. Furthermore, for exons expressed in a single tissue, adenosine is more frequently used at the -3 position of the 3' splice site. In addition to the known AC-rich and purine-rich exonic sequence elements, sequence comparison using a Gibbs algorithm identified several motifs in exons surrounded by weak splice sites and in tissue-specific exons. Together, these data indicate a combinatorial effect of weak splice sites, atypical nucleotide usage at certain positions, and functional enhancers as an important contribution to alternative-exon regulation.
Affiliation(s)
- S Stamm
- Institute of Biochemistry, University of Erlangen-Nuremberg, Erlangen, Germany.
|
32
|
Current awareness on comparative and functional genomics. Yeast 2000; 17:339-46. [PMID: 11119313 PMCID: PMC2448380 DOI: 10.1002/1097-0061(200012)17:4<339::aid-yea10>3.0.co;2-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/11/2022] Open
|