1
Dvorkina T, Bankevich A, Sorokin A, Yang F, Adu-Oppong B, Williams R, Turner K, Pevzner PA. ORFograph: search for novel insecticidal protein genes in genomic and metagenomic assembly graphs. MICROBIOME 2021; 9:149. [PMID: 34183047 PMCID: PMC8240309 DOI: 10.1186/s40168-021-01092-z] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 03/26/2021] [Accepted: 05/11/2021] [Indexed: 05/07/2023]
Abstract
BACKGROUND: Since the prolonged use of insecticidal proteins has led to toxin resistance, it is important to search for novel insecticidal protein genes (IPGs) that are effective in controlling resistant insect populations. IPGs are usually encoded in the genomes of entomopathogenic bacteria, especially in large plasmids in strains of the ubiquitous soil bacterium Bacillus thuringiensis (Bt). Since there are often multiple similar IPGs encoded by such plasmids, their assemblies are typically fragmented and many IPGs are scattered across multiple contigs. As a result, existing gene prediction tools (that analyze individual contigs) typically predict partial rather than complete IPGs, making it difficult to conduct downstream IPG engineering efforts in agricultural genomics. METHODS: Although it is difficult to assemble IPGs in a single contig, the structure of the genome assembly graph often provides clues on how to combine multiple contigs into segments encoding a single IPG. RESULTS: We describe ORFograph, a pipeline for predicting IPGs in assembly graphs, benchmark it on (meta)genomic datasets, and discover nearly a hundred novel IPGs. This work shows that graph-aware gene prediction tools enable the discovery of a greater diversity of IPGs from (meta)genomes. CONCLUSIONS: We demonstrated that analysis of the assembly graphs reveals novel candidate IPGs. ORFograph identified both already known genes "hidden" in assembly graphs and potential novel IPGs that evaded existing tools for IPG identification. As ORFograph is fast, one could imagine a pipeline that processes many (meta)genomic assembly graphs to identify even more novel IPGs for phenotypic testing, including ones previously inaccessible to traditional gene-finding methods. While here we demonstrated the results of ORFograph only for IPGs, the proposed approach can be generalized to any class of genes.
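The point that a complete gene may only appear once several contigs are joined along a graph path can be illustrated with a toy example. This is a minimal sketch and not ORFograph's algorithm: it assumes a hypothetical graph in which an edge means two contigs can be concatenated directly (real assembly graphs have k-mer overlaps, which are ignored here), and it searches only the forward strand. The names `contigs`, `edges`, `longest_orf`, `simple_paths`, and `orfs_along_paths` are illustrative.

```python
from typing import Dict, Iterator, List, Tuple

START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}


def longest_orf(seq: str) -> str:
    """Longest forward-strand ORF (ATG through the first in-frame stop), or ""."""
    best = ""
    for i in range(len(seq) - 2):
        if seq[i:i + 3] != START:
            continue
        # Walk codon by codon from the start codon until a stop is found.
        for j in range(i, len(seq) - 2, 3):
            if seq[j:j + 3] in STOPS:
                if j + 3 - i > len(best):
                    best = seq[i:j + 3]
                break
    return best


def simple_paths(edges: Dict[str, List[str]], start: str,
                 max_contigs: int) -> Iterator[List[str]]:
    """Enumerate simple paths of at most max_contigs contigs from `start`."""
    stack = [[start]]
    while stack:
        path = stack.pop()
        yield path
        if len(path) < max_contigs:
            for nxt in edges.get(path[-1], []):
                if nxt not in path:
                    stack.append(path + [nxt])


def orfs_along_paths(contigs: Dict[str, str], edges: Dict[str, List[str]],
                     max_contigs: int = 3) -> Dict[Tuple[str, ...], str]:
    """Map each path (as a tuple of contig names) to its longest ORF, if any."""
    found = {}
    for start in contigs:
        for path in simple_paths(edges, start, max_contigs):
            orf = longest_orf("".join(contigs[c] for c in path))
            if orf:
                found[tuple(path)] = orf
    return found


# Toy graph: c1 holds the start of a gene and branches into two alternative ends.
contigs = {"c1": "CCATGAAA", "c2": "GGGTTTTAA", "c3": "GGGTAGCC"}
edges = {"c1": ["c2", "c3"]}
result = orfs_along_paths(contigs, edges)
```

In this toy graph no single contig contains a complete ORF, while the two paths through `c1` yield two different complete ORFs that share a prefix, which mirrors the situation of several similar genes on one plasmid.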
Affiliation(s)
- Tatiana Dvorkina
  - Center for Algorithmic Biotechnology, Saint Petersburg State University, Saint Petersburg, Russia
- Anton Bankevich
  - Department of Computer Science and Engineering, University of California San Diego, San Diego, CA, USA
- Alexei Sorokin
  - Université Paris-Saclay, INRAE, Micalis Institute, AgroParisTech, 78350 Jouy-en-Josas, France
- Fan Yang
  - Data Science & Analytics, Bayer U.S. - Crop Science, Chesterfield, MO, USA
  - Ascus Biosciences, San Diego, CA, USA
- Boahemaa Adu-Oppong
  - Data Science & Analytics, Bayer U.S. - Crop Science, Chesterfield, MO, USA
  - Thermo Fisher Scientific, Carlsbad, CA, USA
- Ryan Williams
  - Data Science & Analytics, Bayer U.S. - Crop Science, Chesterfield, MO, USA
- Keith Turner
  - Data Science & Analytics, Bayer U.S. - Crop Science, Chesterfield, MO, USA
- Pavel A. Pevzner
  - Department of Computer Science and Engineering, University of California San Diego, San Diego, CA, USA
2
Brůna T, Hoff KJ, Lomsadze A, Stanke M, Borodovsky M. BRAKER2: automatic eukaryotic genome annotation with GeneMark-EP+ and AUGUSTUS supported by a protein database. NAR Genom Bioinform 2021; 3:lqaa108. [PMID: 33575650 PMCID: PMC7787252 DOI: 10.1093/nargab/lqaa108] [Citation(s) in RCA: 510] [Impact Index Per Article: 170.0] [Received: 08/10/2020] [Revised: 11/26/2020] [Accepted: 12/20/2020] [Indexed: 12/13/2022]
Abstract
The task of eukaryotic genome annotation remains challenging. Only a few genomes could serve as standards of annotation achieved through a tremendous investment of human curation efforts. Still, the correctness of all alternative isoforms, even in the best-annotated genomes, could be a good subject for further investigation. The new BRAKER2 pipeline generates and integrates external protein support into the iterative process of training and gene prediction by GeneMark-EP+ and AUGUSTUS. BRAKER2 continues the line started by BRAKER1, where self-training GeneMark-ET and AUGUSTUS made gene predictions supported by transcriptomic data. Among the challenges addressed by the new pipeline was the generation of reliable hints to protein-coding exon boundaries from likely homologous but evolutionarily distant proteins. In comparison with other pipelines for eukaryotic genome annotation, BRAKER2 is fully automatic. It compares favorably, under equal conditions, with other pipelines, e.g. MAKER2, in terms of accuracy and performance. Development of BRAKER2 should facilitate solving the task of harmonization of annotation of protein-coding genes in genomes of different eukaryotic species. However, we fully understand that several more innovations are needed in transcriptomic and proteomic technologies as well as in algorithmic development to reach the goal of highly accurate annotation of eukaryotic genomes.
Affiliation(s)
- Tomáš Brůna
  - School of Biological Sciences, Georgia Institute of Technology, Atlanta, GA 30332, USA
- Katharina J Hoff
  - Institute of Mathematics and Computer Science, University of Greifswald, 17489 Greifswald, Germany
  - Center for Functional Genomics of Microbes, University of Greifswald, 17489 Greifswald, Germany
- Alexandre Lomsadze
  - Wallace H. Coulter Department of Biomedical Engineering, Georgia Institute of Technology, Atlanta, GA 30332, USA
- Mario Stanke
  - Institute of Mathematics and Computer Science, University of Greifswald, 17489 Greifswald, Germany
  - Center for Functional Genomics of Microbes, University of Greifswald, 17489 Greifswald, Germany
- Mark Borodovsky
  - Wallace H. Coulter Department of Biomedical Engineering, Georgia Institute of Technology, Atlanta, GA 30332, USA
  - School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, GA 30332, USA
3
Levy Karin E, Mirdita M, Söding J. MetaEuk-sensitive, high-throughput gene discovery, and annotation for large-scale eukaryotic metagenomics. MICROBIOME 2020; 8:48. [PMID: 32245390 PMCID: PMC7126354 DOI: 10.1186/s40168-020-00808-x] [Citation(s) in RCA: 86] [Impact Index Per Article: 21.5] [Received: 11/15/2019] [Accepted: 02/14/2020] [Indexed: 05/10/2023]
Abstract
BACKGROUND: Metagenomics is revolutionizing the study of microorganisms and their involvement in biological, biomedical, and geochemical processes, allowing us to investigate by direct sequencing a tremendous diversity of organisms without the need for prior cultivation. Unicellular eukaryotes play essential roles in most microbial communities as chief predators, decomposers, phototrophs, bacterial hosts, symbionts, and parasites to plants and animals. Investigating their roles is therefore of great interest to ecology, biotechnology, human health, and evolution. However, the generally lower sequencing coverage, their more complex gene and genome architectures, and a lack of eukaryote-specific experimental and computational procedures have kept them on the sidelines of metagenomics. RESULTS: MetaEuk is a toolkit for high-throughput, reference-based discovery, and annotation of protein-coding genes in eukaryotic metagenomic contigs. It performs fast searches with 6-frame-translated fragments covering all possible exons and optimally combines matches into multi-exon proteins. We used a benchmark of seven diverse, annotated genomes to show that MetaEuk is highly sensitive even under conditions of low sequence similarity to the reference database. To demonstrate MetaEuk's power to discover novel eukaryotic proteins in large-scale metagenomic data, we assembled contigs from 912 samples of the Tara Oceans project. MetaEuk predicted >12,000,000 protein-coding genes in 8 days on ten 16-core servers. Most of the discovered proteins are highly diverged from known proteins and originate from very sparsely sampled eukaryotic supergroups. CONCLUSION: The open-source (GPLv3) MetaEuk software (https://github.com/soedinglab/metaeuk) enables large-scale eukaryotic metagenomics through reference-based, sensitive taxonomic and functional annotation.
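The idea of 6-frame-translated fragments can be sketched in a few lines. The example below is an illustration under stated assumptions, not MetaEuk's implementation: it uses the standard genetic code, treats every stop-free stretch in each of the six reading frames as a candidate fragment, and uses a hypothetical `min_len` cutoff. The function names are illustrative.

```python
from itertools import product
from typing import List, Tuple

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq: str) -> str:
    """Reverse complement of an uppercase A/C/G/T string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq: str) -> str:
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def six_frame_fragments(seq: str, min_len: int = 2) -> List[Tuple[str, int, str]]:
    """Return (strand, frame, peptide) for stop-free stretches in all six frames."""
    seq = seq.upper()
    fragments = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            for peptide in translate(s[frame:]).split("*"):
                if len(peptide) >= min_len:
                    fragments.append((strand, frame, peptide))
    return fragments


fragments = six_frame_fragments("ATGAAATAG")
```

Each returned peptide could then be searched against a protein reference database; combining compatible matches into multi-exon genes is the part the sketch leaves out.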
Affiliation(s)
- Eli Levy Karin
  - Quantitative and Computational Biology, Max-Planck Institute for Biophysical Chemistry, 37077, Göttingen, Germany
- Milot Mirdita
  - Quantitative and Computational Biology, Max-Planck Institute for Biophysical Chemistry, 37077, Göttingen, Germany
- Johannes Söding
  - Quantitative and Computational Biology, Max-Planck Institute for Biophysical Chemistry, 37077, Göttingen, Germany
4
Abstract
Rapidly improving sequencing technology coupled with computational developments in sequence assembly are making reference-quality genome assembly economical. Hundreds of vertebrate genome assemblies are now publicly available, and projects are being proposed to sequence thousands of additional species in the next few years. Such dense sampling of the tree of life should give an unprecedented new understanding of evolution and allow a detailed determination of the events that led to the wealth of biodiversity around us. To gain this knowledge, these new genomes must be compared through genome alignment (at the sequence level) and comparative annotation (at the gene level). However, different alignment and annotation methods have different characteristics; before starting a comparative genomics analysis, it is important to understand the nature of, and biases and limitations inherent in, the chosen methods. This review is intended to act as a technical but high-level overview of the field that should provide this understanding. We briefly survey the state of the genome alignment and comparative annotation fields and potential future directions for these fields in a new, large-scale era of comparative genomics.
Affiliation(s)
- Joel Armstrong
  - UC Santa Cruz Genomics Institute, University of California, Santa Cruz, California 95064, USA
- Ian T Fiddes
  - UC Santa Cruz Genomics Institute, University of California, Santa Cruz, California 95064, USA
  - 10x Genomics, Pleasanton, California 94566, USA
- Mark Diekhans
  - UC Santa Cruz Genomics Institute, University of California, Santa Cruz, California 95064, USA
- Benedict Paten
  - UC Santa Cruz Genomics Institute, University of California, Santa Cruz, California 95064, USA
5
Kou Q, Wu S, Tolic N, Paša-Tolic L, Liu Y, Liu X. A mass graph-based approach for the identification of modified proteoforms using top-down tandem mass spectra. Bioinformatics 2018; 33:1309-1316. [PMID: 28453668 DOI: 10.1093/bioinformatics/btw806] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.7] [Received: 09/13/2016] [Accepted: 12/15/2016] [Indexed: 11/14/2022]
Abstract
Motivation: Although proteomics has rapidly developed in the past decade, researchers are still in the early stage of exploring the world of complex proteoforms, which are protein products with various primary structure alterations resulting from gene mutations, alternative splicing, post-translational modifications, and other biological processes. Proteoform identification is essential to mapping proteoforms to their biological functions as well as discovering novel proteoforms and new protein functions. Top-down mass spectrometry is the method of choice for identifying complex proteoforms because it provides a 'bird's eye view' of intact proteoforms. The combinatorial explosion of various alterations on a protein may result in billions of possible proteoforms, making proteoform identification a challenging computational problem. Results: We propose a new data structure, called the mass graph, for efficient representation of proteoforms and design mass graph alignment algorithms. We developed TopMG, a mass graph-based software tool for proteoform identification by top-down mass spectrometry. Experiments on top-down mass spectrometry datasets showed that TopMG outperformed existing methods in identifying complex proteoforms. Availability and implementation: http://proteomics.informatics.iupui.edu/software/topmg/. Contact: xwliu@iupui.edu. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Qiang Kou
  - Department of BioHealth Informatics, Indiana University-Purdue University Indianapolis, Indianapolis, IN 46202, USA
- Si Wu
  - Department of Chemistry and Biochemistry, University of Oklahoma, Norman, OK 73019, USA
- Nikola Tolic
  - Environmental Molecular Sciences Laboratory, Pacific Northwest National Laboratory, Richland, WA 99354, USA
- Ljiljana Paša-Tolic
  - Environmental Molecular Sciences Laboratory, Pacific Northwest National Laboratory, Richland, WA 99354, USA
- Yunlong Liu
  - Department of Medical and Molecular Genetics, Indiana University School of Medicine, Indianapolis, IN 46202, USA
  - Center for Computational Biology and Bioinformatics, Indiana University School of Medicine, Indianapolis, IN 46202, USA
- Xiaowen Liu
  - Department of BioHealth Informatics, Indiana University-Purdue University Indianapolis, Indianapolis, IN 46202, USA
  - Center for Computational Biology and Bioinformatics, Indiana University School of Medicine, Indianapolis, IN 46202, USA
6
Chowdhury B, Garai A, Garai G. An optimized approach for annotation of large eukaryotic genomic sequences using genetic algorithm. BMC Bioinformatics 2017; 18:460. [PMID: 29065853 PMCID: PMC5655831 DOI: 10.1186/s12859-017-1874-7] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 06/14/2017] [Accepted: 10/17/2017] [Indexed: 01/06/2023]
Abstract
BACKGROUND: Detection of important functional and/or structural elements and identification of their positions in a large eukaryotic genomic sequence is an active research area. The gene is an important functional and structural unit of DNA. Computational gene prediction is therefore essential for detailed genome annotation. RESULTS: In this paper, we propose a new gene prediction technique based on a Genetic Algorithm (GA) to determine the optimal positions of the exons of a gene in a chromosome or genome. The correct identification of the coding and non-coding regions is difficult and computationally demanding. The proposed genetic-based method, named Gene Prediction with Genetic Algorithm (GPGA), reduces this problem by searching for only one exon at a time instead of all exons along with their introns. This representation carries a significant advantage in that it breaks the entire gene-finding problem into a number of smaller sub-problems, thereby reducing the computational complexity. We tested the performance of the GPGA with existing benchmark datasets and compared the results with well-known and relevant techniques. The comparison shows better or comparable performance of the proposed method. We also used GPGA for annotating the human chromosome 21 (HS21) using cross-species comparisons with the mouse orthologs. CONCLUSION: It was noted that the GPGA predicted true genes with better accuracy than other well-known approaches.
Affiliation(s)
- Biswanath Chowdhury
  - Department of Biophysics, Molecular Biology and Bioinformatics, University of Calcutta, Kolkata, 700009 WB India
- Arnav Garai
  - Unit of Energy, Utilities, Communications and Services, Infosys Technologies Ltd., Bhubaneswar, 751024 Odisha India
- Gautam Garai
  - Computational Sciences Division, Saha Institute of Nuclear Physics, Kolkata, 700064 WB India
7
Singh A, Mishra A, Khosravi A, Khandelwal G, Jayaram B. Physico-chemical fingerprinting of RNA genes. Nucleic Acids Res 2017; 45:e47. [PMID: 27932456 PMCID: PMC5397174 DOI: 10.1093/nar/gkw1236] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Received: 11/09/2016] [Accepted: 11/29/2016] [Indexed: 12/13/2022]
Abstract
We advance here a novel concept for characterizing different classes of RNA genes on the basis of physico-chemical properties of DNA sequences. As knowledge-based approaches could yield unsatisfactory outcomes due to limitations of training on available experimental data sets, alternative approaches that utilize properties intrinsic to DNA are needed to supplement training-based methods and to eventually provide molecular insights into genome organization. Based on a comprehensive series of molecular dynamics simulations by the Ascona B-DNA Consortium, we extracted hydrogen bonding, stacking and solvation energies of all combinations of DNA sequences at the dinucleotide level and calculated these properties for different types of RNA genes. Considering ∼7.3 million mRNA, 255 524 tRNA, 40 649 rRNA (different subunits), 5250 miRNA and 3747 snRNA gene sequences from 9282 complete genome chromosomes of all prokaryotes and eukaryotes available at NCBI, we observed that physico-chemical properties of different functional units on genomic DNA differ in their signatures.
Affiliation(s)
- Ankita Singh
  - Supercomputing Facility for Bioinformatics & Computational Biology, Indian Institute of Technology, Hauz Khas, New Delhi-110016, India
- Akhilesh Mishra
  - Supercomputing Facility for Bioinformatics & Computational Biology, Indian Institute of Technology, Hauz Khas, New Delhi-110016, India
  - Kusuma School of Biological Sciences, Indian Institute of Technology, Hauz Khas, New Delhi-110016, India
- Ali Khosravi
  - Ale-Taha Institute of Higher Education, Tehran, Iran
- Garima Khandelwal
  - Cancer Research UK Manchester Institute, The University of Manchester, Wilmslow Road, Manchester M20 4BX, UK
- B Jayaram
  - Supercomputing Facility for Bioinformatics & Computational Biology, Indian Institute of Technology, Hauz Khas, New Delhi-110016, India
  - Kusuma School of Biological Sciences, Indian Institute of Technology, Hauz Khas, New Delhi-110016, India
  - Department of Chemistry, Indian Institute of Technology, Hauz Khas, New Delhi-110016, India
8
Bermudez-Santana CI. APLICACIONES DE LA BIOINFORMÁTICA EN LA MEDICINA: EL GENOMA HUMANO. ¿CÓMO PODEMOS VER TANTO DETALLE? [Applications of bioinformatics in medicine: the human genome. How can we see so much detail?] ACTA BIOLÓGICA COLOMBIANA 2016. [DOI: 10.15446/abc.v21n1supl.51233] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/09/2022]
Abstract
Bioinformatics is a new field that supports part of the biological research aimed at identifying gene variants that can be discovered from studies of whole genomes. With this motivation, an overview of the main contributions of bioinformatics to the sequencing of the first human genome is presented. Additionally, the main computational advances developed to meet the demands of re-sequencing a human genome with next-generation sequencing technologies are summarized. Finally, some new challenges that must be faced in applying personalized genomics to the development of medicine are introduced.
9
Podicheti R, Mockaitis K. FEATnotator: A tool for integrated annotation of sequence features and variation, facilitating interpretation in genomics experiments. Methods 2015; 79-80:11-7. [PMID: 25934264 DOI: 10.1016/j.ymeth.2015.04.028] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.8] [Received: 10/28/2014] [Revised: 03/25/2015] [Accepted: 04/22/2015] [Indexed: 11/16/2022]
Abstract
As approaches are sought for more efficient and democratized uses of non-model and expanded model genomics references, ease of integration of genomic feature datasets is especially desirable in multidisciplinary research communities. Valuable conclusions are often missed or slowed when researchers refer experimental results to a single reference sequence that lacks integrated pan-genomic and multi-experiment data in accessible formats. Association of genomic positional information, such as results from an expansive variety of next-generation sequencing experiments, with annotated reference features such as genes or predicted protein binding sites, provides the context essential for conclusions and ongoing research. When the experimental system includes polymorphic genomic inputs, rapid calculation of gene structural and protein translational effects of sequence variation from the reference can be invaluable. Here we present FEATnotator, a lightweight, fast and easy to use open source software program that integrates and reports overlap and proximity in genomic information from any user-defined datasets including those from next generation sequencing applications. We illustrate use of the tool by summarizing whole genome sequence variation of a widely used natural isolate of Arabidopsis thaliana in the context of gene models of the reference accession. Previous discovery of a protein coding deletion influencing root development is replicated rapidly. Appropriate even in investigations of a single gene or genic regions such as QTL, comprehensive reports provided by FEATnotator better prepare researchers for interpretation of their experimental results. The tool is available for download at http://featnotator.sourceforge.net.
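Reporting overlap and proximity between two sets of genomic features reduces to simple interval arithmetic. The sketch below is a hedged illustration of that idea only; it is not FEATnotator's code or interface, and the `Feature` class, the 1-based inclusive coordinate convention, and the `window` parameter are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Feature:
    name: str
    chrom: str
    start: int  # 1-based, inclusive (assumed convention)
    end: int    # 1-based, inclusive


def overlap_bases(a: Feature, b: Feature) -> int:
    """Number of positions shared by two features (0 if none)."""
    if a.chrom != b.chrom:
        return 0
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)


def gap_bases(a: Feature, b: Feature) -> int:
    """Bases strictly between two same-chromosome features (0 if they touch or overlap)."""
    return max(0, max(a.start, b.start) - min(a.end, b.end) - 1)


def annotate(queries: List[Feature], references: List[Feature],
             window: int) -> List[Tuple[str, str, str, int]]:
    """Report each query/reference pair that overlaps or lies within `window` bases."""
    report = []
    for q in queries:
        for r in references:
            if q.chrom != r.chrom:
                continue
            shared = overlap_bases(q, r)
            if shared > 0:
                report.append((q.name, r.name, "overlap", shared))
            elif gap_bases(q, r) <= window:
                report.append((q.name, r.name, "nearby", gap_bases(q, r)))
    return report


genes = [Feature("G1", "chr1", 100, 200), Feature("G2", "chr1", 500, 600)]
variants = [Feature("v1", "chr1", 150, 150),
            Feature("v2", "chr1", 450, 450),
            Feature("v3", "chr2", 150, 150)]
report = annotate(variants, genes, window=100)
```

The nested loop is quadratic in the number of features; for genome-scale inputs a sorted sweep or an interval tree would be the usual choice.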
Affiliation(s)
- Ram Podicheti
  - Center for Genomics and Bioinformatics, Indiana University, 1001 E. Third Street, Bloomington, IN 47405, USA
  - School of Informatics and Computing, Indiana University, 919 E. Tenth Street, Bloomington, IN 47408, USA
- Keithanne Mockaitis
  - Pervasive Technology Institute, Indiana University, 2709 E. Tenth Street, Bloomington, IN 47408, USA
  - Department of Biology, Indiana University, 915 E. Third Street, Bloomington, IN 47405, USA
10
Adi SS, Ferreira CE. Syntenic global alignment and its application to the gene prediction problem. JOURNAL OF THE BRAZILIAN COMPUTER SOCIETY 2013. [DOI: 10.1007/s13173-013-0115-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/26/2022]
Abstract
Given the increasing number of available genomic sequences, one now faces the task of identifying their protein coding regions. The gene prediction problem can be addressed in several ways, and one of the most promising methods makes use of information derived from the comparison of homologous sequences. In this work, we develop a new comparative-based gene prediction program, called Exon_Finder2. This tool is based on a new type of alignment we propose, called syntenic global alignment, that can deal satisfactorily with sequences that share regions with different rates of conservation. In addition to this new type of alignment itself, we also describe a dynamic programming algorithm that computes a best syntenic global alignment of two sequences, as well as its related score. The applicability of our approach was validated by the promising initial results achieved by Exon_Finder2. On a benchmark including 120 pairs of human and mouse genomic sequences, most of their encoded genes were successfully identified by our program.
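For readers unfamiliar with dynamic-programming alignment, the classic Needleman-Wunsch global alignment is a useful baseline. The sketch below implements that textbook algorithm with a linear gap penalty; it is not the syntenic global alignment proposed in the paper, whose scoring of differently conserved regions is not described in the abstract. The scoring parameters are arbitrary illustrative values.

```python
from typing import List, Tuple


def global_alignment(a: str, b: str, match: int = 1, mismatch: int = -1,
                     gap: int = -2) -> Tuple[int, str, str]:
    """Needleman-Wunsch global alignment; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    score: List[List[int]] = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def sub(x: str, y: str) -> int:
        return match if x == y else mismatch

    # Fill the table: each cell is the best of substitution, deletion, insertion.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + sub(a[i - 1], b[j - 1]),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(a[i - 1], b[j - 1]):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


best_score, row_a, row_b = global_alignment("ACGT", "ACT")
```

A comparative gene finder would replace the uniform match and mismatch scores with region-dependent ones, so that conserved coding segments and diverged non-coding segments are scored differently.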
11
Chakraborty S. A fragmented alignment method detects a putative phosphorylation site and a putative BRC repeat in the Drosophila melanogaster BRCA2 protein. F1000Res 2013; 2:143. [PMID: 24627786 PMCID: PMC3924952 DOI: 10.12688/f1000research.2-143.v2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 10/07/2013] [Indexed: 11/28/2022]
Abstract
Mutations in the BRCA2 tumor suppressor protein leave individuals susceptible to breast, ovarian and other cancers. The BRCA2 protein is a critical component of the DNA repair pathways in eukaryotes, and also plays an integral role in fostering genomic variability through meiotic recombination. Although present in many eukaryotes, as a whole the BRCA2 gene is weakly conserved. Conserved fragments of 30 amino acids (BRC repeats), which mediate interactions with the recombinase RAD51, helped detect orthologs of this protein in other organisms. The carboxy-terminal of the human BRCA2 has been shown to be phosphorylated by checkpoint kinases (Chk1/Chk2) at T3387, which regulate the sequestration of RAD51 on DNA damage. However, apart from three BRC repeats, the Drosophila melanogaster gene has not been annotated and associated with other functionally relevant sequence fragments in human BRCA2. In the current work, the carboxy-terminal phosphorylation threonine site (E=9.1e-4) and a new BRC repeat (E=17e-4) in D. melanogaster have been identified, using a fragmented alignment methodology (FRAGAL). In a similar study, FRAGAL has also identified a novel half-a-tetratricopeptide (HAT) motif (E=11e-4), a helical repeat motif implicated in various aspects of RNA metabolism, in Utp6 from yeast. The characteristic three aromatic residues with conserved spacing are observed in this new HAT repeat, further strengthening my claim. The reference and target sequences are sliced into overlapping fragments of equal parameterized lengths. All pairs of fragments in the reference and target proteins are aligned, and the gap penalties are adjusted to discourage gaps in the middle of the alignment. The results of the best matches are sorted based on differing criteria to aid the detection of known and putative sequences. The source code for FRAGAL results on these sequences is available at https://github.com/sanchak/FragalCode, while the database can be accessed at www.sanchak.com/fragal.html.
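The fragment-slicing and all-pairs comparison described above can be sketched as follows. This is a simplified illustration and not the FRAGAL code: it scores each fragment pair by ungapped identity, a stand-in assumption for the gap-discouraging alignment in the abstract, and the names `fragments`, `identity`, and `best_fragment_matches` are hypothetical.

```python
from typing import List, Tuple


def fragments(seq: str, length: int, step: int = 1) -> List[Tuple[int, str]]:
    """Slice a sequence into overlapping fragments of equal length, with offsets."""
    return [(i, seq[i:i + length]) for i in range(0, len(seq) - length + 1, step)]


def identity(a: str, b: str) -> int:
    """Count identical positions between two equal-length fragments (no gaps)."""
    return sum(x == y for x, y in zip(a, b))


def best_fragment_matches(reference: str, target: str, length: int,
                          top: int = 3) -> List[Tuple[int, int, int, str, str]]:
    """Compare all fragment pairs and return the top hits.

    Each hit is (identity, reference_offset, target_offset, ref_frag, target_frag).
    """
    hits = [(identity(rf, tf), ri, ti, rf, tf)
            for ri, rf in fragments(reference, length)
            for ti, tf in fragments(target, length)]
    # Highest identity first; ties broken by position for a stable order.
    hits.sort(key=lambda h: (-h[0], h[1], h[2]))
    return hits[:top]


top_hit = best_fragment_matches("MKTAYIAK", "GGTAYIGG", length=4, top=1)[0]
```

Sorting by other keys, for example by target offset or by a substitution-matrix score, corresponds to the "differing criteria" mentioned in the abstract.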
Affiliation(s)
- Sandeep Chakraborty
  - Department of Biological Sciences, Tata Institute of Fundamental Research, Mumbai, 400 005, India
12
Wu YW, Rho M, Doak TG, Ye Y. Stitching gene fragments with a network matching algorithm improves gene assembly for metagenomics. Bioinformatics 2013; 28:i363-i369. [PMID: 22962453 PMCID: PMC3436815 DOI: 10.1093/bioinformatics/bts388] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.1] [Indexed: 11/23/2022]
Abstract
Motivation: One of the difficulties in metagenomic assembly is that homologous genes from evolutionarily closely related species may behave like repeats and confuse assemblers. As a result, small contigs, each representing a short gene fragment, instead of complete genes, may be reported by an assembler. This further complicates annotation of metagenomic datasets, as annotation tools (such as gene predictors or similarity search tools) typically perform poorly on contigs encoding short gene fragments. Results: We present a novel way of using the de Bruijn graph assembly of metagenomes to improve the assembly of genes. A network matching algorithm is proposed for matching the de Bruijn graph of contigs against reference genes, to derive ‘gene paths’ in the graph (sequences of contigs containing gene fragments) that have the highest similarities to known genes, allowing gene fragments contained in multiple contigs to be connected to form more complete (or intact) genes. Tests on simulated and real datasets show that our approach (called GeneStitch) is able to significantly improve the assembly of genes from metagenomic sequences, by connecting contigs with the guidance of homologous genes, information that is orthogonal to the sequencing reads. We note that the improvement of gene assembly can be observed even when only distantly related genes are available as the reference. We further propose to use ‘gene graphs’ to represent the assembly of reads from homologous genes and discuss potential applications of gene graphs to improving functional annotation for metagenomics. Availability: The tools are available as open source for download at http://omics.informatics.indiana.edu/GeneStitch Contact: yye@indiana.edu
Affiliation(s)
- Yu-Wei Wu
  - School of Informatics and Computing, Indiana University, Bloomington, IN 47405, USA
13
Goodswen SJ, Kennedy PJ, Ellis JT. Evaluating high-throughput ab initio gene finders to discover proteins encoded in eukaryotic pathogen genomes missed by laboratory techniques. PLoS One 2012; 7:e50609. [PMID: 23226328 PMCID: PMC3511556 DOI: 10.1371/journal.pone.0050609] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.0] [Received: 06/17/2012] [Accepted: 10/24/2012] [Indexed: 11/25/2022]
Abstract
Next generation sequencing technology is advancing genome sequencing at an unprecedented level. By unravelling the code within a pathogen’s genome, every possible protein (prior to post-translational modifications) can theoretically be discovered, irrespective of life cycle stages and environmental stimuli. Now more than ever there is a great need for high-throughput ab initio gene finding. Ab initio gene finders use statistical models to predict genes and their exon-intron structures from the genome sequence alone. This paper evaluates whether existing ab initio gene finders can effectively predict genes to deduce proteins that have presently missed capture by laboratory techniques. An aim here is to identify possible patterns of prediction inaccuracies for gene finders as a whole irrespective of the target pathogen. All currently available ab initio gene finders are considered in the evaluation but only four fulfil high-throughput capability: AUGUSTUS, GeneMark_hmm, GlimmerHMM, and SNAP. These gene finders require training data specific to a target pathogen and consequently the evaluation results are inextricably linked to the availability and quality of the data. The pathogen, Toxoplasma gondii, is used to illustrate the evaluation methods. The results support current opinion that predicted exons by ab initio gene finders are inaccurate in the absence of experimental evidence. However, the results reveal some patterns of inaccuracy that are common to all gene finders and these inaccuracies may provide a focus area for future gene finder developers.
Affiliation(s)
- Stephen J. Goodswen
- School of Medical and Molecular Sciences, and the Ithree Institute at the University of Technology Sydney (UTS), New South Wales, Australia
- Paul J. Kennedy
- School of Software, Faculty of Engineering and Information Technology and the Centre for Quantum Computation and Intelligent Systems at the University of Technology Sydney (UTS), New South Wales, Australia
- John T. Ellis
- School of Medical and Molecular Sciences, and the Ithree Institute at the University of Technology Sydney (UTS), New South Wales, Australia
|
14
|
Abstract
We present here a novel methodology for predicting new genes in prokaryotic genomes on the basis of inherent energetics of DNA. Regions of higher thermodynamic stability were identified, which were filtered based on already known annotations to yield a set of potentially new genes. These were then processed for their compatibility with the stereo-chemical properties of proteins and tripeptide frequencies of proteins in Swissprot data, which results in a reliable set of new genes in a genome. Quite surprisingly, the methodology identifies new genes even in well-annotated genomes. Also, the methodology can handle genomes of any GC-content, size and number of annotated genes.
|
15
|
Goodswen SJ, Kennedy PJ, Ellis JT. A guide to in silico vaccine discovery for eukaryotic pathogens. Brief Bioinform 2012; 14:753-74. [PMID: 23097412 DOI: 10.1093/bib/bbs066] [Citation(s) in RCA: 27] [Impact Index Per Article: 2.3] [Indexed: 11/14/2022]
Abstract
In this article, a framework for an in silico pipeline is presented as a guide to high-throughput vaccine candidate discovery for eukaryotic pathogens, such as helminths and protozoa. Eukaryotic pathogens are mostly parasitic and cause some of the most damaging and difficult to treat diseases in humans and livestock. Consequently, these parasitic pathogens have a significant impact on economy and human health. The pipeline is based on the principle of reverse vaccinology and is constructed from freely available bioinformatics programs. There are several successful applications of reverse vaccinology to the discovery of subunit vaccines against prokaryotic pathogens but not yet against eukaryotic pathogens. The overriding aim of the pipeline, which focuses on eukaryotic pathogens, is to generate through computational processes of elimination and evidence gathering a ranked list of proteins based on a scoring system. These proteins are either surface components of the target pathogen or are secreted by the pathogen and are of a type known to be antigenic. No perfect predictive method is yet available; therefore, the highest-scoring proteins from the list require laboratory validation.
Affiliation(s)
- Stephen J Goodswen
- School of Medical and Molecular Sciences, Ithree Institute, University of Technology Sydney. Tel.: +61 2 9514 4161
|
16
|
Iwata H, Gotoh O. Benchmarking spliced alignment programs including Spaln2, an extended version of Spaln that incorporates additional species-specific features. Nucleic Acids Res 2012; 40:e161. [PMID: 22848105 PMCID: PMC3488211 DOI: 10.1093/nar/gks708] [Citation(s) in RCA: 113] [Impact Index Per Article: 9.4] [Indexed: 11/12/2022]
Abstract
Spliced alignment plays a central role in the precise identification of eukaryotic gene structures. Even though many spliced alignment programs have been developed, recent rapid progress in DNA sequencing technologies demands further improvements in software tools. Benchmarking algorithms under various conditions is an indispensable task for the development of better software; however, there is a dire lack of appropriate datasets usable for benchmarking spliced alignment programs. In this study, we have constructed two types of datasets: simulated sequence datasets and actual cross-species datasets. The datasets are designed to correspond to various real situations, i.e. divergent eukaryotic species, different types of reference sequences, and the wide divergence between query and target sequences. In addition, we have developed an extended version of our program Spaln, which incorporates two additional features to the scoring scheme of the original version, and examined this extended version, Spaln2, together with the original Spaln and other representative aligners based on our benchmark datasets. Although the effects of the modifications are not individually striking, Spaln2 is consistently most accurate and reasonably fast in most practical cases, especially for plants and fungi and for increasingly divergent pairs of target and query sequences.
Affiliation(s)
- Hiroaki Iwata
- Department of Intelligence Science and Technology, Graduate School of Informatics, Kyoto University, Yoshida Honmachi, Yoshida-Konoe-cho, Sakyo-ku, Kyoto 606-8501, Japan.
|
17
|
Abstract
Evolutionary genomics is a field that relies heavily upon comparing genomes, that is, the full complement of genes of one species with another. However, given a genome sequence and little else, as is now often the case, genes must first be found and annotated before downstream analyses can be done. Computational gene prediction techniques are brought to bear on the problem of constructing a genome annotation as manual annotation is extremely time-consuming and costly. This chapter reviews the methods by which the individual components of a typical gene structure are detected in genomic sequence and then discusses several popular statistical frameworks for integrated gene prediction on eukaryotic genome sequences.
Affiliation(s)
- Tyler Alioto
- Centro Nacional de Análisis Genómico, Barcelona, Spain.
|
18
|
Brosch M, Saunders GI, Frankish A, Collins MO, Yu L, Wright J, Verstraten R, Adams DJ, Harrow J, Choudhary JS, Hubbard T. Shotgun proteomics aids discovery of novel protein-coding genes, alternative splicing, and "resurrected" pseudogenes in the mouse genome. Genome Res 2011; 21:756-67. [PMID: 21460061 DOI: 10.1101/gr.114272.110] [Citation(s) in RCA: 87] [Impact Index Per Article: 6.7] [Indexed: 12/16/2022]
Abstract
Recent advances in proteomic mass spectrometry (MS) offer the chance to marry high-throughput peptide sequencing to transcript models, allowing the validation, refinement, and identification of new protein-coding loci. We present a novel pipeline that integrates highly sensitive and statistically robust peptide spectrum matching with genome-wide protein-coding predictions to perform large-scale gene validation and discovery in the mouse genome for the first time. In searching an excess of 10 million spectra, we have been able to validate 32%, 17%, and 7% of all protein-coding genes, exons, and splice boundaries, respectively. Moreover, we present strong evidence for the identification of multiple alternatively spliced translations from 53 genes and have uncovered 10 entirely novel protein-coding genes, which are not covered in any mouse annotation data sources. One such novel protein-coding gene is a fusion protein that spans the Ins2 and Igf2 loci to produce a transcript encoding the insulin II and the insulin-like growth factor 2-derived peptides. We also report nine processed pseudogenes that have unique peptide hits, demonstrating, for the first time, that they are not just transcribed but are translated and are therefore resurrected into new coding loci. This work not only highlights an important utility for MS data in genome annotation but also provides unique insights into the gene structure and propagation in the mouse genome. All these data have been subsequently used to improve the publicly available mouse annotation available in both the Vega and Ensembl genome browsers (http://vega.sanger.ac.uk).
Affiliation(s)
- Markus Brosch
- The Wellcome Trust Sanger Institute, The Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, United Kingdom
|
19
|
Pokhriyal N, Ponts N, Harris EY, Le Roch KG, Lonardi S. Novel Gene Discovery in the Human Malaria Parasite using Nucleosome Positioning Data. Comput Syst Bioinformatics Conf 2010; 9:124-135. [PMID: 25076982 PMCID: PMC4112967] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/03/2023]
Abstract
Recent genome-wide studies on nucleosome positioning in model organisms have shown strong evidence that nucleosome landscapes in the proximity of protein-coding genes exhibit regular characteristic patterns. Here, we propose a computational framework to discover novel genes in the human malaria parasite genome P. falciparum using nucleosome positioning inferred from MAINE-seq data. We rely on a classifier trained on the nucleosome landscape profiles of experimentally verified genes, and then used to discover new genes (without considering the primary DNA sequence). Cross-validation experiments show that our classifier is very accurate. About two thirds of the locations reported by the classifier match experimentally determined expressed sequence tags in GenBank, for which no gene has been annotated in the human malaria parasite.
Affiliation(s)
- N. Pokhriyal
- Department of Computer Science and Engineering, University of California, Riverside, CA 92521, USA
- N. Ponts
- Department of Cell Biology and Neuroscience, University of California, Riverside, CA 92521, USA
- E. Y. Harris
- Department of Computer Science and Engineering, University of California, Riverside, CA 92521, USA
- K. G. Le Roch
- Department of Cell Biology and Neuroscience, University of California, Riverside, CA 92521, USA
- S. Lonardi
- Department of Computer Science and Engineering, University of California, Riverside, CA 92521, USA
|
20
|
Luo L, Li H, Zhang L. ORF organization and gene recognition in the yeast genome. Comp Funct Genomics 2003; 4:318-28. [PMID: 18629282 PMCID: PMC2448446 DOI: 10.1002/cfg.292] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Received: 10/31/2002] [Revised: 03/03/2003] [Accepted: 03/10/2003] [Indexed: 11/10/2022]
Abstract
Some rules on gene recognition and ORF organization in the Saccharomyces cerevisiae genome are demonstrated by statistical analyses of sequence data. This study includes: (a) The random frame rule, namely that the six reading frames W1, W2, W3, C1, C2 and C3 in the double-stranded genome are randomly occupied by ORFs (related phenomena on ORF overlapping are also discussed). (b) The inhomogeneity rule, namely that coding and non-coding ORFs differ in inhomogeneity of base composition in the three codon positions. By use of the inhomogeneity index (IHI), one can make a distinction between coding (IHI > 14) and non-coding (IHI ≤ 14) ORFs at 95% accuracy. We find that 'spurious' ORFs (with IHI ≤ 14) are distributed mainly in three classes of ORFs, namely, those with 'similarity to unknown proteins', those with 'no similarity', or 'questionable ORFs'. The total number of spurious ORFs (which are unlikely to be regarded as coding ORFs) is estimated to be 470. (c) The evaluation of ORF length distribution shows that below 200 amino acids the occurrence of ATG initiator ORFs is close to random.
Affiliation(s)
- Liaofu Luo
- Laboratory of Theoretical Biophysics, Faculty of Science and Technology, Inner Mongolia University, Hohhot 010021, China
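The six reading frames named in the abstract above (W1 to W3 on one strand, C1 to C3 on the reverse complement) can be illustrated with a minimal sketch. It is not the authors' code: the function names, the ATG-to-stop definition of an ORF, and the coordinate convention are assumptions made for illustration only.

```python
# Illustrative sketch only: enumerate ATG...stop ORFs in all six reading frames.
# Frame labels W1-W3 / C1-C3 follow the abstract; everything else is assumed.
STOP_CODONS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def orfs_in_frame(seq, offset, min_codons=1):
    """Yield (start, end) of ORFs in one frame; end is exclusive and includes the stop."""
    start = None
    for i in range(offset, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None and codon == "ATG":
            start = i  # first in-frame ATG opens the ORF
        elif start is not None and codon in STOP_CODONS:
            if (i - start) // 3 >= min_codons:  # codons before the stop
                yield (start, i + 3)
            start = None


def six_frame_orfs(seq, min_codons=1):
    """Map frame label to ORF list; C-frame coordinates refer to the reverse complement."""
    seq = seq.upper()
    strands = {"W": seq, "C": reverse_complement(seq)}
    return {
        f"{label}{k + 1}": list(orfs_in_frame(strand, k, min_codons))
        for label, strand in strands.items()
        for k in range(3)
    }
```

Calling `six_frame_orfs("ATGAAATAG")` reports a single ORF in frame W1 and none in the other five frames. Coordinates for C1 to C3 are given on the reverse-complemented string, which keeps the sketch short; mapping them back to the forward strand is left out.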
|
21
|
Abstract
High-throughput DNA sequencing is increasing the amount of public complete genomes even though a precise gene catalogue for each organism is not yet available. In this context, computational gene finders play a key role in producing a first and cost-effective annotation. Nowadays a compilation of gene prediction tools has been made available to the scientific community and, despite the high number, they can be divided into two main categories: (1) ab initio and (2) evidence based. In the following, we will provide an overview of main methodologies to predict correct exon-intron structures of eukaryotic genes falling in such categories. We will take into account also new strategies that commonly refine ab initio predictions employing comparative genomics or other evidence such as expression data. Finally, we will briefly introduce metrics to in house evaluation of gene predictions in terms of sensitivity and specificity at nucleotide, exon, and gene levels as well.
Affiliation(s)
- Ernesto Picardi
- Dipartimento di Biochimica e Biologia Molecolare E Quagliariello, University of Bari, Bari, Italy
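The nucleotide-level sensitivity and specificity mentioned in the abstract above can be sketched as follows. This is an illustration, not code from the chapter: the function names and the half-open interval representation are assumptions, and "specificity" is taken here in the sense common in the gene-finding literature, TP / (TP + FP), which is itself an assumption about the authors' definition.

```python
# Illustrative sketch only: nucleotide-level accuracy of a gene prediction.
# Intervals are half-open (start, end) pairs of coding regions; names are hypothetical.
def coding_positions(intervals):
    """Expand (start, end) intervals into the set of covered nucleotide positions."""
    positions = set()
    for start, end in intervals:
        positions.update(range(start, end))
    return positions


def nucleotide_accuracy(annotated, predicted):
    """Return (sensitivity, specificity) with specificity defined as TP / (TP + FP)."""
    truth = coding_positions(annotated)
    pred = coding_positions(predicted)
    tp = len(truth & pred)   # coding nucleotides predicted as coding
    fn = len(truth - pred)   # coding nucleotides missed by the prediction
    fp = len(pred - truth)   # non-coding nucleotides predicted as coding
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tp / (tp + fp) if tp + fp else 0.0
    return sensitivity, specificity
```

For an annotated exon `(0, 10)` and a predicted exon `(5, 15)`, half of the true coding nucleotides are recovered and half of the predicted ones are correct, so both values are 0.5. Exon-level and gene-level variants would compare whole intervals or whole gene structures instead of single positions.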
|
22
|
Jost D, Everaers R. Genome wide application of DNA melting analysis. J Phys Condens Matter 2009; 21:034108. [PMID: 21817253 DOI: 10.1088/0953-8984/21/3/034108] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.5] [Indexed: 05/31/2023]
Abstract
Correspondences between functional and thermodynamic melting properties in a genome are being increasingly employed for ab initio gene finding and for the interpretation of the evolution of genomes. Here we present the first systematic genome wide comparison between biologically coding domains and thermodynamically stable regions. In particular, we develop statistical methods to estimate the reliability of the resulting predictions. Not surprisingly, we find that the success of the approach depends on the difference in GC content between the coding and the non-coding parts of the genome and on the percentage of coding base-pairs in the sequence. These prerequisites vary strongly between species, where we observe no systematic differences between eukaryotes and prokaryotes. We find a number of organisms in which the strong correlation of coding domains and thermodynamically stable regions allows us to identify putative exons or genes to complement existing approaches. In contrast to previous investigations along these lines we have not employed the Poland-Scheraga (PS) model of DNA melting but use the earlier Zimm-Bragg (ZB) model. The Ising-like form of the ZB model can be viewed as an approximation to the PS model, with averaged loop entropies included into the cooperative factor [Formula: see text]. This results in a speed-up by a factor of 20-100 compared to the Fixman-Freire algorithm for the solution of the PS model. We show that for genomic sequences the resulting systematic errors are negligible compared to the parameterization uncertainty of the models. We argue that for limited computing resources, available CPU power is better invested in broadening the statistical base for genomic investigations than in marginal improvements of the description of the physical melting behavior.
Affiliation(s)
- Daniel Jost
- Laboratoire de Physique de l'École Normale Supérieure de Lyon, Université de Lyon, CNRS UMR 5672, 46 Allée d'Italie 69364 Lyon Cedex 07, France
|
23
|
De Bona F, Ossowski S, Schneeberger K, Rätsch G. Optimal spliced alignments of short sequence reads. Bioinformatics 2008; 24:i174-80. [PMID: 18689821 DOI: 10.1093/bioinformatics/btn300] [Citation(s) in RCA: 84] [Impact Index Per Article: 5.3] [Indexed: 11/13/2022]
Abstract
MOTIVATION Next generation sequencing technologies open exciting new possibilities for genome and transcriptome sequencing. While reads produced by these technologies are relatively short and error prone compared to the Sanger method their throughput is several magnitudes higher. To utilize such reads for transcriptome sequencing and gene structure identification, one needs to be able to accurately align the sequence reads over intron boundaries. This represents a significant challenge given their short length and inherent high error rate. RESULTS We present a novel approach, called QPALMA, for computing accurate spliced alignments which takes advantage of the read's quality information as well as computational splice site predictions. Our method uses a training set of spliced reads with quality information and known alignments. It uses a large margin approach similar to support vector machines to estimate its parameters to maximize alignment accuracy. In computational experiments, we illustrate that the quality information as well as the splice site predictions help to improve the alignment quality. Finally, to facilitate mapping of massive amounts of sequencing data typically generated by the new technologies, we have combined our method with a fast mapping pipeline based on enhanced suffix arrays. Our algorithms were optimized and tested using reads produced with the Illumina Genome Analyzer for the model plant Arabidopsis thaliana. AVAILABILITY Datasets for training and evaluation, additional results and a stand-alone alignment tool implemented in C++ and python are available at http://www.fml.mpg.de/raetsch/projects/qpalma.
Affiliation(s)
- Fabio De Bona
- Friedrich Miescher Laboratory, Max Planck Society, Spemannstr 39, 72076 Tübingen, Germany
|
24
|
Liu Q, Crammer K, Pereira FCN, Roos DS. Reranking candidate gene models with cross-species comparison for improved gene prediction. BMC Bioinformatics 2008; 9:433. [PMID: 18854050 PMCID: PMC2587481 DOI: 10.1186/1471-2105-9-433] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.4] [Received: 06/18/2008] [Accepted: 10/14/2008] [Indexed: 11/10/2022]
Abstract
Background Most gene finders score candidate gene models with state-based methods, typically HMMs, by combining local properties (coding potential, splice donor and acceptor patterns, etc). Competing models with similar state-based scores may be distinguishable with additional information. In particular, functional and comparative genomics datasets may help to select among competing models of comparable probability by exploiting features likely to be associated with the correct gene models, such as conserved exon/intron structure or protein sequence features. Results We have investigated the utility of a simple post-processing step for selecting among a set of alternative gene models, using global scoring rules to rerank competing models for more accurate prediction. For each gene locus, we first generate the K best candidate gene models using the gene finder Evigan, and then rerank these models using comparisons with putative orthologous genes from closely-related species. Candidate gene models with lower scores in the original gene finder may be selected if they exhibit strong similarity to probable orthologs in coding sequence, splice site location, or signal peptide occurrence. Experiments on Drosophila melanogaster demonstrate that reranking based on cross-species comparison outperforms the best gene models identified by Evigan alone, and also outperforms the comparative gene finders GeneWise and Augustus+. Conclusion Reranking gene models with cross-species comparison improves gene prediction accuracy. This straightforward method can be readily adapted to incorporate additional lines of evidence, as it requires only a ranked source of candidate gene models.
Affiliation(s)
- Qian Liu
- Department of Computer and Information Science, University of Pennsylvania, Philadelphia, Pennsylvania, USA.
|
25
|
Abstract
BACKGROUND Computational gene prediction tools routinely generate large volumes of predicted coding exons (putative exons). One common limitation of these tools is the relatively low specificity due to the large amount of non-coding regions. METHODS A statistical approach is developed that largely improves the gene prediction specificity. The key idea is to utilize the evolutionary conservation principle relative to the coding exons. By first exploiting the homology between genomes of two related species, a probability model for the evolutionary conservation pattern of codons across different genomes is developed. A probability model for the dependency between adjacent codons/triplets is added to differentiate coding exons and random sequences. Finally, the log odds ratio is developed to classify putative exons into the group of coding exons and the group of non-coding regions. RESULTS The method was tested on pre-aligned human-mouse sequences where the putative exons are predicted by GENSCAN and TWINSCAN. The proposed method is able to improve the exon specificity by 73% and 32% respectively, while the loss of sensitivity is ≤ 1%. The method also keeps 98% of RefSeq gene structures that are correctly predicted by TWINSCAN when removing 26% of predicted genes that are in non-coding regions. The estimated number of true exons in TWINSCAN's predictions is 157,070. The results and the executable codes can be downloaded from http://www.stat.purdue.edu/~jingwu/codon/ CONCLUSION The proposed method demonstrates an application of the evolutionary conservation principle to coding exons. It is a complementary method which can be used as an additional criterion to refine many existing gene predictions.
Affiliation(s)
- Jing Wu
- Department of Statistics, Purdue University, 150 N, University Street, West Lafayette, IN 47906, USA.
|
26
|
Ter-Hovhannisyan V, Lomsadze A, Chernoff YO, Borodovsky M. Gene prediction in novel fungal genomes using an ab initio algorithm with unsupervised training. Genome Res 2008; 18:1979-90. [PMID: 18757608 DOI: 10.1101/gr.081612.108] [Citation(s) in RCA: 634] [Impact Index Per Article: 39.6] [Indexed: 11/25/2022]
Abstract
We describe a new ab initio algorithm, GeneMark-ES version 2, that identifies protein-coding genes in fungal genomes. The algorithm does not require a predetermined training set to estimate parameters of the underlying hidden Markov model (HMM). Instead, the anonymous genomic sequence in question is used as an input for iterative unsupervised training. The algorithm extends our previously developed method tested on genomes of Arabidopsis thaliana, Caenorhabditis elegans, and Drosophila melanogaster. To better reflect features of fungal gene organization, we enhanced the intron submodel to accommodate sequences with and without branch point sites. This design enables the algorithm to work equally well for species with the kinds of variations in splicing mechanisms seen in the fungal phyla Ascomycota, Basidiomycota, and Zygomycota. Upon self-training, the intron submodel switches on in several steps to reach its full complexity. We demonstrate that the algorithm accuracy, both at the exon and the whole gene level, is favorably compared to the accuracy of gene finders that employ supervised training. Application of the new method to known fungal genomes indicates substantial improvement over existing annotations. By eliminating the effort necessary to build comprehensive training sets, the new algorithm can streamline and accelerate the process of annotation in a large number of fungal genome sequencing projects.
|
27
|
Abstract
MOTIVATION Finding protein-coding genes in a newly determined genomic sequence is the first step toward understanding the content written in the genome. Sequences of transcripts of homologous genes, if available, can considerably improve accuracy of prediction of genes and their structures, compared with that without such knowledge. As protein sequences are generally better conserved than nucleotide sequences, remote homologs can be used as templates, extending the applicability of evidence-based gene recognition methods. However, no tool seems to have been developed so far to simultaneously map and align a number of protein sequences on mammalian-sized genomic sequence. RESULTS We have extended our computer program Spaln to accept protein sequences, as well as cDNA sequences, as queries. When the query and the target sequences are reasonably similar, e.g. between mammalian orthologs, Spaln runs one to two orders of magnitude faster than conventional approaches that rely on Blast search followed by dynamic-programming-based spliced alignment. Exon-level and gene-level accuracies of Spaln are significantly higher than those obtained by the best available methods of the same type, particularly when the query and the target are distantly related. AVAILABILITY Spaln is accessible online for a few species at http://www.genome.ist.i.kyoto-u.ac.jp/~aln_user. The source code is available for free for academic users from the same site.
Affiliation(s)
- Osamu Gotoh
- Department of Intelligence Science and Technology, Graduate School of Informatics, Kyoto University, Yoshida Honmachi, Sakyo-ku, Kyoto 606-8501, Japan.
|
28
|
Kapustin Y, Souvorov A, Tatusova T, Lipman D. Splign: algorithms for computing spliced alignments with identification of paralogs. Biol Direct 2008; 3:20. [PMID: 18495041 PMCID: PMC2440734 DOI: 10.1186/1745-6150-3-20] [Citation(s) in RCA: 244] [Impact Index Per Article: 15.3] [Received: 05/13/2008] [Accepted: 05/21/2008] [Indexed: 11/10/2022]
Abstract
BACKGROUND The computation of accurate alignments of cDNA sequences against a genome is at the foundation of modern genome annotation pipelines. Several factors such as presence of paralogs, small exons, non-consensus splice signals, sequencing errors and polymorphic sites pose recognized difficulties to existing spliced alignment algorithms. RESULTS We describe a set of algorithms behind a tool called Splign for computing cDNA-to-Genome alignments. The algorithms include a high-performance preliminary alignment, a compartment identification based on a formally defined model of adjacent duplicated regions, and a refined sequence alignment. In a series of tests, Splign has produced more accurate results than other tools commonly used to compute spliced alignments, in a reasonable amount of time. CONCLUSION Splign's ability to deal with various issues complicating the spliced alignment problem makes it a helpful tool in eukaryotic genome annotation processes and alternative splicing studies. Its performance is enough to align the largest currently available pools of cDNA data such as the human EST set on a moderate-sized computing cluster in a matter of hours. The duplications identification (compartmentization) algorithm can be used independently in other areas such as the study of pseudogenes.
Affiliation(s)
- Yuri Kapustin
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20814, USA.
|
29
|
Abstract
As the number of sequenced genomes increases, the ability to deduce genome function becomes increasingly salient. For many genome sequences, the only annotation that will be available for the foreseeable future will be based on computational predictions and comparisons with functional elements in related species. Here we discuss computational approaches for automated genome-wide annotation of functional elements in mammalian genomes. These include methods for ab initio and comparative gene-structure predictions. Gene features such as intron splice sites, 3' untranslated regions, promoters, and cis-regulatory elements are discussed, as is a novel method for predicting DNaseI hypersensitive sites. Recent methodologies for predicting noncoding RNA genes, including microRNA genes and their targets, are also reviewed.
Affiliation(s)
- Steven J M Jones
- Genome Sciences Centre, British Columbia Cancer Research Center, Vancouver, British Columbia, V5Z 1L3, Canada.
|
30
|
Sarje A, Aluru S. Parallel biological sequence alignments on the Cell Broadband Engine. 2008. [DOI: 10.1109/ipdps.2008.4536328] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.8] [Indexed: 11/06/2022]
|
31
|
Gotoh O. A space-efficient and accurate method for mapping and aligning cDNA sequences onto genomic sequence. Nucleic Acids Res 2008; 36:2630-8. [PMID: 18344523 PMCID: PMC2377433 DOI: 10.1093/nar/gkn105] [Citation(s) in RCA: 65] [Impact Index Per Article: 4.1] [Indexed: 12/04/2022]
Abstract
The mapping and alignment of transcripts (cDNA, expressed sequence tag or amino acid sequences) onto a genomic sequence is a fundamental step for genome annotation, including gene finding and analyses of transcriptional activity, alternative splicing and nucleotide polymorphisms. As DNA sequence data of genomes and transcripts are accumulating at an unprecedented rate, steady improvement in accuracy, speed and space requirement in the computational tools for mapping/alignment is desired. We devised a multi-phase heuristic algorithm and implemented it in the development of the stand-alone computer program Spaln (space-efficient spliced alignment). Spaln is reasonably fast and space efficient; it requires <1 Gb of memory to map and align >120 000 Unigene sequences onto the unmasked whole human genome with a conventional computer, finishing the job in <6 h. With artificially introduced noise of various levels, Spaln significantly outperforms other leading alignment programs currently available with respect to the accuracy of mapped exon–intron structures. This performance is achieved without extensive learning procedures to adjust parameter values to a particular organism. According to the handiness and accuracy, Spaln may be used for studies on a wide area of genome analyses.
Affiliation(s)
- Osamu Gotoh
- Department of Intelligence Science and Technology, Graduate School of Informatics, Kyoto University, Yoshida Honmachi, Sakyo-ku, Kyoto 606-8501, Japan.
|
32
|
|
33
|
Schulze U, Hepp B, Ong CS, Rätsch G. PALMA: mRNA to genome alignments using large margin algorithms. Bioinformatics 2007; 23:1892-900. [PMID: 17537755 DOI: 10.1093/bioinformatics/btm275] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.6] [Indexed: 11/12/2022]
Abstract
MOTIVATION Despite many years of research on how to properly align sequences in the presence of sequencing errors, alternative splicing and micro-exons, the correct alignment of mRNA sequences to genomic DNA is still a challenging task. RESULTS We present a novel approach based on large margin learning that combines accurate splice site predictions with common sequence alignment techniques. By solving a convex optimization problem, our algorithm, called PALMA, tunes the parameters of the model such that true alignments score higher than other alignments. We study the accuracy of alignments of mRNAs containing artificially generated micro-exons to genomic DNA. In a carefully designed experiment, we show that our algorithm accurately identifies the intron boundaries as well as boundaries of the optimal local alignment. It outperforms all other methods: for 5702 artificially shortened EST sequences from Caenorhabditis elegans and human, it correctly identifies the intron boundaries in all except two cases. The best other method is a recently proposed method called exalin which misaligns 37 of the sequences. Our method also demonstrates robustness to mutations, insertions and deletions, retaining accuracy even at high noise levels. AVAILABILITY Datasets for training, evaluation and testing, additional results and a stand-alone alignment tool implemented in C++ and python are available at http://www.fml.mpg.de/raetsch/projects/palma
Affiliation(s)
- Uta Schulze
- Friedrich Miescher Laboratory, Max Planck Society, Tübingen, Germany
|
34
|
Parra G, Bradnam K, Korf I. CEGMA: a pipeline to accurately annotate core genes in eukaryotic genomes. Bioinformatics 2007; 23:1061-7. [PMID: 17332020 DOI: 10.1093/bioinformatics/btm071] [Citation(s) in RCA: 1553] [Impact Index Per Article: 91.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION The numbers of finished and ongoing genome projects are increasing at a rapid rate, and providing the catalog of genes for these new genomes is a key challenge. Obtaining a set of well-characterized genes is a basic requirement in the initial steps of any genome annotation process. An accurate set of genes is needed in order to learn about species-specific properties, to train gene-finding programs, and to validate automatic predictions. Unfortunately, many new genome projects lack comprehensive experimental data to derive a reliable initial set of genes. RESULTS In this study, we report a computational method, CEGMA (Core Eukaryotic Genes Mapping Approach), for building a highly reliable set of gene annotations in the absence of experimental data. We define a set of conserved protein families that occur in a wide range of eukaryotes, and present a mapping procedure that accurately identifies their exon-intron structures in a novel genomic sequence. CEGMA includes the use of profile-hidden Markov models to ensure the reliability of the gene structures. Our procedure allows one to build an initial set of reliable gene annotations in potentially any eukaryotic genome, even those in draft stages. AVAILABILITY Software and data sets are available online at http://korflab.ucdavis.edu/Datasets.
Affiliation(s)
- Genis Parra
- UC Davis Genome Center, University of California Davis, Davis, CA 95616, USA
|
35
|
Rätsch G, Sonnenburg S, Srinivasan J, Witte H, Müller KR, Sommer RJ, Schölkopf B. Improving the Caenorhabditis elegans genome annotation using machine learning. PLoS Comput Biol 2007; 3:e20. [PMID: 17319737 PMCID: PMC1808025 DOI: 10.1371/journal.pcbi.0030020] [Citation(s) in RCA: 51] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/02/2006] [Accepted: 12/20/2006] [Indexed: 11/19/2022] Open
Abstract
For modern biology, precise genome annotations are of prime importance, as they allow the accurate definition of genic regions. We employ state-of-the-art machine learning methods to assay and improve the accuracy of the genome annotation of the nematode Caenorhabditis elegans. The proposed machine learning system is trained to recognize exons and introns on the unspliced mRNA, utilizing recent advances in support vector machines and label sequence learning. In 87% (coding and untranslated regions) and 95% (coding regions only) of all genes tested in several out-of-sample evaluations, our method correctly identified all exons and introns. Notably, only 37% and 50%, respectively, of the presently unconfirmed genes in the C. elegans genome annotation agree with our predictions, thus we hypothesize that a sizable fraction of those genes are not correctly annotated. A retrospective evaluation of the Wormbase WS120 annotation of C. elegans reveals that splice form predictions on unconfirmed genes in WS120 are inaccurate in about 18% of the considered cases, while our predictions deviate from the truth only in 10%-13%. We experimentally analyzed 20 controversial genes on which our system and the annotation disagree, confirming the superiority of our predictions. While our method correctly predicted 75% of those cases, the standard annotation was never completely correct. The accuracy of our system is further corroborated by a comparison with two other recently proposed systems that can be used for splice form prediction: SNAP and ExonHunter. We conclude that the genome annotation of C. elegans and other organisms can be greatly enhanced using modern machine learning technology.
Affiliation(s)
- Gunnar Rätsch
- Friedrich Miescher Laboratory, Max Planck Society, Tübingen, Germany.
|
36
|
Bernal A, Crammer K, Hatzigeorgiou A, Pereira F. Global discriminative learning for higher-accuracy computational gene prediction. PLoS Comput Biol 2007; 3:e54. [PMID: 17367206 PMCID: PMC1828702 DOI: 10.1371/journal.pcbi.0030054] [Citation(s) in RCA: 60] [Impact Index Per Article: 3.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/16/2006] [Accepted: 02/01/2007] [Indexed: 11/18/2022] Open
Abstract
Most ab initio gene predictors use a probabilistic sequence model, typically a hidden Markov model, to combine separately trained models of genomic signals and content. By combining separate models of relevant genomic features, such gene predictors can exploit small training sets and incomplete annotations, and can be trained fairly efficiently. However, that type of piecewise training does not optimize prediction accuracy and has difficulty in accounting for statistical dependencies among different parts of the gene model. With genomic information being created at an ever-increasing rate, it is worth investigating alternative approaches in which many different types of genomic evidence, with complex statistical dependencies, can be integrated by discriminative learning to maximize annotation accuracy. Among discriminative learning methods, large-margin classifiers have become prominent because of the success of support vector machines (SVM) in many classification tasks. We describe CRAIG, a new program for ab initio gene prediction based on a conditional random field model with semi-Markov structure that is trained with an online large-margin algorithm related to multiclass SVMs. Our experiments on benchmark vertebrate datasets and on regions from the ENCODE project show significant improvements in prediction accuracy over published gene predictors that use intrinsic features only, particularly at the gene level and on genes with long introns. We describe a new approach to statistical learning for sequence data that is broadly applicable to computational biology problems and that has experimentally demonstrated advantages over current hidden Markov model (HMM)-based methods for sequence analysis. The methods we describe in this paper, implemented in the CRAIG program, allow researchers to modularly specify and train sequence analysis models that combine a wide range of weakly informative features into globally optimal predictions. 
Our results for the gene prediction problem show significant improvements over existing ab initio gene predictors on a variety of tests, including the specially challenging ENCODE regions. Such improved predictions, particularly on initial and single exons, could benefit researchers who are seeking more accurate means of recognizing such important features as signal peptides and regulatory regions. More generally, we believe that our method, by combining the structure-describing capabilities of HMMs with the accuracy of margin-based classification methods, provides a general tool for statistical learning in biological sequences that will replace HMMs in any sequence modeling task for which there is annotated training data.
Affiliation(s)
- Axel Bernal
- Department of Computer and Information Science, University of Pennsylvania, Philadelphia, Pennsylvania, United States of America.
|
37
|
Gertz EM, Yu YK, Agarwala R, Schäffer AA, Altschul SF. Composition-based statistics and translated nucleotide searches: improving the TBLASTN module of BLAST. BMC Biol 2006; 4:41. [PMID: 17156431 PMCID: PMC1779365 DOI: 10.1186/1741-7007-4-41] [Citation(s) in RCA: 331] [Impact Index Per Article: 18.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/06/2006] [Accepted: 12/07/2006] [Indexed: 11/29/2022] Open
Abstract
Background TBLASTN is a mode of operation for BLAST that aligns protein sequences to a nucleotide database translated in all six frames. We present the first description of the modern implementation of TBLASTN, focusing on new techniques that were used to implement composition-based statistics for translated nucleotide searches. Composition-based statistics use the composition of the sequences being aligned to generate more accurate E-values, which allows for a more accurate distinction between true and false matches. Until recently, composition-based statistics were available only for protein-protein searches. They are now available as a command line option for recent versions of TBLASTN and as an option for TBLASTN on the NCBI BLAST web server. Results We evaluate the statistical and retrieval accuracy of the E-values reported by a baseline version of TBLASTN and by two variants that use different types of composition-based statistics. To test the statistical accuracy of TBLASTN, we ran 1000 searches using scrambled proteins from the mouse genome and a database of human chromosomes. To test retrieval accuracy, we modernize and adapt to translated searches a test set previously used to evaluate the retrieval accuracy of protein-protein searches. We show that composition-based statistics greatly improve the statistical accuracy of TBLASTN, at a small cost to the retrieval accuracy. Conclusion TBLASTN is widely used, as it is common to wish to compare proteins to chromosomes or to libraries of mRNAs. Composition-based statistics improve the statistical accuracy, and therefore the reliability, of TBLASTN results. The algorithms used by TBLASTN are not widely known, and some of the most important are reported here. The data used to test TBLASTN are available for download and may be useful in other studies of translated search algorithms.
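The six-frame translation that the abstract mentions can be illustrated with a minimal Python sketch. It is not TBLASTN code and it leaves out composition-based statistics entirely; it only shows how one nucleotide sequence yields six protein reading frames under the standard genetic code. The function names (`reverse_complement`, `translate`, `six_frame_translations`) and the frame labels (`+1` to `-3`) are assumptions chosen for this illustration.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate complete codons only; unknown codons (e.g. with N) become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna: str) -> dict:
    """Map hypothetical frame labels '+1'..'+3' and '-1'..'-3' to protein strings."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for label, protein in six_frame_translations("ATGGCCTAA").items():
        print(label, protein)
```

A translated search then aligns the protein query against each of these six strings rather than against the nucleotides themselves.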
Affiliation(s)
- E Michael Gertz
- National Center for Biotechnology Information, National Institutes of Health, Department of Health and Human Services, Bethesda, MD, USA
- Yi-Kuo Yu
- National Center for Biotechnology Information, National Institutes of Health, Department of Health and Human Services, Bethesda, MD, USA
- Richa Agarwala
- National Center for Biotechnology Information, National Institutes of Health, Department of Health and Human Services, Bethesda, MD, USA
- Alejandro A Schäffer
- National Center for Biotechnology Information, National Institutes of Health, Department of Health and Human Services, Bethesda, MD, USA
- Stephen F Altschul
- National Center for Biotechnology Information, National Institutes of Health, Department of Health and Human Services, Bethesda, MD, USA
|
38
|
Wong ESW, Young LJ, Papenfuss AT, Belov K. In silico identification of opossum cytokine genes suggests the complexity of the marsupial immune system rivals that of eutherian mammals. Immunome Res 2006; 2:4. [PMID: 17094811 PMCID: PMC1660534 DOI: 10.1186/1745-7580-2-4] [Citation(s) in RCA: 40] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/26/2006] [Accepted: 11/10/2006] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Cytokines are small proteins that regulate immunity in vertebrate species. Marsupial and eutherian mammals last shared a common ancestor more than 180 million years ago, so it is not surprising that attempts to isolate many key marsupial cytokines using traditional laboratory techniques have been unsuccessful. This paucity of molecular data has led some authors to suggest that the marsupial immune system is 'primitive' and not on par with the sophisticated immune system of eutherian (placental) mammals. RESULTS The sequencing of the first marsupial genome has allowed us to identify highly divergent immune genes. We used gene prediction methods that incorporate the identification of gene location using BLAST, SYNTENY + BLAST and HMMER to identify 23 key marsupial immune genes, including IFN-gamma, IL-2, IL-4, IL-6, IL-12 and IL-13, in the genome of the grey short-tailed opossum (Monodelphis domestica). Many of these genes were not predicted in the publicly available automated annotations. CONCLUSION The power of this approach was demonstrated by the identification of orthologous cytokines between marsupials and eutherians that share only 30% identity at the amino acid level. Furthermore, the presence of key immunological genes suggests that marsupials do indeed possess a sophisticated immune system, whose function may parallel that of eutherian mammals.
Affiliation(s)
- Emily SW Wong
- Faculty of Veterinary Science, University of Sydney, Sydney, New South Wales, Australia
- Lauren J Young
- School of Chemical and Biomedical Sciences, Central Queensland University, Rockhampton, Queensland, Australia
- Anthony T Papenfuss
- Division of Bioinformatics, The Walter and Eliza Hall Institute of Medical Research, Melbourne, Victoria, Australia
- Katherine Belov
- Faculty of Veterinary Science, University of Sydney, Sydney, New South Wales, Australia
|
39
|
Abstract
We introduce a new system, called shortHMM, for predicting exons, which predicts individual exons using two related genomes. In this system, we build a hidden semi-Markov model to identify exons. In the hidden Markov model, we propose joint probability models of nucleotides in introns, splice sites, 5'UTR, 3'UTR, and intergenic regions by exploiting the homology between related genomes. In order to reduce the false positive rate of the hidden Markov model, we develop a screening process which is able to identify intergenic regions. We then build a classifier by combining the statistics from the hidden Markov model and the screening process. We implement shortHMM on human-mouse sequence alignments. The source codes are available at < www.stat.purdue.edu/ jingwu/hmm >. Compared to TWINSCAN and SLAM, shortHMM is substantially more powerful in identifying AT-rich RefSeq exons (8% more AT-rich RefSeq exons were predicted), as well as slightly more powerful in identifying RefSeq exons (3-10% more RefSeq exons were predicted), at a similar or lower false positive rate, with less computing time and with less memory usage. Last, shortHMM is also capable of finding new potential exons.
Affiliation(s)
- Jing Wu
- Department of Statistics, Purdue University, West Lafayette, Indiana 47906, USA.
|
40
|
Abstract
Gene structure prediction is one of the most important problems in computational molecular biology. It involves two steps: the first is finding the evidence (e.g., predicting splice sites) and the second is interpreting the evidence, that is, trying to determine the whole gene structure by assembling its pieces. In this paper, we suggest a combinatorial solution to the second step, which is also referred to as the "Exon Assembly Problem." We use a similarity-based approach that aims to produce a single gene structure based on similarities to a known homologous sequence. We target the sparse case, where filtering has been applied to the data, resulting in a set of O(n) candidate exon blocks. Our algorithm yields an $O(n^2\sqrt{n})$ solution.
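To make the assembly step concrete, here is a deliberately simplified Python sketch. It is not the algorithm from this abstract: it replaces the similarity-based formulation with plain weighted chaining, in which each candidate exon block already carries a precomputed score, and it uses a straightforward quadratic dynamic program. The tuple layout `(start, end, score)` and the name `best_exon_chain` are assumptions made for this illustration.

```python
def best_exon_chain(blocks):
    """Pick the highest-scoring chain of non-overlapping candidate exon blocks.

    blocks: iterable of (start, end, score) tuples with inclusive coordinates.
    Returns (total_score, chain), with the chain ordered along the sequence.
    """
    blocks = sorted(blocks, key=lambda b: b[1])  # order by end coordinate
    if not blocks:
        return 0.0, []

    n = len(blocks)
    best = [0.0] * n   # best[i]: best chain score among chains ending at block i
    prev = [-1] * n    # back-pointer to the previous block in that chain

    for i, (start, _end, score) in enumerate(blocks):
        best[i] = score
        for j in range(i):
            # Block j may precede block i only if it ends before i starts.
            if blocks[j][1] < start and best[j] + score > best[i]:
                best[i] = best[j] + score
                prev[i] = j

    # Trace back from the best-scoring final block.
    i = max(range(n), key=best.__getitem__)
    total = best[i]
    chain = []
    while i != -1:
        chain.append(blocks[i])
        i = prev[i]
    return total, chain[::-1]


if __name__ == "__main__":
    candidates = [(0, 10, 5.0), (5, 20, 8.0), (12, 30, 4.0), (25, 40, 6.0)]
    print(best_exon_chain(candidates))
```

In a similarity-based setting the score of a chain would instead depend on how the concatenated blocks align to the homologous sequence, which is the part the paper's combinatorial solution addresses.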
Affiliation(s)
- Carmel Kent
- Department of Computer Science, University of Haifa, Haifa Israel.
|
41
|
Schultz AK, Zhang M, Leitner T, Kuiken C, Korber B, Morgenstern B, Stanke M. A jumping profile Hidden Markov Model and applications to recombination sites in HIV and HCV genomes. BMC Bioinformatics 2006; 7:265. [PMID: 16716226 PMCID: PMC1525204 DOI: 10.1186/1471-2105-7-265] [Citation(s) in RCA: 75] [Impact Index Per Article: 4.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/16/2005] [Accepted: 05/22/2006] [Indexed: 12/11/2022] Open
Abstract
BACKGROUND Jumping alignments have recently been proposed as a strategy to search a given multiple sequence alignment A against a database. Instead of comparing a database sequence S to the multiple alignment or profile as a whole, S is compared and aligned to individual sequences from A. Within this alignment, S can jump between different sequences from A, so different parts of S can be aligned to different sequences from the input multiple alignment. This approach is particularly useful for dealing with recombination events. RESULTS We developed a jumping profile Hidden Markov Model (jpHMM), a probabilistic generalization of the jumping-alignment approach. Given a partition of the aligned input sequence family into known sequence subtypes, our model can jump between states corresponding to these different subtypes, depending on which subtype is locally most similar to a database sequence. Jumps between different subtypes are indicative of intersubtype recombinations. We applied our method to a large set of genome sequences from human immunodeficiency virus (HIV) and hepatitis C virus (HCV) as well as to simulated recombined genome sequences. CONCLUSION Our results demonstrate that jumps in our jumping profile HMM often correspond to recombination breakpoints; our approach can therefore be used to detect recombinations in genomic sequences. The recombination breakpoints identified by jpHMM were found to be significantly more accurate than breakpoints defined by traditional methods based on comparing single representative sequences.
Affiliation(s)
- Anne-Kathrin Schultz
- Institute of Microbiology and Genetics, University of Göttingen, Goldschmidtstr. 1, 37077 Göttingen, Germany
- Ming Zhang
- Institute of Microbiology and Genetics, University of Göttingen, Goldschmidtstr. 1, 37077 Göttingen, Germany
- Theoretical Division, Los Alamos National Laboratory, Los Alamos, NM 87545, USA
- Thomas Leitner
- Theoretical Division, Los Alamos National Laboratory, Los Alamos, NM 87545, USA
- Carla Kuiken
- Theoretical Division, Los Alamos National Laboratory, Los Alamos, NM 87545, USA
- Bette Korber
- Theoretical Division, Los Alamos National Laboratory, Los Alamos, NM 87545, USA
- The Santa Fe Institute, Santa Fe, NM 87501, USA
- Burkhard Morgenstern
- Institute of Microbiology and Genetics, University of Göttingen, Goldschmidtstr. 1, 37077 Göttingen, Germany
- Mario Stanke
- Institute of Microbiology and Genetics, University of Göttingen, Goldschmidtstr. 1, 37077 Göttingen, Germany
|
42
|
Chatterji S, Pachter L. Reference based annotation with GeneMapper. Genome Biol 2006; 7:R29. [PMID: 16600017 PMCID: PMC1557983 DOI: 10.1186/gb-2006-7-4-r29] [Citation(s) in RCA: 59] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/24/2005] [Revised: 02/03/2006] [Accepted: 03/03/2006] [Indexed: 11/22/2022] Open
Abstract
We introduce GeneMapper, a program for transferring annotations from a well annotated genome to other genomes. Drawing on high quality curated annotations, GeneMapper enables rapid and accurate annotation of newly sequenced genomes and is suitable for both finished and draft genomes. GeneMapper uses a profile based approach for mapping genes into multiple species, improving upon the standard pairwise approach. GeneMapper is freely available for academic use.
Affiliation(s)
- Sourav Chatterji
- Department of Computer Science, University of California at Berkeley, Berkeley, CA, 94720, USA
- Lior Pachter
- Department of Mathematics, University of California at Berkeley, Berkeley, CA 94720, USA
|
43
|
Ko P, Narayanan M, Kalyanaraman A, Aluru S. Space-conserving optimal DNA-protein alignment. PROCEEDINGS. IEEE COMPUTATIONAL SYSTEMS BIOINFORMATICS CONFERENCE 2006:80-8. [PMID: 16448002 DOI: 10.1109/csb.2004.1332420] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/10/2022]
Abstract
DNA-protein alignment algorithms can be used to discover coding sequences in a genomic sequence, if the corresponding protein derivatives are known. They can also be used to identify potential coding sequences of a newly sequenced genome, by using proteins from related species. Previously known algorithms either solve a simplified formulation, or sacrifice optimality to achieve practical implementation. In this paper, we present a comprehensive formulation of the DNA-protein alignment problem, and an algorithm to compute the optimal alignment in O(mn) time using only four tables of size (m + 1) x (n + 1), where m and n are the lengths of the DNA and protein sequences, respectively. We also developed a Protein and DNA Alignment program PanDA that implements the proposed solution. Experimental results indicate that our algorithm produces high quality alignments.
Affiliation(s)
- Pang Ko
- Department of Electrical and Computer Engineering, Iowa State University, USA.
|
44
|
Bekaert M, Richard H, Prum B, Rousset JP. Identification of programmed translational -1 frameshifting sites in the genome of Saccharomyces cerevisiae. Genome Res 2006; 15:1411-20. [PMID: 16204194 PMCID: PMC1240084 DOI: 10.1101/gr.4258005] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/25/2022]
Abstract
Frameshifting is a recoding event that allows the expression of two polypeptides from the same mRNA molecule. Most recoding events described so far are used by viruses and transposons to express their replicase protein. The very few cellular proteins known to be expressed by -1 ribosomal frameshifting have been identified by chance. The goal of the present work was to set up a systematic strategy, based on complementary bioinformatics, molecular biology, and functional approaches, without a priori knowledge of the mechanism involved. Two independent methods were devised. The first looks for genomic regions in which two ORFs, each carrying a protein pattern, are in a frameshifted arrangement. The second uses Hidden Markov Models and likelihood in a two-step approach. When this strategy was applied to the Saccharomyces cerevisiae genome, 189 candidate regions were found, of which 58 were further functionally investigated. Twenty-eight of them expressed a full-length mRNA covering the two ORFs, and 11 showed a -1 frameshift efficiency varying from 5% to 13% (50-fold higher than background), some of which correspond to genes with known functions. From other ascomycetes, four frameshifted ORFs are found fully conserved. Strikingly, most of the candidates do not display a classical viral-like frameshift signal and would have escaped a search based on current models of frameshifting. These results strongly suggest that -1 frameshifting might be more widely distributed than previously thought.
Affiliation(s)
- Michaël Bekaert
- Institut de Génétique et Microbiologie CNRS UMR 8621, Université Paris-Sud, 91405 Orsay Cedex, France
|
45
|
Volpe JM, Cowell LG, Kepler TB. SoDA: implementation of a 3D alignment algorithm for inference of antigen receptor recombinations. Bioinformatics 2005; 22:438-44. [PMID: 16357034 DOI: 10.1093/bioinformatics/btk004] [Citation(s) in RCA: 99] [Impact Index Per Article: 5.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION The antigen receptors of adaptive immunity (T-cell receptors and immunoglobulins) are encoded by genes assembled stochastically from combinatorial libraries of gene segments. Immunoglobulin genes then experience further diversification through hypermutation. Analysis of the somatic genetics of the immune response depends explicitly on inference of the details of the recombinatorial process giving rise to each of the participating antigen receptor genes. We have developed a dynamic programming algorithm to perform this reconstruction and have implemented it as web-accessible software called SoDA (Somatic Diversification Analysis). RESULTS We tested SoDA against a set of 120 artificial immunoglobulin sequences generated by simulation of recombination and compared the results with two other widely used programs. SoDA inferred the correct gene segments more frequently than the other two programs. We further tested these programs using 30 human immunoglobulin genes from Genbank and here highlight instances where the recombinations inferred by the three programs differ. SoDA appears generally to find more likely recombinations.
Affiliation(s)
- Joseph M Volpe
- Computational Biology and Bioinformatics Program, Duke University Medical Center Box 90090 Duke University, Durham, NC 27708-0090, USA
|
46
|
Abstract
Driven by competition, automation, and technology, the genomics community has far exceeded its ambition to sequence the human genome by 2005. By analyzing mammalian genomes, we have shed light on the history of our DNA sequence, determined that alternatively spliced RNAs and retroposed pseudogenes are incredibly abundant, and glimpsed the apparently huge number of non-coding RNAs that play significant roles in gene regulation. Ultimately, genome science is likely to provide comprehensive catalogs of these elements. However, the methods we have been using for most of the last 10 years will not yield even one complete open reading frame (ORF) for every gene--the first plateau on the long climb toward a comprehensive catalog. These strategies--sequencing randomly selected cDNA clones, aligning protein sequences identified in other organisms, sequencing more genomes, and manual curation--will have to be supplemented by large-scale amplification and sequencing of specific predicted mRNAs. The steady improvements in gene prediction that have occurred over the last 10 years have increased the efficacy of this approach and decreased its cost. In this Perspective, I review the state of gene prediction roughly 10 years ago, summarize the progress that has been made since, argue that the primary ORF identification methods we have relied on so far are inadequate, and recommend a path toward completing the Catalog of Protein Coding Genes, Version 1.0.
Affiliation(s)
- Michael R Brent
- Laboratory for Computational Genomics and Department of Computer Science, Washington University, St. Louis, Missouri 63130, USA.
|
47
|
Lomsadze A, Ter-Hovhannisyan V, Chernoff YO, Borodovsky M. Gene identification in novel eukaryotic genomes by self-training algorithm. Nucleic Acids Res 2005; 33:6494-506. [PMID: 16314312 PMCID: PMC1298918 DOI: 10.1093/nar/gki937] [Citation(s) in RCA: 546] [Impact Index Per Article: 28.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/05/2005] [Revised: 10/12/2005] [Accepted: 10/12/2005] [Indexed: 11/25/2022] Open
Abstract
Finding new protein-coding genes is one of the most important goals of eukaryotic genome sequencing projects. However, genomic organization of novel eukaryotic genomes is diverse and ab initio gene finding tools tuned up for previously studied species are rarely suitable for efficacious gene hunting in DNA sequences of a new genome. Gene identification methods based on cDNA and expressed sequence tag (EST) mapping to genomic DNA or those using alignments to closely related genomes rely either on existence of abundant cDNA and EST data and/or availability of reference genomes. Conventional statistical ab initio methods require large training sets of validated genes for estimating gene model parameters. In practice, neither one of these types of data may be available in sufficient amount until rather late stages of the novel genome sequencing. Nevertheless, we have shown that gene finding in eukaryotic genomes could be carried out in parallel with statistical models estimation directly from yet anonymous genomic DNA. The suggested method of parallelization of gene prediction with the model parameters estimation follows the path of the iterative Viterbi training. Rounds of genomic sequence labeling into coding and non-coding regions are followed by the rounds of model parameters estimation. Several dynamically changing restrictions on the possible range of model parameters are added to filter out fluctuations in the initial steps of the algorithm that could redirect the iteration process away from the biologically relevant point in parameter space. Tests on well-studied eukaryotic genomes have shown that the new method performs comparably to or better than conventional methods where the supervised model training precedes the gene prediction step. Several novel genomes have been analyzed and biologically interesting findings are discussed. Thus, a self-training algorithm that had been assumed feasible only for prokaryotic genomes has now been developed for ab initio eukaryotic gene identification.
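The alternation between labeling rounds and parameter re-estimation rounds can be sketched in Python on a toy model. This is an illustration only, not the method from the abstract: it assumes a two-state HMM (`coding`, `noncoding`) with single-nucleotide emissions, a fixed and hypothetical self-transition probability `stay`, and no restrictions on the parameter range. All function names and parameter values are assumptions.

```python
import math

STATES = ("coding", "noncoding")
ALPHABET = "ACGT"


def viterbi(seq, emit, stay=0.9):
    """Most likely state path for a non-empty seq under a two-state HMM."""
    log = math.log
    trans = {(a, b): log(stay if a == b else 1.0 - stay)
             for a in STATES for b in STATES}
    score = {s: log(0.5) + log(emit[s][seq[0]]) for s in STATES}
    back = []
    for ch in seq[1:]:
        new, ptr = {}, {}
        for s in STATES:
            p = max(STATES, key=lambda q: score[q] + trans[(q, s)])
            new[s] = score[p] + trans[(p, s)] + log(emit[s][ch])
            ptr[s] = p
        back.append(ptr)
        score = new
    state = max(STATES, key=score.get)
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]


def reestimate(seq, path, pseudocount=1.0):
    """Re-estimate emission probabilities from the current labeling."""
    counts = {s: {c: pseudocount for c in ALPHABET} for s in STATES}
    for ch, s in zip(seq, path):
        counts[s][ch] += 1.0
    return {s: {c: counts[s][c] / sum(counts[s].values()) for c in ALPHABET}
            for s in STATES}


def viterbi_training(seq, emit, rounds=5):
    """Alternate labeling and re-estimation for a fixed number of rounds (>= 1)."""
    for _ in range(rounds):
        path = viterbi(seq, emit)
        emit = reestimate(seq, path)
    return path, emit


if __name__ == "__main__":
    start = {"coding": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
             "noncoding": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}}
    labels, params = viterbi_training("GCGCGCGCGCATATATATAT", start)
    print(labels)
```

A realistic gene finder would use codon-aware states, explicit length distributions, and splice-site models; the sketch only shows the shape of the iteration.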
Affiliation(s)
- Alexandre Lomsadze
- School of Biology, Georgia Institute of Technology, Atlanta, GA 30332-0230, USA
- Yury O. Chernoff
- School of Biology, Georgia Institute of Technology, Atlanta, GA 30332-0230, USA
- Mark Borodovsky
- School of Biology, Georgia Institute of Technology, Atlanta, GA 30332-0230, USA
- Department of Biomedical Engineering, Georgia Institute of Technology, Atlanta, GA 30332-0535, USA
|
48
|
Churbanov A, Pauley M, Quest D, Ali H. A method of precise mRNA/DNA homology-based gene structure prediction. BMC Bioinformatics 2005; 6:261. [PMID: 16242044 PMCID: PMC1274302 DOI: 10.1186/1471-2105-6-261] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/08/2005] [Accepted: 10/21/2005] [Indexed: 11/29/2022] Open
Abstract
Background Accurate and automatic gene finding and structural prediction is a common problem in bioinformatics, and applications need to be capable of handling non-canonical splice sites, micro-exons and partial gene structure predictions that span across several genomic clones. Results We present an mRNA/DNA homology-based gene structure prediction tool, GIGOgene. We use a new affine gap penalty splice-enhanced global alignment algorithm running in linear memory for a high quality annotation of splice sites. Our tool includes a novel algorithm to assemble partial gene structure predictions using interval graphs. GIGOgene exhibited a sensitivity of 99.08% and a specificity of 99.98% on the Genie learning set, and demonstrated a higher quality of gene structural prediction when compared to Sim4, est2genome, Spidey, Galahad and BLAT, including when genes contained micro-exons and non-canonical splice sites. GIGOgene showed an acceptable loss of prediction quality when confronted with a noisy Genie learning set simulating ESTs. Conclusion GIGOgene shows a higher quality of gene structure prediction for mRNA/DNA spliced alignment when compared to other available tools.
Affiliation(s)
- Alexander Churbanov
- Department of Computer Science, College of Information Science and Technology, University of Nebraska at Omaha, Omaha, NE 68182-0116, USA
- Mark Pauley
- Department of Computer Science, College of Information Science and Technology, University of Nebraska at Omaha, Omaha, NE 68182-0116, USA
- Daniel Quest
- Department of Computer Science, College of Information Science and Technology, University of Nebraska at Omaha, Omaha, NE 68182-0116, USA
- Hesham Ali
- Department of Computer Science, College of Information Science and Technology, University of Nebraska at Omaha, Omaha, NE 68182-0116, USA
|
49
|
Wang Z, Chen Y, Li Y. A brief review of computational gene prediction methods. GENOMICS PROTEOMICS & BIOINFORMATICS 2005; 2:216-21. [PMID: 15901250 PMCID: PMC5187414 DOI: 10.1016/s1672-0229(04)02028-5] [Citation(s) in RCA: 60] [Impact Index Per Article: 3.2] [Indexed: 11/27/2022]
Abstract
With the development of genome sequencing for many organisms, more and more raw sequences need to be annotated. Gene prediction by computational methods for finding the location of protein coding regions is one of the essential issues in bioinformatics. Two classes of methods are generally adopted: similarity based searches and ab initio prediction. Here, we review the development of gene prediction methods, summarize the measures for evaluating predictor quality, highlight open problems in this area, and discuss future research directions.
Affiliation(s)
- Zhuo Wang
- Biomedical Instrument Institute, Shanghai Jiaotong University, Shanghai 200030, China
- Shanghai Center for Bioinformation Technology, Shanghai 200035, China
- Corresponding authors.
- Yazhu Chen
- Biomedical Instrument Institute, Shanghai Jiaotong University, Shanghai 200030, China
- Yixue Li
- Shanghai Center for Bioinformation Technology, Shanghai 200035, China
- Corresponding authors.
|
50
|
Gouret P, Vitiello V, Balandraud N, Gilles A, Pontarotti P, Danchin EGJ. FIGENIX: intelligent automation of genomic annotation: expertise integration in a new software platform. BMC Bioinformatics 2005; 6:198. [PMID: 16083500 PMCID: PMC1188056 DOI: 10.1186/1471-2105-6-198] [Citation(s) in RCA: 103] [Impact Index Per Article: 5.4] [Received: 04/21/2005] [Accepted: 08/05/2005] [Indexed: 11/24/2022]
Abstract
BACKGROUND Two of the main objectives of the genomic and post-genomic era are to annotate genomes structurally and functionally, which consists of detecting the position and structure of genes and inferring their function (as well as that of other genomic features). Structural and functional annotation both require the complex chaining of numerous software tools, algorithms, and methods under the supervision of a biologist. The automation of these pipelines is necessary to manage the huge amounts of data released by sequencing projects. Several pipelines already automate some of this complex chaining but still require a substantial contribution from biologists to supervise and control the results at various steps. RESULTS Here we propose an innovative automated platform, FIGENIX, which includes an expert system capable of substituting for human expertise at several key steps. FIGENIX currently automates complex pipelines of structural and functional annotation under the supervision of the expert system (which can, for example, make key decisions, check intermediate results, or refine the dataset). The quality of the results produced by FIGENIX is comparable to that obtained by expert biologists, with a drastic gain in time and avoidance of errors due to human manipulation of data. CONCLUSION The core engine and expert system of the FIGENIX platform currently handle complex annotation processes of broad interest for the genomic community. They could be easily adapted to new or more specialized pipelines, such as the annotation of miRNAs, the classification of complex multigenic families, and the annotation of regulatory elements and other genomic features of interest.
Affiliation(s)
- Philippe Gouret
- Phylogenomics Laboratory, EA 3781 EGEE (Evolution, Genome, Environment), Université de Provence, Case 36, Pl. V. Hugo, 13331 Marseille Cedex 03, France
- Vérane Vitiello
- Phylogenomics Laboratory, EA 3781 EGEE (Evolution, Genome, Environment), Université de Provence, Case 36, Pl. V. Hugo, 13331 Marseille Cedex 03, France
- Nathalie Balandraud
- Phylogenomics Laboratory, EA 3781 EGEE (Evolution, Genome, Environment), Université de Provence, Case 36, Pl. V. Hugo, 13331 Marseille Cedex 03, France
- André Gilles
- Phylogenomics Laboratory, EA 3781 EGEE (Evolution, Genome, Environment), Université de Provence, Case 36, Pl. V. Hugo, 13331 Marseille Cedex 03, France
- Pierre Pontarotti
- Phylogenomics Laboratory, EA 3781 EGEE (Evolution, Genome, Environment), Université de Provence, Case 36, Pl. V. Hugo, 13331 Marseille Cedex 03, France
- Etienne GJ Danchin
- Phylogenomics Laboratory, EA 3781 EGEE (Evolution, Genome, Environment), Université de Provence, Case 36, Pl. V. Hugo, 13331 Marseille Cedex 03, France
- AFMB-UMR 6098 - CNRS - U1 - U2, Glycogenomics and Biomedical Structural Biology, Case 932, 163 Avenue de Luminy, 13288 Marseille Cedex 09, France
|