Reference Citation Analysis: Find an Article, Find a Category, Find a Journal, Find a Scholar

For: Gelfand MS, Mironov AA, Pevzner PA. Gene recognition via spliced sequence alignment. Proc Natl Acad Sci U S A 1996;93:9061-6. [PMID: 8799154 PMCID: PMC38595 DOI: 10.1073/pnas.93.17.9061] [Citation(s) in RCA: 192] [Impact Index Per Article: 6.9] [Indexed: 02/02/2023]

Cited by Other Article(s)

Dvorkina T, Bankevich A, Sorokin A, Yang F, Adu-Oppong B, Williams R, Turner K, Pevzner PA. ORFograph: search for novel insecticidal protein genes in genomic and metagenomic assembly graphs. Microbiome 2021;9:149. [PMID: 34183047 PMCID: PMC8240309 DOI: 10.1186/s40168-021-01092-z] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 03/26/2021] [Accepted: 05/11/2021] [Indexed: 05/07/2023]

Abstract

BACKGROUND

Since the prolonged use of insecticidal proteins has led to toxin resistance, it is important to search for novel insecticidal protein genes (IPGs) that are effective in controlling resistant insect populations. IPGs are usually encoded in the genomes of entomopathogenic bacteria, especially in large plasmids in strains of the ubiquitous soil bacteria, Bacillus thuringiensis (Bt). Since there are often multiple similar IPGs encoded by such plasmids, their assemblies are typically fragmented and many IPGs are scattered through multiple contigs. As a result, existing gene prediction tools (that analyze individual contigs) typically predict partial rather than complete IPGs, making it difficult to conduct downstream IPG engineering efforts in agricultural genomics.

METHODS

Although it is difficult to assemble IPGs in a single contig, the structure of the genome assembly graph often provides clues on how to combine multiple contigs into segments encoding a single IPG.

RESULTS

We describe ORFograph, a pipeline for predicting IPGs in assembly graphs, benchmark it on (meta)genomic datasets, and discover nearly a hundred novel IPGs. This work shows that graph-aware gene prediction tools enable the discovery of greater diversity of IPGs from (meta)genomes.

CONCLUSIONS

We demonstrated that analysis of the assembly graphs reveals novel candidate IPGs. ORFograph identified both already known genes "hidden" in assembly graphs and potential novel IPGs that evaded existing tools for IPG identification. As ORFograph is fast, one could imagine a pipeline that processes many (meta)genomic assembly graphs to identify, for phenotypic testing, even more novel IPGs that were previously inaccessible to traditional gene-finding methods. While here we demonstrated the results of ORFograph only for IPGs, the proposed approach can be generalized to any class of genes.

Brůna T, Hoff KJ, Lomsadze A, Stanke M, Borodovsky M. BRAKER2: automatic eukaryotic genome annotation with GeneMark-EP+ and AUGUSTUS supported by a protein database. NAR Genom Bioinform 2021;3:lqaa108. [PMID: 33575650 PMCID: PMC7787252 DOI: 10.1093/nargab/lqaa108] [Citation(s) in RCA: 510] [Impact Index Per Article: 170.0] [Received: 08/10/2020] [Revised: 11/26/2020] [Accepted: 12/20/2020] [Indexed: 12/13/2022]

Levy Karin E, Mirdita M, Söding J. MetaEuk-sensitive, high-throughput gene discovery, and annotation for large-scale eukaryotic metagenomics. Microbiome 2020;8:48. [PMID: 32245390 PMCID: PMC7126354 DOI: 10.1186/s40168-020-00808-x] [Citation(s) in RCA: 86] [Impact Index Per Article: 21.5] [Received: 11/15/2019] [Accepted: 02/14/2020] [Indexed: 05/10/2023]

Abstract

BACKGROUND

Metagenomics is revolutionizing the study of microorganisms and their involvement in biological, biomedical, and geochemical processes, allowing us to investigate by direct sequencing a tremendous diversity of organisms without the need for prior cultivation. Unicellular eukaryotes play essential roles in most microbial communities as chief predators, decomposers, phototrophs, bacterial hosts, symbionts, and parasites to plants and animals. Investigating their roles is therefore of great interest to ecology, biotechnology, human health, and evolution. However, the generally lower sequencing coverage, their more complex gene and genome architectures, and a lack of eukaryote-specific experimental and computational procedures have kept them on the sidelines of metagenomics.

RESULTS

MetaEuk is a toolkit for high-throughput, reference-based discovery, and annotation of protein-coding genes in eukaryotic metagenomic contigs. It performs fast searches with 6-frame-translated fragments covering all possible exons and optimally combines matches into multi-exon proteins. We used a benchmark of seven diverse, annotated genomes to show that MetaEuk is highly sensitive even under conditions of low sequence similarity to the reference database. To demonstrate MetaEuk's power to discover novel eukaryotic proteins in large-scale metagenomic data, we assembled contigs from 912 samples of the Tara Oceans project. MetaEuk predicted >12,000,000 protein-coding genes in 8 days on ten 16-core servers. Most of the discovered proteins are highly diverged from known proteins and originate from very sparsely sampled eukaryotic supergroups.

CONCLUSION

The open-source (GPLv3) MetaEuk software (https://github.com/soedinglab/metaeuk) enables large-scale eukaryotic metagenomics through reference-based, sensitive taxonomic and functional annotation.
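The RESULTS paragraph above says that matches to candidate exons are combined into multi-exon proteins. The following is a minimal sketch of one way such a combination could be computed, by chaining order-compatible hits with dynamic programming. It is not MetaEuk's actual algorithm or scoring: the `ExonHit` record, the `compatible` rule, and `best_chain` are hypothetical names introduced for illustration, and every hit is assumed to cover a non-empty interval on the same strand.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ExonHit:
    """Hypothetical local match between a contig fragment and a reference protein."""
    contig_start: int  # 0-based, inclusive
    contig_end: int    # exclusive
    prot_start: int    # 0-based, inclusive
    prot_end: int      # exclusive
    score: float


def compatible(a: ExonHit, b: ExonHit) -> bool:
    """True if b may follow a: downstream on the contig and later in the protein."""
    return a.contig_end <= b.contig_start and a.prot_end <= b.prot_start


def best_chain(hits: List[ExonHit]) -> Tuple[float, List[ExonHit]]:
    """Return the highest-scoring ordered chain of mutually compatible hits."""
    if not hits:
        return 0.0, []
    # Sorting by contig start guarantees any valid predecessor appears earlier.
    ordered = sorted(hits, key=lambda h: (h.contig_start, h.prot_start))
    best = [h.score for h in ordered]   # best chain score ending at each hit
    prev = [-1] * len(ordered)          # back-pointers for chain recovery
    for j in range(len(ordered)):
        for i in range(j):
            candidate = best[i] + ordered[j].score
            if compatible(ordered[i], ordered[j]) and candidate > best[j]:
                best[j] = candidate
                prev[j] = i
    end = max(range(len(ordered)), key=best.__getitem__)
    chain = []
    k = end
    while k != -1:
        chain.append(ordered[k])
        k = prev[k]
    chain.reverse()
    return best[end], chain
```

The quadratic loop keeps the sketch short; a production tool would also have to handle strands, reading frames, and intron-length limits, none of which are modelled here.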

Armstrong J, Fiddes IT, Diekhans M, Paten B. Whole-Genome Alignment and Comparative Annotation. Annu Rev Anim Biosci 2019;7:41-64. [PMID: 30379572 PMCID: PMC6450745 DOI: 10.1146/annurev-animal-020518-115005] [Citation(s) in RCA: 41] [Impact Index Per Article: 8.2] [Indexed: 12/12/2022]

Kou Q, Wu S, Tolic N, Paša-Tolic L, Liu Y, Liu X. A mass graph-based approach for the identification of modified proteoforms using top-down tandem mass spectra. Bioinformatics 2018;33:1309-1316. [PMID: 28453668 DOI: 10.1093/bioinformatics/btw806] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.7] [Received: 09/13/2016] [Accepted: 12/15/2016] [Indexed: 11/14/2022]

Chowdhury B, Garai A, Garai G. An optimized approach for annotation of large eukaryotic genomic sequences using genetic algorithm. BMC Bioinformatics 2017;18:460. [PMID: 29065853 PMCID: PMC5655831 DOI: 10.1186/s12859-017-1874-7] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 06/14/2017] [Accepted: 10/17/2017] [Indexed: 01/06/2023]

Singh A, Mishra A, Khosravi A, Khandelwal G, Jayaram B. Physico-chemical fingerprinting of RNA genes. Nucleic Acids Res 2017;45:e47. [PMID: 27932456 PMCID: PMC5397174 DOI: 10.1093/nar/gkw1236] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Received: 11/09/2016] [Accepted: 11/29/2016] [Indexed: 12/13/2022]

Bermudez-Santana CI. Aplicaciones de la bioinformática en la medicina: el genoma humano. ¿Cómo podemos ver tanto detalle? [Applications of bioinformatics in medicine: the human genome. How can we see so much detail?] Acta Biológica Colombiana 2016. [DOI: 10.15446/abc.v21n1supl.51233] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/09/2022]

Podicheti R, Mockaitis K. FEATnotator: A tool for integrated annotation of sequence features and variation, facilitating interpretation in genomics experiments. Methods 2015;79-80:11-7. [PMID: 25934264 DOI: 10.1016/j.ymeth.2015.04.028] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.8] [Received: 10/28/2014] [Revised: 03/25/2015] [Accepted: 04/22/2015] [Indexed: 11/16/2022]

Adi SS, Ferreira CE. Syntenic global alignment and its application to the gene prediction problem. Journal of the Brazilian Computer Society 2013. [DOI: 10.1007/s13173-013-0115-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/26/2022]

Chakraborty S. A fragmented alignment method detects a putative phosphorylation site and a putative BRC repeat in the Drosophila melanogaster BRCA2 protein. F1000Res 2013;2:143. [PMID: 24627786 PMCID: PMC3924952 DOI: 10.12688/f1000research.2-143.v2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 10/07/2013] [Indexed: 11/28/2022]

Abstract

Mutations in the BRCA2 tumor suppressor protein leave individuals susceptible to breast, ovarian and other cancers. The BRCA2 protein is a critical component of the DNA repair pathways in eukaryotes, and also plays an integral role in fostering genomic variability through meiotic recombination. Although present in many eukaryotes, as a whole the BRCA2 gene is weakly conserved. Conserved fragments of 30 amino acids (BRC repeats), which mediate interactions with the recombinase RAD51, helped detect orthologs of this protein in other organisms. The carboxy-terminal of the human BRCA2 has been shown to be phosphorylated by checkpoint kinases (Chk1/Chk2) at T3387, which regulate the sequestration of RAD51 on DNA damage. However, apart from three BRC repeats, the Drosophila melanogaster gene has not been annotated and associated with other functionally relevant sequence fragments in human BRCA2. In the current work, the carboxy-terminal phosphorylation threonine site (E=9.1e-4) and a new BRC repeat (E=17e-4) in D. melanogaster have been identified, using a fragmented alignment methodology (FRAGAL). In a similar study, FRAGAL has also identified a novel half-a-tetratricopeptide (HAT) motif (E=11e-4), a helical repeat motif implicated in various aspects of RNA metabolism, in Utp6 from yeast. The characteristic three aromatic residues with conserved spacing are observed in this new HAT repeat, further strengthening my claim. The reference and target sequences are sliced into overlapping fragments of equal parameterized lengths. All pairs of fragments in the reference and target proteins are aligned, and the gap penalties are adjusted to discourage gaps in the middle of the alignment. The results of the best matches are sorted based on differing criteria to aid the detection of known and putative sequences. The source code for FRAGAL results on these sequences is available at https://github.com/sanchak/FragalCode, while the database can be accessed at www.sanchak.com/fragal.html.
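The slicing and all-pairs comparison described in the abstract above can be illustrated with a short sketch. This is not the FRAGAL code: the function names are hypothetical, and a plain ungapped identity count stands in for the gap-penalised alignment score that the abstract describes.

```python
from typing import List, Tuple


def overlapping_fragments(seq: str, length: int, step: int = 1) -> List[Tuple[int, str]]:
    """Slice seq into equal-length fragments that start every `step` residues."""
    if length <= 0 or step <= 0:
        raise ValueError("length and step must be positive")
    return [(i, seq[i:i + length]) for i in range(0, len(seq) - length + 1, step)]


def identity(a: str, b: str) -> int:
    """Ungapped stand-in score: number of positions with identical residues."""
    return sum(x == y for x, y in zip(a, b))


def rank_fragment_pairs(reference: str, target: str, length: int,
                        step: int = 1, top: int = 5) -> List[Tuple[int, int, int]]:
    """Score every reference/target fragment pair and return the best ones.

    Each result is (score, reference_offset, target_offset), sorted by
    decreasing score and then by offsets so that ties are deterministic.
    """
    ref = overlapping_fragments(reference, length, step)
    tgt = overlapping_fragments(target, length, step)
    scored = [(identity(r, t), i, j) for i, r in ref for j, t in tgt]
    scored.sort(key=lambda s: (-s[0], s[1], s[2]))
    return scored[:top]
```

For example, `rank_fragment_pairs("MKTAYIAK", "GGTAYIGG", 4)` ranks the pair of fragments at offsets 2 and 2 (both `TAYI`) first. Sorting on other keys, as the abstract mentions, would only require changing the `sort` key.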

Wu YW, Rho M, Doak TG, Ye Y. Stitching gene fragments with a network matching algorithm improves gene assembly for metagenomics. Bioinformatics 2013;28:i363-i369. [PMID: 22962453 PMCID: PMC3436815 DOI: 10.1093/bioinformatics/bts388] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.1] [Indexed: 11/23/2022]

Goodswen SJ, Kennedy PJ, Ellis JT. Evaluating high-throughput ab initio gene finders to discover proteins encoded in eukaryotic pathogen genomes missed by laboratory techniques. PLoS One 2012;7:e50609. [PMID: 23226328 PMCID: PMC3511556 DOI: 10.1371/journal.pone.0050609] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.0] [Received: 06/17/2012] [Accepted: 10/24/2012] [Indexed: 11/25/2022]

DNA-energetics-based analyses suggest additional genes in prokaryotes. J Biosci 2012;37:433-44. [PMID: 22750981 DOI: 10.1007/s12038-012-9221-7] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Indexed: 10/28/2022]

Goodswen SJ, Kennedy PJ, Ellis JT. A guide to in silico vaccine discovery for eukaryotic pathogens. Brief Bioinform 2012;14:753-74. [PMID: 23097412 DOI: 10.1093/bib/bbs066] [Citation(s) in RCA: 27] [Impact Index Per Article: 2.3] [Indexed: 11/14/2022]

Iwata H, Gotoh O. Benchmarking spliced alignment programs including Spaln2, an extended version of Spaln that incorporates additional species-specific features. Nucleic Acids Res 2012;40:e161. [PMID: 22848105 PMCID: PMC3488211 DOI: 10.1093/nar/gks708] [Citation(s) in RCA: 113] [Impact Index Per Article: 9.4] [Indexed: 11/12/2022]

Alioto T. Gene prediction. Methods Mol Biol 2012;855:175-201. [PMID: 22407709 DOI: 10.1007/978-1-61779-582-4_6] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Indexed: 05/31/2023]

Brosch M, Saunders GI, Frankish A, Collins MO, Yu L, Wright J, Verstraten R, Adams DJ, Harrow J, Choudhary JS, Hubbard T. Shotgun proteomics aids discovery of novel protein-coding genes, alternative splicing, and "resurrected" pseudogenes in the mouse genome. Genome Res 2011;21:756-67. [PMID: 21460061 DOI: 10.1101/gr.114272.110] [Citation(s) in RCA: 87] [Impact Index Per Article: 6.7] [Indexed: 12/16/2022]

Pokhriyal N, Ponts N, Harris EY, Le Roch KG, Lonardi S. Novel Gene Discovery in the Human Malaria Parasite using Nucleosome Positioning Data. Computational Systems Bioinformatics. Computational Systems Bioinformatics Conference 2010;9:124-135. [PMID: 25076982 PMCID: PMC4112967] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/03/2023]

Luo L, Li H, Zhang L. ORF organization and gene recognition in the yeast genome. Comp Funct Genomics 2010;4:318-28. [PMID: 18629282 PMCID: PMC2448446 DOI: 10.1002/cfg.292] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Received: 10/31/2002] [Revised: 03/03/2003] [Accepted: 03/10/2003] [Indexed: 11/10/2022]

Picardi E, Pesole G. Computational methods for ab initio and comparative gene finding. Methods Mol Biol 2010;609:269-84. [PMID: 20221925 DOI: 10.1007/978-1-60327-241-4_16] [Citation(s) in RCA: 36] [Impact Index Per Article: 2.6] [Indexed: 12/03/2022]

Jost D, Everaers R. Genome wide application of DNA melting analysis. Journal of Physics: Condensed Matter 2009;21:034108. [PMID: 21817253 DOI: 10.1088/0953-8984/21/3/034108] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.5] [Indexed: 05/31/2023]

De Bona F, Ossowski S, Schneeberger K, Rätsch G. Optimal spliced alignments of short sequence reads. Bioinformatics 2008;24:i174-80. [PMID: 18689821 DOI: 10.1093/bioinformatics/btn300] [Citation(s) in RCA: 84] [Impact Index Per Article: 5.3] [Indexed: 11/13/2022]

Liu Q, Crammer K, Pereira FCN, Roos DS. Reranking candidate gene models with cross-species comparison for improved gene prediction. BMC Bioinformatics 2008;9:433. [PMID: 18854050 PMCID: PMC2587481 DOI: 10.1186/1471-2105-9-433] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.4] [Received: 06/18/2008] [Accepted: 10/14/2008] [Indexed: 11/10/2022]

Wu J. Improving the specificity of exon prediction using comparative genomics. BMC Genomics 2008;9 Suppl 2:S13. [PMID: 18831778 PMCID: PMC2559877 DOI: 10.1186/1471-2164-9-s2-s13] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 11/10/2022]

Ter-Hovhannisyan V, Lomsadze A, Chernoff YO, Borodovsky M. Gene prediction in novel fungal genomes using an ab initio algorithm with unsupervised training. Genome Res 2008;18:1979-90. [PMID: 18757608 DOI: 10.1101/gr.081612.108] [Citation(s) in RCA: 634] [Impact Index Per Article: 39.6] [Indexed: 11/25/2022]

Gotoh O. Direct mapping and alignment of protein sequences onto genomic sequence. Bioinformatics 2008;24:2438-44. [PMID: 18728043 DOI: 10.1093/bioinformatics/btn460] [Citation(s) in RCA: 39] [Impact Index Per Article: 2.4] [Indexed: 11/13/2022]

Kapustin Y, Souvorov A, Tatusova T, Lipman D. Splign: algorithms for computing spliced alignments with identification of paralogs. Biol Direct 2008;3:20. [PMID: 18495041 PMCID: PMC2440734 DOI: 10.1186/1745-6150-3-20] [Citation(s) in RCA: 244] [Impact Index Per Article: 15.3] [Received: 05/13/2008] [Accepted: 05/21/2008] [Indexed: 11/10/2022]

Jones SJM. Prediction of genomic functional elements. Annu Rev Genomics Hum Genet 2008;7:315-38. [PMID: 16824019 DOI: 10.1146/annurev.genom.7.080505.115745] [Citation(s) in RCA: 34] [Impact Index Per Article: 2.1] [Indexed: 11/09/2022]

Sarje A, Aluru S. Parallel biological sequence alignments on the Cell Broadband Engine. IEEE International Symposium on Parallel and Distributed Processing (IPDPS) 2008. [DOI: 10.1109/ipdps.2008.4536328] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.8] [Indexed: 11/06/2022]

Gotoh O. A space-efficient and accurate method for mapping and aligning cDNA sequences onto genomic sequence. Nucleic Acids Res 2008;36:2630-8. [PMID: 18344523 PMCID: PMC2377433 DOI: 10.1093/nar/gkn105] [Citation(s) in RCA: 65] [Impact Index Per Article: 4.1] [Indexed: 12/04/2022]

Gene Identification: Classical and Computational Intelligence Approaches. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 2008. [DOI: 10.1109/tsmcc.2007.906066] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.3] [Indexed: 11/06/2022]

Schulze U, Hepp B, Ong CS, Rätsch G. PALMA: mRNA to genome alignments using large margin algorithms. Bioinformatics 2007;23:1892-900. [PMID: 17537755 DOI: 10.1093/bioinformatics/btm275] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.6] [Indexed: 11/12/2022]

Parra G, Bradnam K, Korf I. CEGMA: a pipeline to accurately annotate core genes in eukaryotic genomes. Bioinformatics 2007;23:1061-7. [PMID: 17332020 DOI: 10.1093/bioinformatics/btm071] [Citation(s) in RCA: 1553] [Impact Index Per Article: 91.4] [Indexed: 11/13/2022]

Rätsch G, Sonnenburg S, Srinivasan J, Witte H, Müller KR, Sommer RJ, Schölkopf B. Improving the Caenorhabditis elegans genome annotation using machine learning. PLoS Comput Biol 2007;3:e20. [PMID: 17319737 PMCID: PMC1808025 DOI: 10.1371/journal.pcbi.0030020] [Citation(s) in RCA: 51] [Impact Index Per Article: 3.0] [Received: 02/02/2006] [Accepted: 12/20/2006] [Indexed: 11/19/2022]

Bernal A, Crammer K, Hatzigeorgiou A, Pereira F. Global discriminative learning for higher-accuracy computational gene prediction. PLoS Comput Biol 2007;3:e54. [PMID: 17367206 PMCID: PMC1828702 DOI: 10.1371/journal.pcbi.0030054] [Citation(s) in RCA: 60] [Impact Index Per Article: 3.5] [Received: 08/16/2006] [Accepted: 02/01/2007] [Indexed: 11/18/2022]

Abstract

Most ab initio gene predictors use a probabilistic sequence model, typically a hidden Markov model, to combine separately trained models of genomic signals and content. By combining separate models of relevant genomic features, such gene predictors can exploit small training sets and incomplete annotations, and can be trained fairly efficiently. However, that type of piecewise training does not optimize prediction accuracy and has difficulty in accounting for statistical dependencies among different parts of the gene model. With genomic information being created at an ever-increasing rate, it is worth investigating alternative approaches in which many different types of genomic evidence, with complex statistical dependencies, can be integrated by discriminative learning to maximize annotation accuracy. Among discriminative learning methods, large-margin classifiers have become prominent because of the success of support vector machines (SVM) in many classification tasks. We describe CRAIG, a new program for ab initio gene prediction based on a conditional random field model with semi-Markov structure that is trained with an online large-margin algorithm related to multiclass SVMs. Our experiments on benchmark vertebrate datasets and on regions from the ENCODE project show significant improvements in prediction accuracy over published gene predictors that use intrinsic features only, particularly at the gene level and on genes with long introns.

We describe a new approach to statistical learning for sequence data that is broadly applicable to computational biology problems and that has experimentally demonstrated advantages over current hidden Markov model (HMM)-based methods for sequence analysis. The methods we describe in this paper, implemented in the CRAIG program, allow researchers to modularly specify and train sequence analysis models that combine a wide range of weakly informative features into globally optimal predictions. Our results for the gene prediction problem show significant improvements over existing ab initio gene predictors on a variety of tests, including the specially challenging ENCODE regions. Such improved predictions, particularly on initial and single exons, could benefit researchers who are seeking more accurate means of recognizing such important features as signal peptides and regulatory regions. More generally, we believe that our method, by combining the structure-describing capabilities of HMMs with the accuracy of margin-based classification methods, provides a general tool for statistical learning in biological sequences that will replace HMMs in any sequence modeling task for which there is annotated training data.

Gertz EM, Yu YK, Agarwala R, Schäffer AA, Altschul SF. Composition-based statistics and translated nucleotide searches: improving the TBLASTN module of BLAST. BMC Biol 2006;4:41. [PMID: 17156431 PMCID: PMC1779365 DOI: 10.1186/1741-7007-4-41] [Citation(s) in RCA: 331] [Impact Index Per Article: 18.4] [Received: 09/06/2006] [Accepted: 12/07/2006] [Indexed: 11/29/2022]

Abstract

Background

TBLASTN is a mode of operation for BLAST that aligns protein sequences to a nucleotide database translated in all six frames. We present the first description of the modern implementation of TBLASTN, focusing on new techniques that were used to implement composition-based statistics for translated nucleotide searches. Composition-based statistics use the composition of the sequences being aligned to generate more accurate E-values, which allows for a more accurate distinction between true and false matches. Until recently, composition-based statistics were available only for protein-protein searches. They are now available as a command line option for recent versions of TBLASTN and as an option for TBLASTN on the NCBI BLAST web server.

Results

We evaluate the statistical and retrieval accuracy of the E-values reported by a baseline version of TBLASTN and by two variants that use different types of composition-based statistics. To test the statistical accuracy of TBLASTN, we ran 1000 searches using scrambled proteins from the mouse genome and a database of human chromosomes. To test retrieval accuracy, we modernize and adapt to translated searches a test set previously used to evaluate the retrieval accuracy of protein-protein searches. We show that composition-based statistics greatly improve the statistical accuracy of TBLASTN, at a small cost to the retrieval accuracy.

Conclusion

TBLASTN is widely used, as it is common to wish to compare proteins to chromosomes or to libraries of mRNAs. Composition-based statistics improve the statistical accuracy, and therefore the reliability, of TBLASTN results. The algorithms used by TBLASTN are not widely known, and some of the most important are reported here. The data used to test TBLASTN are available for download and may be useful in other studies of translated search algorithms.
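The abstract above describes aligning proteins to a nucleotide database "translated in all six frames". As a minimal sketch of that preprocessing step only (not of TBLASTN's search or statistics), the following translates a DNA string in the three forward and three reverse-complement frames using the standard genetic code. The function names and the frame labels `+1`..`-3` are choices made for this illustration; codons containing characters other than A, C, G, T are rendered as `X`.

```python
from itertools import product
from typing import Dict

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W"
         "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR"
         "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate complete codons from position 0; '*' marks a stop codon."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_translations(dna: str) -> Dict[str, str]:
    """Map frame labels '+1'..'+3' and '-1'..'-3' to their translations."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames
```

For the input `ATGGCCTAA`, frame `+1` reads `MA*` and frame `-1` (on the reverse complement `TTAGGCCAT`) reads `LGH`.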

Wong ESW, Young LJ, Papenfuss AT, Belov K. In silico identification of opossum cytokine genes suggests the complexity of the marsupial immune system rivals that of eutherian mammals. Immunome Res 2006;2:4. [PMID: 17094811 PMCID: PMC1660534 DOI: 10.1186/1745-7580-2-4] [Citation(s) in RCA: 40] [Impact Index Per Article: 2.2] [Received: 06/26/2006] [Accepted: 11/10/2006] [Indexed: 11/10/2022]

Wu J, Haussler D. Coding exon detection using comparative sequences. J Comput Biol 2006;13:1148-64. [PMID: 16901234 DOI: 10.1089/cmb.2006.13.1148] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.4] [Indexed: 11/13/2022]

Kent C, Landau GM, Ziv-Ukelson M. On the complexity of sparse exon assembly. J Comput Biol 2006;13:1013-27. [PMID: 16796548 DOI: 10.1089/cmb.2006.13.1013] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.3] [Indexed: 11/13/2022]

Schultz AK, Zhang M, Leitner T, Kuiken C, Korber B, Morgenstern B, Stanke M. A jumping profile Hidden Markov Model and applications to recombination sites in HIV and HCV genomes. BMC Bioinformatics 2006;7:265. [PMID: 16716226 PMCID: PMC1525204 DOI: 10.1186/1471-2105-7-265] [Citation(s) in RCA: 75] [Impact Index Per Article: 4.2] [Received: 12/16/2005] [Accepted: 05/22/2006] [Indexed: 12/11/2022]

Chatterji S, Pachter L. Reference based annotation with GeneMapper. Genome Biol 2006;7:R29. [PMID: 16600017 PMCID: PMC1557983 DOI: 10.1186/gb-2006-7-4-r29] [Citation(s) in RCA: 59] [Impact Index Per Article: 3.3] [Received: 11/24/2005] [Revised: 02/03/2006] [Accepted: 03/03/2006] [Indexed: 11/22/2022]

Ko P, Narayanan M, Kalyanaraman A, Aluru S. Space-conserving optimal DNA-protein alignment. Proceedings. IEEE Computational Systems Bioinformatics Conference 2006:80-8. [PMID: 16448002 DOI: 10.1109/csb.2004.1332420] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/10/2022]

Bekaert M, Richard H, Prum B, Rousset JP. Identification of programmed translational -1 frameshifting sites in the genome of Saccharomyces cerevisiae. Genome Res 2006;15:1411-20. [PMID: 16204194 PMCID: PMC1240084 DOI: 10.1101/gr.4258005] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.3] [Indexed: 11/25/2022]

Volpe JM, Cowell LG, Kepler TB. SoDA: implementation of a 3D alignment algorithm for inference of antigen receptor recombinations. Bioinformatics 2005;22:438-44. [PMID: 16357034 DOI: 10.1093/bioinformatics/btk004] [Citation(s) in RCA: 99] [Impact Index Per Article: 5.2] [Indexed: 11/14/2022]

Brent MR. Genome annotation past, present, and future: How to define an ORF at each locus. Genome Res 2005;15:1777-86. [PMID: 16339376 DOI: 10.1101/gr.3866105] [Citation(s) in RCA: 90] [Impact Index Per Article: 4.7] [Indexed: 11/25/2022]

Lomsadze A, Ter-Hovhannisyan V, Chernoff YO, Borodovsky M. Gene identification in novel eukaryotic genomes by self-training algorithm. Nucleic Acids Res 2005;33:6494-506. [PMID: 16314312 PMCID: PMC1298918 DOI: 10.1093/nar/gki937] [Citation(s) in RCA: 546] [Impact Index Per Article: 28.7] [Received: 08/05/2005] [Revised: 10/12/2005] [Accepted: 10/12/2005] [Indexed: 11/25/2022]

Abstract

Finding new protein-coding genes is one of the most important goals of eukaryotic genome sequencing projects. However, genomic organization of novel eukaryotic genomes is diverse and ab initio gene finding tools tuned up for previously studied species are rarely suitable for efficacious gene hunting in DNA sequences of a new genome. Gene identification methods based on cDNA and expressed sequence tag (EST) mapping to genomic DNA or those using alignments to closely related genomes rely either on existence of abundant cDNA and EST data and/or availability of reference genomes. Conventional statistical ab initio methods require large training sets of validated genes for estimating gene model parameters. In practice, neither one of these types of data may be available in sufficient amount until rather late stages of the novel genome sequencing. Nevertheless, we have shown that gene finding in eukaryotic genomes could be carried out in parallel with statistical models estimation directly from yet anonymous genomic DNA. The suggested method of parallelization of gene prediction with the model parameters estimation follows the path of the iterative Viterbi training. Rounds of genomic sequence labeling into coding and non-coding regions are followed by the rounds of model parameters estimation. Several dynamically changing restrictions on the possible range of model parameters are added to filter out fluctuations in the initial steps of the algorithm that could redirect the iteration process away from the biologically relevant point in parameter space. Tests on well-studied eukaryotic genomes have shown that the new method performs comparably or better than conventional methods where the supervised model training precedes the gene prediction step. Several novel genomes have been analyzed and biologically interesting findings are discussed. Thus, a self-training algorithm that had been assumed feasible only for prokaryotic genomes has now been developed for ab initio eukaryotic gene identification.
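The alternation described in the abstract above, rounds of labeling followed by rounds of parameter estimation, is the general pattern of iterative Viterbi training. The sketch below shows that pattern on a toy two-state hidden Markov model (non-coding `N`, coding `C`) with single-nucleotide emissions. It is an illustration under strong simplifying assumptions, not the published gene finder: there are no codon, exon, or intron models, the "dynamically changing restrictions" on parameters are omitted, and all names and the pseudocount value are hypothetical.

```python
import math
from typing import Dict, List, Tuple

STATES = ("N", "C")   # toy labels: non-coding, coding
ALPHABET = "ACGT"
Table = Dict[str, Dict[str, float]]


def viterbi(seq: str, start: Dict[str, float], trans: Table, emit: Table) -> List[str]:
    """Most probable state path for seq, computed in log space."""
    if not seq:
        raise ValueError("sequence must be non-empty")
    score = {s: math.log(start[s]) + math.log(emit[s][seq[0]]) for s in STATES}
    back = []
    for x in seq[1:]:
        row, ptr = {}, {}
        for s in STATES:
            p = max(STATES, key=lambda q: score[q] + math.log(trans[q][s]))
            row[s] = score[p] + math.log(trans[p][s]) + math.log(emit[s][x])
            ptr[s] = p
        score = row
        back.append(ptr)
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path


def reestimate(seq: str, path: List[str], pseudo: float = 1.0) -> Tuple[Table, Table]:
    """Count transitions and emissions along the path; pseudocounts avoid zeros."""
    emit = {s: {a: pseudo for a in ALPHABET} for s in STATES}
    trans = {s: {t: pseudo for t in STATES} for s in STATES}
    for x, s in zip(seq, path):
        emit[s][x] += 1
    for s, t in zip(path, path[1:]):
        trans[s][t] += 1

    def normalize(row: Dict[str, float]) -> Dict[str, float]:
        total = sum(row.values())
        return {k: v / total for k, v in row.items()}

    return ({s: normalize(trans[s]) for s in STATES},
            {s: normalize(emit[s]) for s in STATES})


def viterbi_training(seq: str, start: Dict[str, float], trans: Table, emit: Table,
                     max_rounds: int = 20) -> Tuple[List[str], Table, Table]:
    """Alternate labeling and re-estimation until the labeling stops changing."""
    path: List[str] = []
    for _ in range(max_rounds):
        new_path = viterbi(seq, start, trans, emit)
        if new_path == path:
            break
        path = new_path
        trans, emit = reestimate(seq, path)
    return path, trans, emit
```

Starting from rough initial emissions (for example, `N` favouring A/T and `C` favouring G/C), the loop sharpens the parameters from the unlabeled sequence itself, which is the sense in which such training is unsupervised.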

Churbanov A, Pauley M, Quest D, Ali H. A method of precise mRNA/DNA homology-based gene structure prediction. BMC Bioinformatics 2005;6:261. [PMID: 16242044 PMCID: PMC1274302 DOI: 10.1186/1471-2105-6-261] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Received: 03/08/2005] [Accepted: 10/21/2005] [Indexed: 11/29/2022]

Wang Z, Chen Y, Li Y. A brief review of computational gene prediction methods. Genomics Proteomics & Bioinformatics 2005;2:216-21. [PMID: 15901250 PMCID: PMC5187414 DOI: 10.1016/s1672-0229(04)02028-5] [Citation(s) in RCA: 60] [Impact Index Per Article: 3.2] [Indexed: 11/27/2022]

Gouret P, Vitiello V, Balandraud N, Gilles A, Pontarotti P, Danchin EGJ. FIGENIX: intelligent automation of genomic annotation: expertise integration in a new software platform. BMC Bioinformatics 2005;6:198. [PMID: 16083500 PMCID: PMC1188056 DOI: 10.1186/1471-2105-6-198] [Citation(s) in RCA: 103] [Impact Index Per Article: 5.4] [Received: 04/21/2005] [Accepted: 08/05/2005] [Indexed: 11/24/2022]