1
Brůna T, Hoff KJ, Lomsadze A, Stanke M, Borodovsky M. BRAKER2: automatic eukaryotic genome annotation with GeneMark-EP+ and AUGUSTUS supported by a protein database. NAR Genom Bioinform 2021; 3:lqaa108. [PMID: 33575650 PMCID: PMC7787252 DOI: 10.1093/nargab/lqaa108] [Citation(s) in RCA: 510] [Impact Index Per Article: 170.0] [Received: 08/10/2020] [Revised: 11/26/2020] [Accepted: 12/20/2020] [Indexed: 12/13/2022]
Abstract
The task of eukaryotic genome annotation remains challenging. Only a few genomes could serve as standards of annotation achieved through a tremendous investment of human curation efforts. Still, the correctness of all alternative isoforms, even in the best-annotated genomes, could be a good subject for further investigation. The new BRAKER2 pipeline generates and integrates external protein support into the iterative process of training and gene prediction by GeneMark-EP+ and AUGUSTUS. BRAKER2 continues the line started by BRAKER1 where self-training GeneMark-ET and AUGUSTUS made gene predictions supported by transcriptomic data. Among the challenges addressed by the new pipeline was a generation of reliable hints to protein-coding exon boundaries from likely homologous but evolutionarily distant proteins. In comparison with other pipelines for eukaryotic genome annotation, BRAKER2 is fully automatic. It is favorably compared under equal conditions with other pipelines, e.g. MAKER2, in terms of accuracy and performance. Development of BRAKER2 should facilitate solving the task of harmonization of annotation of protein-coding genes in genomes of different eukaryotic species. However, we fully understand that several more innovations are needed in transcriptomic and proteomic technologies as well as in algorithmic development to reach the goal of highly accurate annotation of eukaryotic genomes.
Affiliation(s)
- Tomáš Brůna
  - School of Biological Sciences, Georgia Institute of Technology, Atlanta, GA 30332, USA
- Katharina J Hoff
  - Institute of Mathematics and Computer Science, University of Greifswald, 17489 Greifswald, Germany
  - Center for Functional Genomics of Microbes, University of Greifswald, 17489 Greifswald, Germany
- Alexandre Lomsadze
  - Wallace H. Coulter Department of Biomedical Engineering, Georgia Institute of Technology, Atlanta, GA 30332, USA
- Mario Stanke
  - Institute of Mathematics and Computer Science, University of Greifswald, 17489 Greifswald, Germany
  - Center for Functional Genomics of Microbes, University of Greifswald, 17489 Greifswald, Germany
- Mark Borodovsky
  - Wallace H. Coulter Department of Biomedical Engineering, Georgia Institute of Technology, Atlanta, GA 30332, USA
  - School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, GA 30332, USA
2
Yang A, Zhang W, Wang J, Yang K, Han Y, Zhang L. Review on the Application of Machine Learning Algorithms in the Sequence Data Mining of DNA. Front Bioeng Biotechnol 2020; 8:1032. [PMID: 33015010 PMCID: PMC7498545 DOI: 10.3389/fbioe.2020.01032] [Citation(s) in RCA: 35] [Impact Index Per Article: 8.8] [Received: 05/22/2020] [Accepted: 08/10/2020] [Indexed: 11/13/2022]
Abstract
Deoxyribonucleic acid (DNA) is a biological macromolecule. Its main function is information storage. At present, the advancement of sequencing technology has caused DNA sequence data to grow at an explosive rate, which has also pushed the study of DNA sequences into the wave of big data. Moreover, machine learning is a powerful technique for analyzing large-scale data and learns spontaneously to gain knowledge. It has been widely used in DNA sequence data analysis and obtained a lot of research achievements. Firstly, the review introduces the development process of sequencing technology and expounds on the concept of DNA sequence data structure and sequence similarity. Then we analyze the basic process of data mining, summarize several major machine learning algorithms, and put forward the challenges faced by machine learning algorithms in the mining of biological sequence data and possible solutions in the future. Then we review four typical applications of machine learning in DNA sequence data: DNA sequence alignment, DNA sequence classification, DNA sequence clustering, and DNA pattern mining. We analyze their corresponding biological application background and significance, and systematically summarize the development and potential problems in the field of DNA sequence data mining in recent years. Finally, we summarize the content of the review and look into the future of some research directions for the next step.
Affiliation(s)
- Aimin Yang
  - College of Science, North China University of Science and Technology, Tangshan, China
- Wei Zhang
  - College of Science, North China University of Science and Technology, Tangshan, China
- Jiahao Wang
  - College of Science, North China University of Science and Technology, Tangshan, China
- Ke Yang
  - College of Yi Sheng, North China University of Science and Technology, Tangshan, China
- Yang Han
  - College of Science, North China University of Science and Technology, Tangshan, China
- Limin Zhang
  - Mathematics and Computer Department, Hengshui University, Hengshui, China
3
Lomsadze A, Ter-Hovhannisyan V, Chernoff YO, Borodovsky M. Gene identification in novel eukaryotic genomes by self-training algorithm. Nucleic Acids Res 2005; 33:6494-506. [PMID: 16314312 PMCID: PMC1298918 DOI: 10.1093/nar/gki937] [Citation(s) in RCA: 546] [Impact Index Per Article: 28.7] [Received: 08/05/2005] [Revised: 10/12/2005] [Accepted: 10/12/2005] [Indexed: 11/25/2022]
Abstract
Finding new protein-coding genes is one of the most important goals of eukaryotic genome sequencing projects. However, genomic organization of novel eukaryotic genomes is diverse and ab initio gene finding tools tuned up for previously studied species are rarely suitable for efficacious gene hunting in DNA sequences of a new genome. Gene identification methods based on cDNA and expressed sequence tag (EST) mapping to genomic DNA or those using alignments to closely related genomes rely either on existence of abundant cDNA and EST data and/or availability on reference genomes. Conventional statistical ab initio methods require large training sets of validated genes for estimating gene model parameters. In practice, neither one of these types of data may be available in sufficient amount until rather late stages of the novel genome sequencing. Nevertheless, we have shown that gene finding in eukaryotic genomes could be carried out in parallel with statistical models estimation directly from yet anonymous genomic DNA. The suggested method of parallelization of gene prediction with the model parameters estimation follows the path of the iterative Viterbi training. Rounds of genomic sequence labeling into coding and non-coding regions are followed by the rounds of model parameters estimation. Several dynamically changing restrictions on the possible range of model parameters are added to filter out fluctuations in the initial steps of the algorithm that could redirect the iteration process away from the biologically relevant point in parameter space. Tests on well-studied eukaryotic genomes have shown that the new method performs comparably or better than conventional methods where the supervised model training precedes the gene prediction step. Several novel genomes have been analyzed and biologically interesting findings are discussed. 
Thus, a self-training algorithm that had been assumed feasible only for prokaryotic genomes has now been developed for ab initio eukaryotic gene identification.
Affiliation(s)
- Alexandre Lomsadze
  - School of Biology, Georgia Institute of Technology, Atlanta, GA 30332-0230, USA
- Yury O. Chernoff
  - School of Biology, Georgia Institute of Technology, Atlanta, GA 30332-0230, USA
- Mark Borodovsky
  - School of Biology, Georgia Institute of Technology, Atlanta, GA 30332-0230, USA
  - Department of Biomedical Engineering, Georgia Institute of Technology, Atlanta, GA 30332-0535, USA
4
Churbanov A, Pauley M, Quest D, Ali H. A method of precise mRNA/DNA homology-based gene structure prediction. BMC Bioinformatics 2005; 6:261. [PMID: 16242044 PMCID: PMC1274302 DOI: 10.1186/1471-2105-6-261] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Received: 03/08/2005] [Accepted: 10/21/2005] [Indexed: 11/29/2022]
Abstract
Background: Accurate and automatic gene finding and structural prediction is a common problem in bioinformatics, and applications need to be capable of handling non-canonical splice sites, micro-exons and partial gene structure predictions that span across several genomic clones. Results: We present a mRNA/DNA homology based gene structure prediction tool, GIGOgene. We use a new affine gap penalty splice-enhanced global alignment algorithm running in linear memory for a high quality annotation of splice sites. Our tool includes a novel algorithm to assemble partial gene structure predictions using interval graphs. GIGOgene exhibited a sensitivity of 99.08% and a specificity of 99.98% on the Genie learning set, and demonstrated a higher quality of gene structural prediction when compared to Sim4, est2genome, Spidey, Galahad and BLAT, including when genes contained micro-exons and non-canonical splice sites. GIGOgene showed an acceptable loss of prediction quality when confronted with a noisy Genie learning set simulating ESTs. Conclusion: GIGOgene shows a higher quality of gene structure prediction for mRNA/DNA spliced alignment when compared to other available tools.
Affiliation(s)
- Alexander Churbanov
  - Department of Computer Science, College of Information Science and Technology, University of Nebraska at Omaha, Omaha, NE 68182-0116, USA
- Mark Pauley
  - Department of Computer Science, College of Information Science and Technology, University of Nebraska at Omaha, Omaha, NE 68182-0116, USA
- Daniel Quest
  - Department of Computer Science, College of Information Science and Technology, University of Nebraska at Omaha, Omaha, NE 68182-0116, USA
- Hesham Ali
  - Department of Computer Science, College of Information Science and Technology, University of Nebraska at Omaha, Omaha, NE 68182-0116, USA
5
Milanesi L, Rogozin IB. ESTMAP: a system for expressed sequence tags mapping on genomic sequences. IEEE Trans Nanobioscience 2004; 2:75-8. [PMID: 15382662 DOI: 10.1109/tnb.2003.813928] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 11/07/2022]
Abstract
The completion of a number of large genome sequencing projects emphasizes the importance of protein-coding gene predictions. Most of the problems associated with gene prediction are caused by the complex exon-intron structures commonly found in eukaryotic genomes. However, information from homologous sequences can significantly improve the accuracy of the prediction. In particular, expressed sequence tags (ESTs) are very useful for this purpose, since currently existing EST collections are very large. We developed an ESTMAP system, which utilizes homology searches against a database of repetitive elements using the RepeatView program and the EST Division of GenBank using the BLASTN program. ESTMAP extracts "exact" matches with EST sequences (> 95% of homology) from BLASTN output file and predicts introns in DNA comparing ESTs and a query sequence. ESTMAP is implemented as a part of the WebGene system (http://www.cnr.it/webgene).
6
Mathé C, Sagot MF, Schiex T, Rouzé P. Current methods of gene prediction, their strengths and weaknesses. Nucleic Acids Res 2002; 30:4103-17. [PMID: 12364589 PMCID: PMC140543 DOI: 10.1093/nar/gkf543] [Citation(s) in RCA: 209] [Impact Index Per Article: 9.5] [Received: 05/28/2002] [Revised: 08/07/2002] [Accepted: 08/07/2002] [Indexed: 11/14/2022]
Abstract
While the genomes of many organisms have been sequenced over the last few years, transforming such raw sequence data into knowledge remains a hard task. A great number of prediction programs have been developed that try to address one part of this problem, which consists of locating the genes along a genome. This paper reviews the existing approaches to predicting genes in eukaryotic genomes and underlines their intrinsic advantages and limitations. The main mathematical models and computational algorithms adopted are also briefly described and the resulting software classified according to both the method and the type of evidence used. Finally, the several difficulties and pitfalls encountered by the programs are detailed, showing that improvements are needed and that new directions must be considered.
Affiliation(s)
- Catherine Mathé
  - Institut de Pharmacologie et Biologie Structurale, UMR 5089, 205 route de Narbonne, F-31077 Toulouse Cedex, France.
7
Migeon BR, Chowdhury AK, Dunston JA, McIntosh I. Identification of TSIX, encoding an RNA antisense to human XIST, reveals differences from its murine counterpart: implications for X inactivation. Am J Hum Genet 2001; 69:951-60. [PMID: 11555794 PMCID: PMC1274371 DOI: 10.1086/324022] [Citation(s) in RCA: 101] [Impact Index Per Article: 4.4] [Received: 08/08/2001] [Accepted: 08/27/2001] [Indexed: 11/03/2022]
Abstract
X inactivation is the mammalian method for X-chromosome dosage compensation, but some features of this developmental process vary among mammals. Such species variations provide insights into the essential components of the pathway. Tsix encodes a transcript antisense to the murine Xist transcript and is expressed in the mouse embryo only during the initial stages of X inactivation; it has been shown to play a role in imprinted X inactivation in the mouse placenta. We have identified its counterpart within the human X inactivation center (XIC). Human TSIX produces a >30-kb transcript that is expressed only in cells of fetal origin; it is expressed from human XIC transgenes in mouse embryonic stem cells and from human embryoid-body-derived cells, but not from human adult somatic cells. Differences in the structure of human and murine genes indicate that human TSIX was truncated during evolution. These differences could explain the fact that X inactivation is not imprinted in human placenta, and they raise questions about the role of TSIX in random X inactivation.
MESH Headings
- Aging/genetics
- Animals
- Cell Line
- Dosage Compensation, Genetic
- Embryo, Mammalian/metabolism
- Evolution, Molecular
- Fetus/metabolism
- Genomic Imprinting/genetics
- Humans
- Mice
- Molecular Sequence Data
- Open Reading Frames/genetics
- Placenta/metabolism
- RNA, Antisense/analysis
- RNA, Antisense/biosynthesis
- RNA, Antisense/genetics
- RNA, Antisense/isolation & purification
- RNA, Long Noncoding
- RNA, Untranslated/analysis
- RNA, Untranslated/biosynthesis
- RNA, Untranslated/genetics
- RNA, Untranslated/isolation & purification
- Sequence Deletion/genetics
- Sequence Homology, Nucleic Acid
- Species Specificity
- Stem Cells/metabolism
- Transcription Factors/genetics
- Transcription Initiation Site
- Transcription, Genetic
- Transgenes/genetics
Affiliation(s)
- B R Migeon
  - McKusick-Nathans Institute of Genetic Medicine and Department of Pediatrics, The Johns Hopkins University School of Medicine, Baltimore, MD, 21287, USA.