1
Chowdhury B, Garai A, Garai G. An optimized approach for annotation of large eukaryotic genomic sequences using genetic algorithm. BMC Bioinformatics 2017; 18:460. [PMID: 29065853 PMCID: PMC5655831 DOI: 10.1186/s12859-017-1874-7] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 06/14/2017] [Accepted: 10/17/2017] [Indexed: 01/06/2023] Open
Abstract
BACKGROUND Detecting important functional and/or structural elements in a large eukaryotic genomic sequence and identifying their positions is an active research area. The gene is an important functional and structural unit of DNA. Computational gene prediction is therefore essential for detailed genome annotation. RESULTS In this paper, we propose a new gene prediction technique based on Genetic Algorithm (GA) to determine the optimal positions of exons of a gene in a chromosome or genome. The correct identification of the coding and non-coding regions is difficult and computationally demanding. The proposed genetic-based method, named Gene Prediction with Genetic Algorithm (GPGA), reduces this problem by searching only one exon at a time instead of all exons together with their introns. This representation carries a significant advantage in that it breaks the entire gene-finding problem into a number of smaller sub-problems, thereby reducing the computational complexity. We tested the performance of the GPGA with existing benchmark datasets and compared the results with well-known and relevant techniques. The comparison shows that the proposed method performs better than or comparably to these techniques. We also used GPGA for annotating the human chromosome 21 (HS21) using cross-species comparisons with the mouse orthologs. CONCLUSION It was noted that the GPGA predicted true genes with better accuracy than other well-known approaches.
Affiliation(s)
- Biswanath Chowdhury
- Department of Biophysics, Molecular Biology and Bioinformatics, University of Calcutta, Kolkata, 700009 WB India
- Arnav Garai
- Unit of Energy, Utilities, Communications and Services, Infosys Technologies Ltd., Bhubaneswar, 751024 Odisha India
- Gautam Garai
- Computational Sciences Division, Saha Institute of Nuclear Physics, Kolkata, 700064 WB India
2
Santos A, Tsafou K, Stolte C, Pletscher-Frankild S, O’Donoghue SI, Jensen LJ. Comprehensive comparison of large-scale tissue expression datasets. PeerJ 2015; 3:e1054. [PMID: 26157623 PMCID: PMC4493645 DOI: 10.7717/peerj.1054] [Citation(s) in RCA: 60] [Impact Index Per Article: 6.7] [Received: 02/16/2015] [Accepted: 06/04/2015] [Indexed: 01/01/2023] Open
Abstract
For tissues to carry out their functions, they rely on the right proteins to be present. Several high-throughput technologies have been used to map out which proteins are expressed in which tissues; however, the data have not previously been systematically compared and integrated. We present a comprehensive evaluation of tissue expression data from a variety of experimental techniques and show that these agree surprisingly well with each other and with results from literature curation and text mining. We further found that most datasets support the assumed but not demonstrated distinction between tissue-specific and ubiquitous expression. By developing comparable confidence scores for all types of evidence, we show that it is possible to improve both quality and coverage by combining the datasets. To facilitate use and visualization of our work, we have developed the TISSUES resource (http://tissues.jensenlab.org), which makes all the scored and integrated data available through a single user-friendly web interface.
Affiliation(s)
- Alberto Santos
- Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark
- Kalliopi Tsafou
- Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark
- Christian Stolte
- Commonwealth Scientific and Industrial Research Organisation (CSIRO), Sydney, Australia
- Sune Pletscher-Frankild
- Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark
- Seán I. O’Donoghue
- Commonwealth Scientific and Industrial Research Organisation (CSIRO), Sydney, Australia
- Garvan Institute of Medical Research, Sydney, Australia
- Lars Juhl Jensen
- Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark
3
Abstract
As the genomics era matures, the availability of complete microbial genome sequences is facilitating computational approaches to understand bacterial genomes and DNA structure/function relationships. From the genome of pathogens, we can derive invaluable information on potential targets for new antimicrobial agents. Advancements in high-throughput 'omics' technologies and the availability of multiple isolates of the same species have significantly changed the time frame and scope for identifying novel therapeutic targets. This article aims to discuss selected aspects of the bacterial genome, and advocates 'omics'-based techniques to advance the discovery of new therapeutic targets against extracellular bacterial pathogens.
Affiliation(s)
- Nagathihalli S Nagaraj
- Department of Surgery, Vanderbilt University School of Medicine, Nashville, TN 37232, USA.
4
Chowdhary BP, Raudsepp T. The horse genome derby: racing from map to whole genome sequence. Chromosome Res 2008; 16:109-27. [PMID: 18274866 DOI: 10.1007/s10577-008-1204-z] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.5] [Indexed: 01/21/2023]
Abstract
The map of the horse genome has undergone unprecedented expansion during the past six years. Beginning from a modest collection of approximately 300 mapped markers scattered on the 31 pairs of autosomes and the X chromosome in 2001, today the horse genome is among the best-mapped in domestic animals. Presently, high-resolution linearly ordered gene maps are available for all autosomes as well as the X and the Y chromosome. The approximately 4350 mapped markers distributed over the approximately 2.68 Gbp long equine genome provide on average 1 marker every 620 kb. Among the most remarkable developments in equine genome analysis is the availability of the assembled sequence (EquCab2) of the female horse genome and the generation of approximately 1.5 million single nucleotide polymorphisms (SNPs) from diverse breeds. This has triggered the creation of new tools and resources like the 60K SNP-chip and whole genome expression microarrays that hold promise to study the equine genome and transcriptome in ways not previously envisaged. As a result of these developments it is anticipated that, during coming years, the genetics underlying important monogenic traits will be analyzed with improved accuracy and speed. Of larger interest will be the prospects of dissecting the genetic component of various complex/multigenic traits that are of vital significance for equine health and welfare. The number of investigations recently initiated to study a multitude of such traits holds promise for improved diagnostics, prevention and therapeutic approaches for horses.
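The marker-density figure quoted in this abstract can be checked with a short calculation. The sketch below uses only the approximate values stated in the abstract (about 2.68 Gbp and about 4350 markers); the function name is illustrative and does not come from the cited work.

```python
def mean_marker_spacing_kb(genome_size_bp: float, n_markers: int) -> float:
    """Average distance between mapped markers, in kilobases."""
    if n_markers <= 0:
        raise ValueError("n_markers must be positive")
    return genome_size_bp / n_markers / 1_000


# Approximate values taken from the abstract.
spacing = mean_marker_spacing_kb(2.68e9, 4350)
print(f"about 1 marker every {spacing:.0f} kb")
```

With these inputs the result is roughly 616 kb, which is consistent with the rounded value of 620 kb given in the abstract.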
Affiliation(s)
- Bhanu P Chowdhary
- Department of Veterinary Integrative Biosciences, Texas A&M University, College Station, TX, 77843-4458, USA.
5
Salzburger W, Renn SCP, Steinke D, Braasch I, Hofmann HA, Meyer A. Annotation of expressed sequence tags for the East African cichlid fish Astatotilapia burtoni and evolutionary analyses of cichlid ORFs. BMC Genomics 2008; 9:96. [PMID: 18298844 PMCID: PMC2279125 DOI: 10.1186/1471-2164-9-96] [Citation(s) in RCA: 47] [Impact Index Per Article: 2.9] [Received: 10/11/2007] [Accepted: 02/25/2008] [Indexed: 11/13/2022] Open
Abstract
Background The cichlid fishes in general, and the exceptionally diverse East African haplochromine cichlids in particular, are famous examples of adaptive radiation and explosive speciation. Here we report the collection and annotation of more than 12,000 expressed sequence tags (ESTs) generated from three different cDNA libraries obtained from the East African haplochromine cichlid species Astatotilapia burtoni and Metriaclima zebra. Results We first annotated more than 12,000 newly generated cichlid ESTs using the Gene Ontology classification system. For evolutionary analyses, we combined these ESTs with all available sequence data for haplochromine cichlids, which resulted in a total of more than 45,000 ESTs. The ESTs represent a broad range of molecular functions and biological processes. We compared the haplochromine ESTs to sequence data available for other fish model systems such as pufferfish (Takifugu rubripes and Tetraodon nigroviridis), trout, and zebrafish. We characterized genes that show a faster or slower rate of base substitutions in haplochromine cichlids compared to other fish species, as this is indicative of a relaxed or reinforced selection regime. Four of these genes showed the signature of positive selection as revealed by calculating Ka/Ks ratios. Conclusion About 22% of the surveyed ESTs were found to have cichlid-specific rate differences, suggesting that these genes might play a role in lineage-specific characteristics of cichlids. We also conclude that the four genes with a Ka/Ks ratio greater than one appear as good candidate genes for further work on the genetic basis of evolutionary success of haplochromine cichlid fishes.
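The Ka/Ks criterion mentioned in this abstract can be illustrated with a minimal sketch. Ka/Ks is the ratio of the nonsynonymous to the synonymous substitution rate, and a ratio above one is conventionally read as a signature of positive selection. The code below assumes Ka and Ks have already been estimated by a dedicated method (not shown); the function names, the tolerance band around one, and the example values are all hypothetical and are not taken from the cited study.

```python
def ka_ks_ratio(ka: float, ks: float) -> float:
    """Ratio of nonsynonymous (Ka) to synonymous (Ks) substitution rates."""
    if ks <= 0:
        raise ValueError("Ks must be positive to form a ratio")
    return ka / ks


def classify_selection(ka: float, ks: float, tolerance: float = 0.05) -> str:
    """Conventional reading of Ka/Ks; the tolerance band is an assumption."""
    ratio = ka_ks_ratio(ka, ks)
    if ratio > 1 + tolerance:
        return "positive"   # excess of nonsynonymous changes
    if ratio < 1 - tolerance:
        return "purifying"  # nonsynonymous changes are selected against
    return "neutral"


# Hypothetical rate estimates for three genes.
for name, ka, ks in [("geneA", 0.30, 0.10), ("geneB", 0.01, 0.20), ("geneC", 0.10, 0.10)]:
    print(name, classify_selection(ka, ks))
```

In practice a ratio above one would also be tested for statistical significance before a gene is called positively selected; that step is outside this sketch.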
Affiliation(s)
- Walter Salzburger
- Lehrstuhl für Zoologie und Evolutionsbiologie, Department of Biology, University of Konstanz, 78467 Konstanz, Germany.
6
Bryson K, Loux V, Bossy R, Nicolas P, Chaillou S, van de Guchte M, Penaud S, Maguin E, Hoebeke M, Bessières P, Gibrat JF. AGMIAL: implementing an annotation strategy for prokaryote genomes as a distributed system. Nucleic Acids Res 2006; 34:3533-45. [PMID: 16855290 PMCID: PMC1524909 DOI: 10.1093/nar/gkl471] [Citation(s) in RCA: 80] [Impact Index Per Article: 4.4] [Indexed: 11/23/2022] Open
Abstract
We have implemented a genome annotation system for prokaryotes called AGMIAL. Our approach embodies a number of key principles. First, expert manual annotators are seen as a critical component of the overall system; user interfaces were cyclically refined to satisfy their needs. Second, the overall process should be orchestrated in terms of a global annotation strategy; this facilitates coordination between a team of annotators and automatic data analysis. Third, the annotation strategy should allow progressive and incremental annotation from a time when only a few draft contigs are available, to when a final finished assembly is produced. The overall architecture employed is modular and extensible, being based on the W3 standard Web services framework. Specialized modules interact with two independent core modules that are used to annotate, respectively, genomic and protein sequences. AGMIAL is currently being used by several INRA laboratories to analyze genomes of bacteria relevant to the food-processing industry, and is distributed under an open source license.
Affiliation(s)
- S. Chaillou
- Flore Lactique et Environnement Carné, INRA, 78352 Jouy-en-Josas Cedex, France
- S. Penaud
- Génétique Microbienne, INRA, 78352 Jouy-en-Josas Cedex, France
- E. Maguin
- Génétique Microbienne, INRA, 78352 Jouy-en-Josas Cedex, France
- J-F Gibrat
- To whom correspondence should be addressed. Tel: +33 1 34 65 28 97; Fax: +33 1 34 65 29 01; E-mail:
7
Vinayagam A, del Val C, Schubert F, Eils R, Glatting KH, Suhai S, König R. GOPET: a tool for automated predictions of Gene Ontology terms. BMC Bioinformatics 2006; 7:161. [PMID: 16549020 PMCID: PMC1434778 DOI: 10.1186/1471-2105-7-161] [Citation(s) in RCA: 41] [Impact Index Per Article: 2.3] [Received: 10/07/2005] [Accepted: 03/20/2006] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Vast progress in sequencing projects has called for annotation on a large scale. A number of methods have been developed to address this challenging task. These methods, however, either apply to specific subsets, or their predictions are not formalised, or they do not provide precise confidence values for their predictions. DESCRIPTION We recently established a learning system for automated annotation, trained with a broad variety of different organisms to predict the standardised annotation terms from Gene Ontology (GO). Now, this method has been made available to the public via our web-service GOPET (Gene Ontology term Prediction and Evaluation Tool). It supplies annotation for sequences of any organism. For each predicted term an appropriate confidence value is provided. The basic method was originally developed for predicting molecular function GO-terms. It is now expanded to predict biological process terms. This web service is available via http://genius.embnet.dkfz-heidelberg.de/menu/biounit/open-husar CONCLUSION Our web service gives experimental researchers as well as the bioinformatics community a valuable sequence annotation device. Additionally, GOPET also provides less significant annotation data which may serve as an extended discovery platform for the user.
Affiliation(s)
- Arunachalam Vinayagam
- Department of Molecular Biophysics, German Cancer Research Center (DKFZ), Im Neuenheimer Feld 580, 69121 Heidelberg, Germany
- Coral del Val
- Department of Molecular Biophysics, German Cancer Research Center (DKFZ), Im Neuenheimer Feld 580, 69121 Heidelberg, Germany
- Falk Schubert
- Division of Theoretical Bioinformatics, German Cancer Research Center (DKFZ), Im Neuenheimer Feld 580, 69121 Heidelberg, Germany
- Roland Eils
- Division of Theoretical Bioinformatics, German Cancer Research Center (DKFZ), Im Neuenheimer Feld 580, 69121 Heidelberg, Germany
- Department of Bioinformatics and Functional Genomics, Institute for Pharmacy and Molecular Biotechnology, University of Heidelberg, 69120 Heidelberg, Germany
- Karl-Heinz Glatting
- Department of Molecular Biophysics, German Cancer Research Center (DKFZ), Im Neuenheimer Feld 580, 69121 Heidelberg, Germany
- Sándor Suhai
- Department of Molecular Biophysics, German Cancer Research Center (DKFZ), Im Neuenheimer Feld 580, 69121 Heidelberg, Germany
- Rainer König
- Division of Theoretical Bioinformatics, German Cancer Research Center (DKFZ), Im Neuenheimer Feld 580, 69121 Heidelberg, Germany
- Department of Bioinformatics and Functional Genomics, Institute for Pharmacy and Molecular Biotechnology, University of Heidelberg, 69120 Heidelberg, Germany
8
Valencia A. Automatic annotation of protein function. Curr Opin Struct Biol 2005; 15:267-74. [PMID: 15922590 DOI: 10.1016/j.sbi.2005.05.010] [Citation(s) in RCA: 85] [Impact Index Per Article: 4.5] [Received: 04/03/2005] [Revised: 04/29/2005] [Accepted: 05/10/2005] [Indexed: 11/22/2022]
Abstract
The annotation of protein function at genomic scale is essential for day-to-day work in biology and for any systematic approach to the modeling of biological systems. Currently, functional annotation is essentially based on the expansion of the relatively small number of experimentally determined functions to large collections of proteins. The task of systematic annotation faces formidable practical problems related to the accuracy of the input experimental information, the reliability of current systems for transferring information between related sequences, and the reproducibility of the links between database information and the original experiments reported in publications. These technical difficulties merely lie on the surface of the deeper problem of the evolution of protein function in the context of protein sequences and structures. Given the mixture of technical and scientific challenges, it is not surprising that errors are introduced, and expanded, in database annotations. In this situation, a more realistic option is the development of a reliability index for database annotations, instead of depending exclusively on efforts to correct databases. Several groups have attempted to compare the database annotations of similar proteins, which constitutes the first steps toward the calibration of the relationship between sequence and annotation space.
Affiliation(s)
- Alfonso Valencia
- Protein Design Group, National Center for Biotechnology, CNB-CSIC, Darwin 3, Cantoblanco, 28049 Madrid, Spain.
9
Vinayagam A, König R, Moormann J, Schubert F, Eils R, Glatting KH, Suhai S. Applying Support Vector Machines for Gene Ontology based gene function prediction. BMC Bioinformatics 2004; 5:116. [PMID: 15333146 PMCID: PMC517617 DOI: 10.1186/1471-2105-5-116] [Citation(s) in RCA: 55] [Impact Index Per Article: 2.8] [Received: 05/11/2004] [Accepted: 08/26/2004] [Indexed: 11/23/2022] Open
Abstract
Background The current progress in sequencing projects calls for rapid, reliable and accurate function assignments of gene products. A variety of methods has been designed to annotate sequences on a large scale. However, these methods can either only be applied for specific subsets, or their results are not formalised, or they do not provide precise confidence estimates for their predictions. Results We have developed a large-scale annotation system that tackles all of these shortcomings. In our approach, annotation was provided through Gene Ontology terms by applying multiple Support Vector Machines (SVM) for the classification of correct and false predictions. The general performance of the system was benchmarked with a large dataset. An organism-wise cross-validation was performed to define confidence estimates, resulting in an average precision of 80% for 74% of all test sequences. The validation results show that the prediction performance was organism-independent and could reproduce the annotation of other automated systems as well as high-quality manual annotations. We applied our trained classification system to Xenopus laevis sequences, yielding functional annotation for more than half of the known expressed genome. Compared to the currently available annotation, we provided more than twice the number of contigs with good quality annotation, and additionally we assigned a confidence value to each predicted GO term. Conclusions We present a complete automated annotation system that overcomes many of the usual problems by applying a controlled vocabulary of Gene Ontology and an established classification method on large and well-described sequence data sets. In a case study, the function for Xenopus laevis contig sequences was predicted and the results are publicly available at .
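The figure of "an average precision of 80% for 74% of all test sequences" pairs a precision with a coverage at some confidence cut-off. The sketch below shows how such a pair can be computed in general. It is not the authors' SVM pipeline: it assumes, for simplicity, one scored prediction per test sequence, and the function name, threshold and toy data are hypothetical.

```python
from typing import Sequence, Tuple


def precision_and_coverage(
    predictions: Sequence[Tuple[float, bool]], threshold: float
) -> Tuple[float, float]:
    """Return (precision, coverage) for predictions at or above `threshold`.

    `predictions` holds one (confidence, is_correct) pair per test sequence.
    Coverage is the fraction of sequences that keep their prediction;
    precision is the fraction of kept predictions that are correct.
    """
    if not predictions:
        raise ValueError("predictions must not be empty")
    accepted = [correct for confidence, correct in predictions if confidence >= threshold]
    coverage = len(accepted) / len(predictions)
    precision = sum(accepted) / len(accepted) if accepted else 0.0
    return precision, coverage


# Hypothetical toy data: (confidence, prediction was correct).
toy = [(0.95, True), (0.90, True), (0.80, False), (0.75, True), (0.40, False)]
precision, coverage = precision_and_coverage(toy, threshold=0.7)
print(f"precision={precision:.2f} coverage={coverage:.2f}")
```

Raising the threshold generally trades coverage for precision, which is why the two numbers are reported together.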
Affiliation(s)
- Arunachalam Vinayagam
- Department of Molecular Biophysics, Deutsches Krebsforschungszentrum (DKFZ), TP3, Im Neuenheimer Feld 580, Heidelberg, D-69120, Germany
- Rainer König
- Theoretical Bioinformatics, Deutsches Krebsforschungszentrum (DKFZ), TP3, Im Neuenheimer Feld 580, Heidelberg, D-69120, Germany
- Jutta Moormann
- Department of Molecular Biophysics, Deutsches Krebsforschungszentrum (DKFZ), TP3, Im Neuenheimer Feld 580, Heidelberg, D-69120, Germany
- Institut für Medizinische Biometrie, Epidemiologie und Informatik (IMBEI), Johannes Gutenberg-Universität Mainz, 55101, Mainz, Germany
- Falk Schubert
- Theoretical Bioinformatics, Deutsches Krebsforschungszentrum (DKFZ), TP3, Im Neuenheimer Feld 580, Heidelberg, D-69120, Germany
- Roland Eils
- Theoretical Bioinformatics, Deutsches Krebsforschungszentrum (DKFZ), TP3, Im Neuenheimer Feld 580, Heidelberg, D-69120, Germany
- Karl-Heinz Glatting
- Department of Molecular Biophysics, Deutsches Krebsforschungszentrum (DKFZ), TP3, Im Neuenheimer Feld 580, Heidelberg, D-69120, Germany
- Sándor Suhai
- Department of Molecular Biophysics, Deutsches Krebsforschungszentrum (DKFZ), TP3, Im Neuenheimer Feld 580, Heidelberg, D-69120, Germany
10
Close J, Game L, Clark B, Bergounioux J, Gerovassili A, Thein SL. Genome annotation of a 1.5 Mb region of human chromosome 6q23 encompassing a quantitative trait locus for fetal hemoglobin expression in adults. BMC Genomics 2004; 5:33. [PMID: 15169551 PMCID: PMC441375 DOI: 10.1186/1471-2164-5-33] [Citation(s) in RCA: 42] [Impact Index Per Article: 2.1] [Received: 02/13/2004] [Accepted: 05/31/2004] [Indexed: 12/24/2022] Open
Abstract
Background Heterocellular hereditary persistence of fetal hemoglobin (HPFH) is a common multifactorial trait characterized by a modest increase of fetal hemoglobin levels in adults. We previously localized a Quantitative Trait Locus for HPFH in an extensive Asian-Indian kindred to chromosome 6q23. As part of the strategy of positional cloning and a means towards identification of the specific genetic alteration in this family, a thorough annotation of the candidate interval based on a strategy of in silico / wet biology approach with comparative genomics was conducted. Results The ~1.5 Mb candidate region was shown to contain five protein-coding genes. We discovered a very large uncharacterized gene containing WD40 and SH3 domains (AHI1), and extended the annotation of four previously characterized genes (MYB, ALDH8A1, HBS1L and PDE7B). We also identified several genes that do not appear to be protein coding, and generated 17 kb of novel transcript sequence data from re-sequencing 97 EST clones. Conclusion Detailed and thorough annotation of this 1.5 Mb interval in 6q confirms a high level of aberrant transcripts in testicular tissue. The candidate interval was shown to exhibit an extraordinary level of alternate splicing – 19 transcripts were identified for the 5 protein coding genes, but it appears that a significant portion (14/19) of these alternate transcripts did not have an open reading frame, hence their functional role is questionable. These transcripts may result from aberrant rather than regulated splicing.
Affiliation(s)
- James Close
- Department of Haematological Medicine, GKT School of Medicine, King's Denmark Hill Campus, Bessemer Road, London, SE5 9PJ, UK
- SANE POWIC, Warneford Hospital, Department of Psychiatry, University of Oxford, Oxford, OX3 7JX, UK
- Laurence Game
- Department of Haematological Medicine, GKT School of Medicine, King's Denmark Hill Campus, Bessemer Road, London, SE5 9PJ, UK
- CSC-IC Microarray Centre, 2nd floor, L-block, Room 221, Imperial College Faculty of Medicine, Hammersmith Hospital Campus, Du Cane Road, London, W12 0NN, UK
- Barnaby Clark
- Department of Haematological Medicine, GKT School of Medicine, King's Denmark Hill Campus, Bessemer Road, London, SE5 9PJ, UK
- Jean Bergounioux
- Department of Haematological Medicine, GKT School of Medicine, King's Denmark Hill Campus, Bessemer Road, London, SE5 9PJ, UK
- Unité de soins intensif pédiatrique, Hôpital Universitaire Krémlin Bicêtre, 63 av. Gabriel Péri, 94270 Le Krémlin Bicêtre, France
- Ageliki Gerovassili
- Department of Haematological Medicine, GKT School of Medicine, King's Denmark Hill Campus, Bessemer Road, London, SE5 9PJ, UK
- Swee Lay Thein
- Department of Haematological Medicine, GKT School of Medicine, King's Denmark Hill Campus, Bessemer Road, London, SE5 9PJ, UK
11
Liu C, Bonner TI, Nguyen T, Lyons JL, Christian SL, Gershon ES. DNannotator: Annotation software tool kit for regional genomic sequences. Nucleic Acids Res 2003; 31:3729-35. [PMID: 12824405 PMCID: PMC168949 DOI: 10.1093/nar/gkg542] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.2] [Indexed: 11/12/2022] Open
Abstract
Sequence annotation is essential for genomics-based research. Investigators of a specific genomic region who have developed abundant local discoveries such as genes and genetic markers, or have collected annotations from multiple resources, can be overwhelmed by the difficulty in creating local annotation and the complexity of integrating all the annotations. Presenting such integrated data in a form suitable for data mining and high-throughput experimental design is even more daunting. DNannotator, a web application, was designed to perform batch annotation on a sizeable genomic region. It takes annotation source data, such as SNPs, genes, primers, and so on, prepared by the end-user and/or a specified target of genomic DNA, and performs de novo annotation. DNannotator can also robustly migrate existing annotations in GenBank format from one sequence to another. Annotation results are provided in GenBank format and in tab-delimited text, which can be imported and managed in a database or spreadsheet and combined with existing annotation as desired. Graphic viewers, such as Genome Browser or Artemis, can display the annotation results. Reference data (reports on the process) facilitating the user's evaluation of annotation quality are optionally provided. DNannotator can be accessed at http://sky.bsd.uchicago.edu/DNannotator.htm.
Affiliation(s)
- Chunyu Liu
- Department of Psychiatry, University of Chicago, Chicago, IL, USA.
12
Chuang TJ, Lin WC, Lee HC, Wang CW, Hsiao KL, Wang ZH, Shieh D, Lin SC, Ch'ang LY. A complexity reduction algorithm for analysis and annotation of large genomic sequences. Genome Res 2003; 13:313-22. [PMID: 12566410 PMCID: PMC420370 DOI: 10.1101/gr.313703] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.5] [Indexed: 11/24/2022]
Abstract
DNA is a universal language encrypted with biological instruction for life. In higher organisms, the genetic information is preserved predominantly in an organized exon/intron structure. When a gene is expressed, the exons are spliced together to form the transcript for protein synthesis. We have developed a complexity reduction algorithm for sequence analysis (CRASA) that enables direct alignment of cDNA sequences to the genome. This method features a progressive data structure in hierarchical orders to facilitate a fast and efficient search mechanism. CRASA implementation was tested with already annotated genomic sequences in two benchmark data sets and compared with 15 annotation programs (10 ab initio and 5 homology-based approaches) against the EST database. By the use of layered noise filters, the complexity of CRASA-matched data was reduced exponentially. The results from the benchmark tests showed that CRASA annotation excelled in both the sensitivity and specificity categories. When CRASA was applied to the analysis of human Chromosomes 21 and 22, an additional 83 potential genes were identified. With its large-scale processing capability, CRASA can be used as a robust tool for genome annotation with high accuracy by matching the EST sequences precisely to the genomic sequences.
MESH Headings
- Algorithms
- Chromosomes, Human, Pair 21/genetics
- Chromosomes, Human, Pair 22/genetics
- DNA/analysis
- DNA/genetics
- DNA, Complementary/analysis
- DNA, Complementary/genetics
- Exons/genetics
- Expressed Sequence Tags
- Genes/genetics
- Genome, Human
- Humans
- Pseudogenes/genetics
- Reproducibility of Results
- Sensitivity and Specificity
- Sequence Alignment/methods
- Sequence Analysis, DNA/methods
- Sequence Analysis, DNA/trends
- Sequence Homology, Nucleic Acid
Affiliation(s)
- Trees-Juen Chuang
- Bioinformatics Research Center, Institute of Biomedical Sciences, Academia Sinica, Taipei 11529, Taiwan
13
The Kleisli Query System as a Backbone for Bioinformatics Data Integration and Analysis. Bioinformatics 2003. [DOI: 10.1016/b978-155860829-0/50008-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] Open
14
Médigue C, Bocs S, Labarre L, Mathé C, Vallenet D. L’annotation in silico des séquences génomiques. Med Sci (Paris) 2002. [DOI: 10.1051/medsci/2002182237] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.1] [Indexed: 11/14/2022] Open
15
Abstract
The advent of whole-genome data resources--not only sequence but also other genome-scale data collections such as gene expression, protein interaction, and genetic variation--is having two marked, complementary effects on the relatively new discipline of bioinformatics. First, the veritable flood of data is creating a need and demand for new tools for dealing adequately with the deluge, and, second, the unprecedented extent, diversity, and impending completeness of the data sets are creating opportunities for new approaches to discovery based on computational methods.
Affiliation(s)
- D B Searls
- Bioinformatics Department, SmithKline Beecham Pharmaceuticals, King of Prussia, Pennsylvania 19406, USA.
16
Ruault M, Brun ME, Ventura M, Roizès G, De Sario A. MLL3, a new human member of the TRX/MLL gene family, maps to 7q36, a chromosome region frequently deleted in myeloid leukaemia. Gene 2002; 284:73-81. [PMID: 11891048 DOI: 10.1016/s0378-1119(02)00392-x] [Citation(s) in RCA: 92] [Impact Index Per Article: 4.2] [Indexed: 10/27/2022]
Abstract
We characterized MLL3, a new human member of the TRX/MLL gene family. MLL3 is expressed in peripheral blood, placenta, pancreas, testes, and foetal thymus and is weakly expressed in heart, brain, lung, liver, and kidney. It encodes a predicted protein of 4911 amino acids containing two plant homeo domains (PHD), an ATPase alpha_beta signature, a high mobility group, a SET (Suppressor of variegation, Enhancer of zeste, Trithorax) and two FY (phenylalanine tyrosine)-rich domains. The amino acid sequence of the SET domain was used to obtain a phylogenetic tree of human MLL genes and their homologues in different species. MLL3 is closely related to human MLL2, Fugu mll2, a Caenorhabditis elegans predicted protein, and Drosophila trithorax-related protein. Interestingly, PHD and SET domains are frequently found in proteins encoded by genes that are rearranged in different haematological malignancies and MLL3 maps to 7q36, a chromosome region that is frequently deleted in myeloid disorders. Partial duplications of the MLL3 gene are found in the juxtacentromeric region of chromosomes 1, 2, 13, and 21.
Affiliation(s)
- Myriam Ruault
- Institut de Génétique Humaine, CNRS UPR 1142, 141, rue de la Cardonille, 34396, Montpellier, France
|
17
|
Bocs S, Danchin A, Médigue C. Re-annotation of genome microbial coding-sequences: finding new genes and inaccurately annotated genes. BMC Bioinformatics 2002; 3:5. [PMID: 11879526 PMCID: PMC77393 DOI: 10.1186/1471-2105-3-5] [Citation(s) in RCA: 33] [Impact Index Per Article: 1.5] [Received: 09/18/2001] [Accepted: 02/05/2002] [Indexed: 11/21/2022] Open
Abstract
BACKGROUND Analysis of any newly sequenced bacterial genome starts with the identification of protein-coding genes. Despite the accumulation of multiple complete genome sequences, which provide useful comparisons with close relatives among other organisms during the annotation process, accurate gene prediction remains quite difficult. A major reason for this situation is that genes are tightly packed in prokaryotes, resulting in frequent overlap. Thus, detection of translation initiation sites and/or selection of the correct coding regions remain difficult unless appropriate biological knowledge (about the structure of a gene) is imbedded in the approach. RESULTS We have developed a new program that automatically identifies biologically significant candidate genes in a bacterial genome. Twenty-six complete prokaryotic genomes were analyzed using this tool, and the accuracy of gene finding was assessed by comparison with existing annotations. This analysis revealed that, despite the enormous effort of genome program annotators, a small but not negligible number of genes annotated within the framework of sequencing projects are likely to be partially inaccurate or plainly wrong. Moreover, the analysis of several putative new genes shows that, as expected, many short genes have escaped annotation. In most cases, these new genes revealed frameshifts that could be either artifacts or genuine frameshifts. Some entirely unexpected new genes have also been identified. This allowed us to get a more complete picture of prokaryotic genomes. The results of this procedure are progressively integrated into the SWISS-PROT reference databank. CONCLUSIONS The results described in the present study show that our procedure is very satisfactory in terms of gene finding accuracy. 
Except in a few cases, discrepancies between our results and annotations provided by individual authors can be accounted for by the nature of each annotation process or by specific characteristics of some genomes. This stresses that close cooperation between scientists and regular updating and curation of the findings in databases are clearly required to reduce the level of errors in genome annotation (and hence to limit the unfortunate spreading of errors through centralized data libraries).
Affiliation(s)
- Stéphanie Bocs
- Laboratoire Génome et Informatique, Université de Versailles, 91034 Evry Cedex, France
- Antoine Danchin
- HKU-Pasteur Research Center, Pokfulam, Hong-Kong
- Génétique des Génomes Bactériens, Institut Pasteur, 75724 Paris Cedex 15, France
- Claudine Médigue
- Génétique des Génomes Bactériens, Institut Pasteur, 75724 Paris Cedex 15, France
|
18
|
Sonnhammer EL, Wootton JC. Integrated graphical analysis of protein sequence features predicted from sequence composition. Proteins 2001; 45:262-73. [PMID: 11599029 DOI: 10.1002/prot.1146] [Citation(s) in RCA: 32] [Impact Index Per Article: 1.4] [Indexed: 11/08/2022]
Abstract
Several protein sequence analysis algorithms are based on properties of amino acid composition and repetitiveness. These include methods for prediction of secondary structure elements, coiled-coils, transmembrane segments or signal peptides, and for assignment of low-complexity, nonglobular, or intrinsically unstructured regions. The quality of such analyses can be greatly enhanced by graphical software tools that present predicted sequence features together in context and allow judgment to be focused simultaneously on several different types of supporting information. For these purposes, we describe the SFINX package, which allows many different sets of segmental or continuous-curve sequence feature data, generated by individual external programs, to be viewed in combination alongside a sequence dot-plot or a multiple alignment of database matches. The implementation is currently based on extensions to the graphical viewers Dotter and Blixem and scripts that convert data from external programs to a simple generic data definition format called SFS. We describe applications in which dot-plots and flanking database matches provide valuable contextual information for analyses based on compositional and repetitive sequence features. The system is also useful for comparing results from algorithms run with a range of parameters to determine appropriate values for defaults or cutoffs for large-scale genomic analyses.
Affiliation(s)
- E L Sonnhammer
- Center for Genomics and Bioinformatics, Karolinska Institutet, Stockholm, Sweden.
|
19
|
Pertsemlidis A, Pande A, Miller B, Schilling P, Wei MH, Lerman MI, Minna JD, Garner HR, Mittelman D. PANORAMA: an integrated Web-based sequence analysis tool and its role in gene discovery. Genomics 2000; 70:300-6. [PMID: 11161780 DOI: 10.1006/geno.2000.6359] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 11/22/2022]
Abstract
As the exponential growth of DNA sequence information in databases continues, the task of converting this deposited information into knowledge becomes more dependent on integrative sequence analysis and visualization tools. PANORAMA is an Internet-accessible software package that performs a variety of informatics analyses on a given DNA sequence and returns a visual and interactive representation of the results. Its design is modular, so that further sequence analysis tools can be integrated with minimal effort. The utility of PANORAMA is demonstrated in the analysis of 650 kb of human genomic DNA from chromosome region 3p21.3, a region of potential tumor suppressor genes involved in lung cancer, breast cancer, and other forms of cancer. PANORAMA aided in the discovery of genes and alternate splice forms of known exons, in the demarcation of intron-exon boundaries, and in the identification of promoter regions and polymorphisms, all of which contributed to a better understanding of the region. PANORAMA is available on the World Wide Web at http://atlas.swmed.edu.
Affiliation(s)
- A Pertsemlidis
- Department of Biochemistry, University of Texas Southwestern Medical Center, Dallas, Texas 75390, USA
|
20
|
Abstract
Computational genomics is a subfield of computational biology that deals with the analysis of entire genome sequences. Transcending the boundaries of classical sequence analysis, computational genomics exploits the inherent properties of entire genomes by modelling them as systems. We review recent developments in the field, discuss in some detail a number of novel approaches that take into account the genomic context and argue that progress will be made by novel knowledge representation and simulation technologies.
Affiliation(s)
- S Tsoka
- Research Programme, The European Bioinformatics Institute, EMBL Cambridge Outstation, UK
|
21
|
Waugh M, Hraber P, Weller J, Wu Y, Chen G, Inman J, Kiphart D, Sobral B. The phytophthora genome initiative database: informatics and analysis for distributed pathogenomic research. Nucleic Acids Res 2000; 28:87-90. [PMID: 10592189 PMCID: PMC102488 DOI: 10.1093/nar/28.1.87] [Citation(s) in RCA: 41] [Impact Index Per Article: 1.7] [Received: 09/01/1999] [Revised: 10/18/1999] [Accepted: 10/18/1999] [Indexed: 11/14/2022] Open
Abstract
The Phytophthora Genome Initiative (PGI) is a distributed collaboration to study the genome and evolution of a particularly destructive group of plant pathogenic oomycete, with the goal of understanding the mechanisms of infection and resistance. NCGR provides informatics support for the collaboration as well as a centralized data repository. In the pilot phase of the project, several investigators prepared Phytophthora infestans and Phytophthora sojae EST and Phytophthora sojae BAC libraries and sent them to another laboratory for sequencing. Data from sequencing reactions were transferred to NCGR for analysis and curation. An analysis pipeline transforms raw data by performing simple analyses (i.e., vector removal and similarity searching) that are stored and can be retrieved by investigators using a web browser. Here we describe the database and access tools, provide an overview of the data therein and outline future plans. This resource has provided a unique opportunity for the distributed, collaborative study of a genus from which relatively little sequence data are available. Results may lead to insight into how better to control these pathogens. The homepage of PGI can be accessed at http://www.ncgr.org/pgi, with database access through the database access hyperlink.
Affiliation(s)
- M Waugh
- The National Center for Genome Resources, 1800A Old Pecos Trail, Santa Fe, NM 87505, USA
|
22
|
Lin W, Lai CH, Tang CJ, Huang CJ, Tang TK. Identification and gene structure of a novel human PLZF-related transcription factor gene, TZFP. Biochem Biophys Res Commun 1999; 264:789-95. [PMID: 10544010 DOI: 10.1006/bbrc.1999.1594] [Citation(s) in RCA: 27] [Impact Index Per Article: 1.1] [Indexed: 11/22/2022]
Abstract
A novel cDNA clone was identified through yeast two-hybrid experiments. Following cross-examination between the cDNA clones, EST clones, and the cosmid clone, we could digitally assemble a new zinc finger transcription factor gene. This predicted gene has a cDNA size of about 1960 bp and is translated into a 487-amino-acid protein. According to database analysis, this gene contains three C2H2 zinc finger motifs and is highly related to human PLZF (promyelocytic leukemia zinc finger protein). The full-length coding region of the gene was isolated, and its sequences were confirmed by DNA sequencing. Interestingly, one splicing variant lacking exon III was also identified. Northern blot analysis revealed that this gene is mainly expressed in human testis. In conclusion, we have identified a new member of the PLZF zinc finger protein family, the testis zinc finger protein (TZFP), which is mainly expressed in testis tissue.
Affiliation(s)
- W Lin
- Institute of Biomedical Sciences, Academia Sinica, Taipei, 115, Taiwan.
|
23
|
Ruault M, Trichet V, Gimenez S, Boyle S, Gardiner K, Rolland M, Roizès G, De Sario A. Juxta-centromeric region of human chromosome 21 is enriched for pseudogenes and gene fragments. Gene 1999; 239:55-64. [PMID: 10571034 DOI: 10.1016/s0378-1119(99)00381-9] [Citation(s) in RCA: 15] [Impact Index Per Article: 0.6] [Indexed: 11/28/2022]
Abstract
A physical map including four pseudogenes and 10 gene fragments and spanning 500 kb in the juxta-centromeric region of the long arm of human chromosome 21 is presented. cDNA fragments isolated from a selected cDNA library were characterized and mapped to the 831B6 YAC and to two BAC contigs that cover 250 kb of the region. An 85 kb genomic sequence located in the proximal region of the map was analyzed for putative exons. Four pseudogenes were found, including psiIGSF3, psiEIF3, psiGCT-rel whose functional copies map to chromosome 1p13, chromosome 2 and chromosome 22q11, respectively. The TTLL1 pseudogene corresponds to a new gene whose functional copy maps to chromosome 22q13. Ten gene fragments represent novel sequences that have related sequences on different human chromosomes and show 97-100% nucleotide identity to chromosome 21. These may correspond to pseudogenes on chromosome 21 and to functional genes in other chromosomes. The 85 kb genomic sequence was analyzed also for GC content, CpG islands, and repetitive sequence distribution. A GC-poor L isochore spanning 40 kb from satellite 1 was observed in the most centromeric region, next to a GC-rich H isochore that is a candidate region for the presence of functional genes. The pericentric duplication of a 7.8 kb region that is derived from the 22q13 chromosome band is described. We showed that the juxta-centromeric region of human chromosome 21 is enriched for retrotransposed pseudogenes and gene fragments transferred by interchromosome duplications, but we do not rule out the possibility that the region harbors functional genes also.
Affiliation(s)
- M Ruault
- Séquences Répétées et Centromères Humains, CNRS UPR 1142, Institut de Biologie, Montpellier, France
|
24
|
Bailey LC, Searls DB, Overton GC. Analysis of EST-driven gene annotation in human genomic sequence. Genome Res 1998; 8:362-76. [PMID: 9548972 DOI: 10.1101/gr.8.4.362] [Citation(s) in RCA: 52] [Impact Index Per Article: 2.0] [Indexed: 02/07/2023]
Abstract
We have performed a systematic analysis of gene identification in genomic sequence by similarity search against expressed sequence tags (ESTs) to assess the suitability of this method for automated annotation of the human genome. A BLAST-based strategy was constructed to examine the potential of this approach, and was applied to test sets containing all human genomic sequences longer than 5 kb in public databases, plus 300 kb of exhaustively characterized benchmark sequence. At high stringency, 70%-90% of all annotated genes are detected by near-identity to EST sequence; >95% of ESTs aligning with well-annotated sequences overlap a gene. These ESTs provide immediate access to the corresponding cDNA clones for follow-up laboratory verification and subsequent biologic analysis. At lower stringency, up to 97% of annotated genes were identified by similarity to ESTs. The apparent false-positive rate rose to 55% of ESTs among all sequences and 20% among benchmark sequences at the lowest stringency, indicating that many genes in public database entries are unannotated. Approximately half of the alignments span multiple exons, and thus aid in the construction of gene predictions and elucidation of alternative splicing. In addition, ESTs from multiple cDNA libraries frequently cluster over genes, providing a starting point for crude expression profiles. Clone IDs may be used to form EST pairs, and particularly to extend models by associating alignments of lower stringency with high-quality alignments. These results demonstrate that EST similarity search is a practical general-purpose annotation technique that complements pattern recognition methods as a tool for gene characterization.
Affiliation(s)
- L C Bailey
- Computational Biology and Informatics Laboratory, Department of Genetics, University of Pennsylvania School of Medicine, Philadelphia, Pennsylvania 19104, USA.
|