1
Pearson A, Lladser ME. On latent idealized models in symbolic datasets: unveiling signals in noisy sequencing data. J Math Biol 2023; 87:26. [PMID: 37428265 DOI: 10.1007/s00285-023-01961-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/29/2020] [Revised: 06/19/2023] [Accepted: 06/25/2023] [Indexed: 07/11/2023]
Abstract
Data taking values on discrete sample spaces are the embodiment of modern biological research. "Omics" experiments based on high-throughput sequencing produce millions of symbolic outcomes in the form of reads (i.e., DNA sequences of a few dozen to a few hundred nucleotides). Unfortunately, these intrinsically non-numerical datasets often deviate dramatically from natural assumptions a practitioner might make, and the possible sources of this deviation are usually poorly characterized. This contrasts with numerical datasets where Gaussian-type errors are often well-justified. To overcome this hurdle, we introduce the notion of latent weight, which measures the largest expected fraction of samples from a probabilistic source that conform to a model in a class of idealized models. We examine various properties of latent weights, which we specialize to the class of exchangeable probability distributions. As proof of concept, we analyze DNA methylation data from the 22 human autosome pairs. Contrary to what is usually assumed in the literature, we provide strong evidence that highly specific methylation patterns are overrepresented at some genomic locations when latent weights are taken into account.
Affiliation(s)
- Antony Pearson
- Department of Applied Mathematics, University of Colorado Boulder, Boulder, CO, USA
- Manuel E Lladser
- Department of Applied Mathematics, University of Colorado Boulder, Boulder, CO, USA.
2
Error and error mitigation in low-coverage genome assemblies. PLoS One 2011; 6:e17034. [PMID: 21340033 PMCID: PMC3038916 DOI: 10.1371/journal.pone.0017034] [Citation(s) in RCA: 30] [Impact Index Per Article: 2.1] [Received: 11/19/2010] [Accepted: 01/10/2011] [Indexed: 11/19/2022] Open
Abstract
The recent release of twenty-two new genome sequences has dramatically increased the data available for mammalian comparative genomics, but twenty of these new sequences are currently limited to ∼2× coverage. Here we examine the extent of sequencing error in these 2× assemblies, and its potential impact in downstream analyses. By comparing 2× assemblies with high-quality sequences from the ENCODE regions, we estimate the rate of sequencing error to be 1–4 errors per kilobase. While this error rate is fairly modest, sequencing error can still have surprising effects. For example, an apparent lineage-specific insertion in a coding region is more likely to reflect sequencing error than a true biological event, and the length distribution of coding indels is strongly distorted by error. We find that most errors are contributed by a small fraction of bases with low quality scores, in particular, by the ends of reads in regions of single-read coverage in the assembly. We explore several approaches for automatic sequencing error mitigation (SEM), making use of the localized nature of sequencing error, the fact that it is well predicted by quality scores, and information about errors that comes from comparisons across species. Our automatic methods for error mitigation cannot replace the need for additional sequencing, but they do allow substantial fractions of errors to be masked or eliminated at the cost of modest amounts of over-correction, and they can reduce the impact of error in downstream phylogenomic analyses. Our error-mitigated alignments are available for download.
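The abstract notes that errors are well predicted by quality scores and that a share of them can be masked. A minimal sketch of that general idea follows, assuming Phred-scaled qualities (error probability 10^(-Q/10)) and a hypothetical cutoff `min_q`; it is not the authors' SEM procedure, and the function names are illustrative only.

```python
def phred_to_error_prob(q):
    """Convert a Phred quality score Q to an error probability 10^(-Q/10)."""
    return 10 ** (-q / 10)


def mask_low_quality(seq, quals, min_q=20):
    """Replace every base whose quality is below min_q with 'N'.

    min_q=20 is an assumed, illustrative cutoff, not a value from the paper.
    """
    if len(seq) != len(quals):
        raise ValueError("sequence and quality list must have equal length")
    return "".join(b if q >= min_q else "N" for b, q in zip(seq, quals))


def expected_errors_per_kb(quals):
    """Expected number of errors per 1000 bases implied by the quality scores."""
    if not quals:
        return 0.0
    return 1000 * sum(phred_to_error_prob(q) for q in quals) / len(quals)


masked = mask_low_quality("ACGTACGT", [40, 40, 10, 40, 5, 40, 40, 15])
print(masked)  # ACNTNCGN
```

Masking trades a small loss of correct bases for the removal of the positions most likely to be wrong, which mirrors the over-correction cost the abstract mentions.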
3
Antonov I, Borodovsky M. GeneTack: frameshift identification in protein-coding sequences by the Viterbi algorithm. J Bioinform Comput Biol 2010; 8:535-51. [PMID: 20556861 DOI: 10.1142/s0219720010004847] [Citation(s) in RCA: 26] [Impact Index Per Article: 1.7] [Received: 10/25/2009] [Revised: 02/12/2010] [Accepted: 02/13/2010] [Indexed: 11/18/2022]
Abstract
We describe a new program for ab initio frameshift detection in protein-coding nucleotide sequences. The task is to distinguish the same strand overlapping ORFs that occur in the sequence due to the presence of a frameshifted gene from the same strand overlapping ORFs that encompass true overlapping or adjacent genes. The GeneTack program uses a hidden Markov model (HMM) of genomic sequence with possibly frameshifted protein-coding regions. The Viterbi algorithm finds the maximum likelihood path that discriminates between true adjacent genes and those adjacent protein-coding regions that just appear to be separate entities due to frameshifts. Therefore, the program can identify spurious predictions made by a conventional gene-finding program misled by a frameshift. We tested GeneTack as well as two earlier developed programs FrameD and FSFind on 17 prokaryotic genomes with frameshifts introduced randomly into known genes. We observed that the average frameshift prediction accuracy of GeneTack, in terms of (Sn + Sp)/2 values, was higher by a significant margin than the accuracy of two other programs. In addition, we observed that the average accuracy of GeneTack compares favorably with the accuracy of the FSFind-BLAST program that uses protein database search to verify predicted frameshifts, even though GeneTack does not use external evidence. GeneTack is freely available at http://topaz.gatech.edu/GeneTack/.
Affiliation(s)
- Ivan Antonov
- Division of Computational Science and Engineering, Georgia Institute of Technology, 801 Atlantic Drive, Atlanta, Georgia 30332-0280, USA.
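For readers unfamiliar with the Viterbi algorithm named in this entry, a generic sketch for a discrete HMM follows. The two-state model, its probabilities, and the function name `viterbi` are assumptions made for illustration; GeneTack's actual HMM of frameshifted coding regions is not reproduced here.

```python
import math


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most probable state path for obs under a discrete HMM."""
    if not obs:
        return []

    def log(p):
        # Log space avoids numerical underflow on long sequences.
        return math.log(p) if p > 0 else float("-inf")

    score = [{s: log(start_p[s]) + log(emit_p[s][obs[0]]) for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        score.append({})
        back.append({})
        for s in states:
            # Best predecessor state for s at position t.
            prev, best = max(
                ((r, score[t - 1][r] + log(trans_p[r][s])) for r in states),
                key=lambda x: x[1],
            )
            score[t][s] = best + log(emit_p[s][obs[t]])
            back[t][s] = prev
    # Trace back from the best final state.
    path = [max(states, key=lambda s: score[-1][s])]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]


# Hypothetical toy model: "coding" mostly emits G, "noncoding" mostly emits A.
states = ["coding", "noncoding"]
start_p = {"coding": 0.5, "noncoding": 0.5}
trans_p = {
    "coding": {"coding": 0.9, "noncoding": 0.1},
    "noncoding": {"coding": 0.1, "noncoding": 0.9},
}
emit_p = {
    "coding": {"G": 0.9, "A": 0.1},
    "noncoding": {"G": 0.1, "A": 0.9},
}
print(viterbi("GGGAAA", states, start_p, trans_p, emit_p))
# ['coding', 'coding', 'coding', 'noncoding', 'noncoding', 'noncoding']
```

In a frameshift detector the hidden states would encode reading frames rather than this toy coding/noncoding split, so that a switch of frame along the best path marks a candidate frameshift.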
4
Gupta N, Benhamida J, Bhargava V, Goodman D, Kain E, Kerman I, Nguyen N, Ollikainen N, Rodriguez J, Wang J, Lipton MS, Romine M, Bafna V, Smith RD, Pevzner PA. Comparative proteogenomics: combining mass spectrometry and comparative genomics to analyze multiple genomes. Genome Res 2008; 18:1133-42. [PMID: 18426904 DOI: 10.1101/gr.074344.107] [Citation(s) in RCA: 94] [Impact Index Per Article: 5.5] [Indexed: 11/25/2022]
Abstract
Recent proliferation of low-cost DNA sequencing techniques will soon lead to an explosive growth in the number of sequenced genomes and will turn manual annotations into a luxury. Mass spectrometry recently emerged as a valuable technique for proteogenomic annotations that improves on the state-of-the-art in predicting genes and other features. However, previous proteogenomic approaches were limited to a single genome and did not take advantage of analyzing mass spectrometry data from multiple genomes at once. We show that such a comparative proteogenomics approach (like comparative genomics) allows one to address the problems that remained beyond the reach of the traditional "single proteome" approach in mass spectrometry. In particular, we show how comparative proteogenomics addresses the notoriously difficult problem of "one-hit-wonders" in proteomics, improves on the existing gene prediction tools in genomics, and allows identification of rare post-translational modifications. We therefore argue that complementing DNA sequencing projects by comparative proteogenomics projects can be a viable approach to improve both genomic and proteomic annotations.
Affiliation(s)
- Nitin Gupta
- Bioinformatics Program, University of California San Diego, La Jolla, California 92093, USA.
5
Masoom H, Datta S, Asif A, Cunningham L, Wu G. A Fast Algorithm for Detecting Frame Shifts in DNA sequences. 2006 IEEE Symposium on Computational Intelligence and Bioinformatics and Computational Biology 2006. [DOI: 10.1109/cibcb.2006.330971] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.2] [Indexed: 07/19/2023]
6
Perrodou E, Deshayes C, Muller J, Schaeffer C, Van Dorsselaer A, Ripp R, Poch O, Reyrat JM, Lecompte O. ICDS database: interrupted CoDing sequences in prokaryotic genomes. Nucleic Acids Res 2006; 34:D338-43. [PMID: 16381882 PMCID: PMC1347423 DOI: 10.1093/nar/gkj060] [Citation(s) in RCA: 16] [Impact Index Per Article: 0.8] [Indexed: 11/12/2022] Open
Abstract
Unrecognized frameshifts, in-frame stop codons and sequencing errors lead to Interrupted CoDing Sequence (ICDS) that can seriously affect all subsequent steps of functional characterization, from in silico analysis to high-throughput proteomic projects. Here, we describe the Interrupted CoDing Sequence database containing ICDS detected by a similarity-based approach in 80 complete prokaryotic genomes. ICDS can be retrieved by species browsing or similarity searches via a web interface (http://www-bio3d-igbmc.u-strasbg.fr/ICDS/). The definition of each interrupted gene is provided as well as the ICDS genomic localization with the surrounding sequence. Furthermore, to facilitate the experimental characterization of ICDS, we propose optimized primers for re-sequencing purposes. The database will be regularly updated with additional data from ongoing sequenced genomes. Our strategy has been validated by three independent tests: (i) ICDS prediction on a benchmark of artificially created frameshifts, (ii) comparison of predicted ICDS and results obtained from the comparison of the two genomic sequences of Bacillus licheniformis strain ATCC 14580 and (iii) re-sequencing of 25 predicted ICDS of the recently sequenced genome of Mycobacterium smegmatis. This allows us to estimate the specificity and sensitivity (95 and 82%, respectively) of our program and the efficiency of primer determination.
Affiliation(s)
- Caroline Deshayes
- Inserm-UMR 570, Unité de Pathogénie des Infections Systémiques, Groupe Avenir, Paris Cedex 15, F-75730, France
- Christine Schaeffer
- Laboratoire de Spectrométrie de Masse Bio-Organique (LSMBO) UMR 7512, ECPM, 25 rue Becquerel, Strasbourg F-67087 Cedex 2, France
- Alain Van Dorsselaer
- Laboratoire de Spectrométrie de Masse Bio-Organique (LSMBO) UMR 7512, ECPM, 25 rue Becquerel, Strasbourg F-67087 Cedex 2, France
- Jean-Marc Reyrat
- Inserm-UMR 570, Unité de Pathogénie des Infections Systémiques, Groupe Avenir, Paris Cedex 15, F-75730, France
- Odile Lecompte
- To whom correspondence should be addressed. Tel: +33 3 88 65 32 00; Fax: +33 3 88 65 32 01;
7
Zheng Y, Roberts RJ, Kasif S. Segmentally variable genes: a new perspective on adaptation. PLoS Biol 2004; 2:E81. [PMID: 15094797 PMCID: PMC387263 DOI: 10.1371/journal.pbio.0020081] [Citation(s) in RCA: 19] [Impact Index Per Article: 0.9] [Received: 10/15/2003] [Accepted: 01/20/2004] [Indexed: 11/30/2022] Open
Abstract
Genomic sequence variation is the hallmark of life and is key to understanding diversity and adaptation among the numerous microorganisms on earth. Analysis of the sequenced microbial genomes suggests that genes are evolving at many different rates. We have attempted to derive a new classification of genes into three broad categories: lineage-specific genes that evolve rapidly and appear unique to individual species or strains; highly conserved genes that frequently perform housekeeping functions; and partially variable genes that contain highly variable regions, at least 70 amino acids long, interspersed among well-conserved regions. The latter we term segmentally variable genes (SVGs), and we suggest that they are especially interesting targets for biochemical studies. Among these genes are ones necessary to deal with the environment, including genes involved in host–pathogen interactions, defense mechanisms, and intracellular responses to internal and environmental changes. For the most part, the detailed function of these variable regions remains unknown. We propose that they are likely to perform important binding functions responsible for protein–protein, protein–nucleic acid, or protein–small molecule interactions. Discerning their function and identifying their binding partners may offer biologists new insights into the basic mechanisms of adaptation, context-dependent evolution, and the interaction between microbes and their environment. Segmentally variable genes show a mosaic pattern of one or more rapidly evolving, variable regions. Discerning their function may provide new insights into the forces that shape genome diversity and adaptation.
Affiliation(s)
- Yu Zheng
- Bioinformatics Graduate Program, Boston University, Boston, Massachusetts, USA.
8
Abstract
Using a scientific measurement without an estimate of its error is like lending money to a stranger. Given the explosion in nucleic acid and protein sequence and structural data, what risks are the scientific and medical communities running in using these databases? Is there an 'ombudsman' who speaks for the users of the data? CODATA, the Committee on Data for Science and Technology of the International Council of Scientific Unions, was established to improve the quality, reliability, processing, management, and accessibility of data for science and technology. The CODATA Task Group on Biological Macromolecules has surveyed quality control procedures of archival databanks in molecular biology. Our role is 'to advise, to be consulted, and to warn.' This report describes the kinds and extents of errors that may appear in nucleic acid and protein databases, and presents an agenda for future work to improve the quality of these databases. The results of the survey appear on the web: http://www.codata.org/codata/tgreports/tg_reps.html.
9
Usuka J, Brendel V. Gene structure prediction by spliced alignment of genomic DNA with protein sequences: increased accuracy by differential splice site scoring. J Mol Biol 2000; 297:1075-85. [PMID: 10764574 DOI: 10.1006/jmbi.2000.3641] [Citation(s) in RCA: 40] [Impact Index Per Article: 1.6] [Indexed: 11/22/2022]
Abstract
Gene identification in genomic DNA from eukaryotes is complicated by the vast combinatorial possibilities of potential exon assemblies. If the gene encodes a protein that is closely related to known proteins, gene identification is aided by matching similarity of potential translation products to those target proteins. The genomic DNA and protein sequences can be aligned directly by scoring the implied residues of in-frame nucleotide triplets against the protein residues in conventional ways, while allowing for long gaps in the alignment corresponding to introns in the genomic DNA. We describe a novel method for such spliced alignment. The method derives an optimal alignment based on scoring for both sequence similarity of the predicted gene product to the protein sequence and intrinsic splice site strength of the predicted introns. Application of the method to a representative set of 50 known genes from Arabidopsis thaliana showed significant improvement in prediction accuracy compared to previous spliced alignment methods. The method is also more accurate than ab initio gene prediction methods, provided sufficiently close target proteins are available. In view of the fast growth of public sequence repositories, we argue that close targets will be available for the majority of novel genes, making spliced alignment an excellent practical tool for high-throughput automated genome annotation.
Affiliation(s)
- J Usuka
- Department of Chemistry, Stanford University, Stanford, CA, 94305, USA
10
Médigue C, Rose M, Viari A, Danchin A. Detecting and analyzing DNA sequencing errors: toward a higher quality of the Bacillus subtilis genome sequence. Genome Res 1999; 9:1116-27. [PMID: 10568751 PMCID: PMC310837 DOI: 10.1101/gr.9.11.1116] [Citation(s) in RCA: 34] [Impact Index Per Article: 1.3] [Indexed: 11/25/2022]
Abstract
During the determination of a DNA sequence, the introduction of artifactual frameshifts and/or in-frame stop codons in putative genes can lead to misprediction of gene products. Detection of such errors with a method based on protein similarity matching is only possible when related sequences are available in databases. Here, we present a method to detect frameshift errors in DNA sequences that is based on the intrinsic properties of the coding sequences. It combines the results of two analyses, the search for translational initiation/termination sites and the prediction of coding regions. This method was used to screen the complete Bacillus subtilis genome sequence and the regions flanking putative errors were resequenced for verification. This procedure allowed us to correct the sequence and to analyze in detail the nature of the errors. Interestingly, in several cases in-frame termination codons or frameshifts were not sequencing errors but confirmed to be present in the chromosome, indicating that the genes are either nonfunctional (pseudogenes) or subject to regulatory processes such as programmed translational frameshifts. The method can be used for checking the quality of the sequences produced by any prokaryotic genome sequencing project.
Affiliation(s)
- C Médigue
- Institut Pasteur REG, F-75724 Paris Cedex 15, France. claudine.medigue@snv.jussieu.fr
11
Quentin Y, Voiblet C, Martin F, Fichant G. Protein-coding region discovery in organisms underrepresented in databases. Computers & Chemistry 1999; 23:209-17. [PMID: 10404616 DOI: 10.1016/s0097-8485(99)00016-9] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 10/18/2022]
Abstract
The prediction of coding sequences has received a lot of attention during the last decade. We can distinguish two kinds of methods, those that rely on training with sets of example and counter-example sequences, and those that exploit the intrinsic properties of the DNA sequences to be analyzed. The former are generally more powerful but their domains of application are limited by the availability of a training set. The latter avoid this drawback but can only be applied to sequences that are long enough to allow computation of the statistics. Here, we present a method that fills the gap between the two approaches. A learning step is applied using a set of sequences that are assumed to contain coding and non-coding regions, but with the boundaries of these regions unknown. A test step then uses the discriminant function obtained during the learning to predict coding regions in sequences from the same organism. The learning relies upon a correspondence analysis and prediction is presented on a graphical display. The method has been evaluated on a sample of yeast sequences, and the analysis of a set of expressed sequence tags from the Eucalyptus globulus-Pisolithus tinctorius ectomycorrhiza illustrates the relevance of the approach in its biological context.
12
Birney E, Thompson JD, Gibson TJ. PairWise and SearchWise: finding the optimal alignment in a simultaneous comparison of a protein profile against all DNA translation frames. Nucleic Acids Res 1996; 24:2730-9. [PMID: 8759004 PMCID: PMC145991 DOI: 10.1093/nar/24.14.2730] [Citation(s) in RCA: 117] [Impact Index Per Article: 4.0] [Indexed: 02/02/2023] Open
Abstract
DNA translation frames can be disrupted for several reasons, including: (i) errors in sequence determination; (ii) RNA processing, such as intron removal and guide RNA editing; (iii) less commonly, polymerase frameshifting during transcription or ribosomal frameshifting during translation. Frameshifts frequently confound computational activities involving homologous sequences, such as database searches and inferences on structure, function or phylogeny made from multiple alignments. A dynamic alignment algorithm is reported here which compares a protein profile (a residue scoring matrix for one or more aligned sequences) against the three translation frames of a DNA strand, allowing frameshifting. The algorithm has been incorporated into a new package, WiseTools, for comparison of biological sequences. A protein profile can be compared against either a DNA sequence or a protein sequence. The program PairWise may be used interactively for alignment of any two sequence inputs. SearchWise can perform combinations of searches through DNA or protein databases by a protein profile or DNA sequence. Routine application of the programs has revealed a set of database entries with frameshifts caused by errors in sequence determination.
Affiliation(s)
- E Birney
- European Molecular Biology Laboratory, Heidelberg, Germany
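The "three translation frames of a DNA strand" referred to in this entry can be illustrated with a short sketch, assuming the standard genetic code and a forward strand only. The names `CODON_TABLE`, `translate`, and `three_frames` are illustrative and are not part of WiseTools; the profile alignment itself is not implemented.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(dna, frame=0):
    """Translate one forward frame (0, 1 or 2); unknown codons become 'X'."""
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(c, "X") for c in codons)


def three_frames(dna):
    """Return the translations of the three forward frames of a DNA strand."""
    return [translate(dna, f) for f in range(3)]


print(three_frames("ATGGCCTAA"))  # ['MA*', 'WP', 'GL']
```

A single inserted or deleted base moves the true protein from one of these three strings to another partway through the sequence, which is why an aligner that tolerates frameshifts has to consider all three frames at once.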
13
Xu Y, Mural RJ, Uberbacher EC. An iterative algorithm for correcting sequencing errors in DNA coding regions. J Comput Biol 1996; 3:333-44. [PMID: 8891953 DOI: 10.1089/cmb.1996.3.333] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.0] [Indexed: 02/02/2023] Open
Abstract
Insertion and deletion (indel) sequencing errors in DNA coding regions disrupt DNA-to-protein translation frames, and hence make most frame-sensitive coding recognition approaches fail. This paper extends the authors' previous work on indel detection and "correction" algorithms, and presents a more effective algorithm for localizing indels that appear in DNA coding regions and "correcting" the located indels by inserting or deleting DNA bases. The algorithm localizes indels by discovering changes of the preferred translation frames within presumed coding regions, and then "corrects" them to restore a consistent translation frame within each coding region. An iterative strategy is exploited to repeatedly localize and "correct" indels until no more indels can be found. Test results have shown that this improved algorithm can detect and "correct" more indels while not worsening the rate of introduction of false indels when compared to the authors' previous work.
Affiliation(s)
- Y Xu
- Computer Science and Mathematics Division, Oak Ridge National Laboratory, Tennessee 37831-6364, USA.
14
Blanquer-Maumont A, Crouau-Roy B. Polymorphism, monomorphism, and sequences in conserved microsatellites in primate species. J Mol Evol 1995; 41:492-7. [PMID: 7563137 DOI: 10.1007/bf00160321] [Citation(s) in RCA: 49] [Impact Index Per Article: 1.6] [Indexed: 01/26/2023]
Abstract
Dimeric short tandem repeats are a source of highly polymorphic markers in the mammalian genome. Genetic variation at these hypervariable loci is extensively used for linkage analysis, for the identification of individuals, and may be useful for interpopulation and interspecies studies. In this paper, we analyze the variability and the sequences of a segment including three microsatellites, first described in man, in several species of primates (chimpanzee, orangutan, gibbon, and macaque) using the heterologous primers (man primers). This region is located on the human chromosome 6p, near the tumor necrosis factor genes, in the major histocompatibility complex. The fact that these primers work in all species studied indicates that they are conserved throughout the different lineages of the two superfamilies, the Hominoidea and the Cercopithecidea, represented by the macaques. However, the intervening sequence displays intraspecific and interspecific variability. The sites of base substitutions and the insertion/deletion events are not evenly distributed within this region. The data suggest that it is necessary to have a minimal number of repeats to increase the rate of mutation sufficiently to allow the development of polymorphism. In some species, the microsatellites present single base variations which reduce the number of contiguous repeats, thus apparently slowing the rate of additional slippage events. Species with such variations or a low number of repeats are monomorphic. These microsatellite sequences are informative in the comparison of closely related species and reflect the phylogeny of the Old World monkeys, apes, and man.
Affiliation(s)
- A Blanquer-Maumont
- CNRS-CIGH (Center of Immunology and Human Genetic), UPR 8291, CHU Purpan, Toulouse, France
15
Fichant GA, Quentin Y. A frameshift error detection algorithm for DNA sequencing projects. Nucleic Acids Res 1995; 23:2900-8. [PMID: 7659513 PMCID: PMC307128 DOI: 10.1093/nar/23.15.2900] [Citation(s) in RCA: 21] [Impact Index Per Article: 0.7] [Indexed: 01/26/2023] Open
Abstract
During the determination of DNA sequences, frameshift errors are not the most frequent but they are the most bothersome as they corrupt the amino acid sequence over several residues. Detection of such errors by sequence alignment is only possible when related sequences are found in the databases. To avoid this limitation, we have developed a new tool based on the distribution of non-overlapping 3-tuples or 6-tuples in the three frames of an ORF. The method relies upon the result of a correspondence analysis. It has been extensively tested on Bacillus subtilis and Saccharomyces cerevisiae sequences and has also been examined with human sequences. The results indicate that it can detect frameshift errors affecting as few as 20 bp with a low rate of false positives (no more than 1.0/1000 bp scanned). The proposed algorithm can be used to scan a large collection of data, but it is mainly intended for laboratory practice as a tool for checking the quality of the sequences produced during a sequencing project.
Affiliation(s)
- G A Fichant
- Institut de Génétique et Microbiologie, Université Paris-Sud, Orsay, France
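The input statistic described in this entry, the distribution of non-overlapping 3-tuples or 6-tuples in the three frames of an ORF, can be tabulated as in the sketch below. The function name `frame_tuple_counts` is hypothetical, and the correspondence analysis that the method applies to these counts is not shown.

```python
from collections import Counter


def frame_tuple_counts(orf, k=3):
    """Count non-overlapping k-tuples in each of the three frames of an ORF.

    k must be a multiple of 3 (3-tuples or 6-tuples) so that stepping by k
    keeps every tuple in the same reading frame.
    """
    if k <= 0 or k % 3 != 0:
        raise ValueError("k should be a positive multiple of 3")
    orf = orf.upper()
    counts = []
    for frame in range(3):
        tuples = (orf[i:i + k] for i in range(frame, len(orf) - k + 1, k))
        counts.append(Counter(tuples))
    return counts


for frame, counter in enumerate(frame_tuple_counts("ATGATGATG")):
    print(frame, dict(counter))
# 0 {'ATG': 3}
# 1 {'TGA': 2}
# 2 {'GAT': 2}
```

Because coding sequence has a frame-specific tuple composition, a stretch whose counts resemble another frame's profile is a natural place to look for a frameshift.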
16
Bork P, Ouzounis C, Casari G, Schneider R, Sander C, Dolan M, Gilbert W, Gillevet PM. Exploring the Mycoplasma capricolum genome: a minimal cell reveals its physiology. Mol Microbiol 1995; 16:955-67. [PMID: 7476192 DOI: 10.1111/j.1365-2958.1995.tb02321.x] [Citation(s) in RCA: 68] [Impact Index Per Article: 2.3] [Indexed: 01/25/2023]
Abstract
We report on the analysis of 214 kb of the parasitic eubacterium Mycoplasma capricolum sequenced by genomic walking techniques. The 287 putative proteins detected to date represent about half of the estimated total number of 500 predicted for this organism. A large fraction of these (75%) can be assigned a likely function as a result of similarity searches. Several important features of the functional organization of this small genome are already apparent. Among these are (i) the expected relatively large number of enzymes involved in metabolic transport and activation, for efficient use of host cell nutrients; (ii) the presence of anabolic enzymes; (iii) the unexpected diversity of enzymes involved in DNA replication and repair; and (iv) a sizeable number of orthologues (82 so far) in Escherichia coli. This survey is beginning to provide a detailed view of how M. capricolum manages to maintain essential cellular processes with a genome much smaller than that of its bacterial relatives.
Affiliation(s)
- P Bork
- Max-Delbrück-Centre for Molecular Medicine, Berlin-Buch, Germany
17
Abstract
DNA sequencing efforts frequently uncover genes other than the targeted ones. We have used rapid database scanning methods to search for undescribed eubacterial and archean protein coding frames in regions flanking known genes. By searching all prokaryotic DNA sequences not marked as coding for proteins or stable RNAs against the protein databases, we have identified more than 450 new examples of bacterial proteins, as well as a smaller number of possible revisions to known proteins, at a surprisingly high rate of one new protein or revision for every 24 initial DNA sequences or 8,300 nucleotides examined. Seven proteins are members of families which have not been described in prokaryotic sequences. We also describe 49 re-interpretations of existing sequence data of particular biological significance.
Affiliation(s)
- K Robison
- Department of Cellular and Molecular Biology, Harvard University, Cambridge, Massachusetts 02138
18
Lawrence CB, Solovyev VV. Assignment of position-specific error probability to primary DNA sequence data. Nucleic Acids Res 1994; 22:1272-80. [PMID: 8165143 PMCID: PMC523653 DOI: 10.1093/nar/22.7.1272] [Citation(s) in RCA: 28] [Impact Index Per Article: 0.9] [Indexed: 01/29/2023] Open
Abstract
DNA sequence predicted from polyacrylamide gel-based technologies is inaccurate because of variations in the quality of the primary data due to limitations of the technology, and to sequence-specific variations due to nucleotide interactions within the DNA molecule and with the gel. The ability to recognize the probability of error in the primary data will be useful in reconstructing the target sequence of a DNA sequencing project, and in estimating the accuracy of the final sequence. This paper describes the use of linear discriminant analysis to assign position-specific probabilities of incorrect, over- and under-prediction of nucleotides for each predicted nucleotide position in primary sequence data generated by a gel-based DNA sequencing technology. Using this method, most of the error potential in primary sequence data can be assigned to a limited number of discrete positions. The use of probability values in the sequence reconstruction process, and in estimating the accuracy of consensus sequence determination is described.
Affiliation(s)
- C B Lawrence
- Department of Cell Biology, Baylor College of Medicine, Houston, TX 77030
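Once each position carries an error probability, simple summaries of sequence accuracy follow. The sketch below assumes that errors at different positions are independent, which the abstract does not state; the function names and the `mass` parameter are illustrative, and the linear discriminant analysis that produces the probabilities is not shown.

```python
import math


def expected_errors(error_probs):
    """Expected number of erroneous positions (sum of per-position probabilities)."""
    return sum(error_probs)


def prob_error_free(error_probs):
    """Probability that no position is wrong, assuming independent errors."""
    return math.prod(1.0 - p for p in error_probs)


def most_error_prone(error_probs, mass=0.5):
    """Smallest set of positions carrying at least `mass` of the expected errors.

    Positions are returned in order of decreasing error probability.
    """
    total = sum(error_probs)
    ranked = sorted(range(len(error_probs)),
                    key=lambda i: error_probs[i], reverse=True)
    chosen, cum = [], 0.0
    for i in ranked:
        if cum >= mass * total:
            break
        chosen.append(i)
        cum += error_probs[i]
    return chosen


probs = [0.001, 0.2, 0.001, 0.3, 0.001]
print(most_error_prone(probs, mass=0.9))  # [3, 1]
```

In this toy input two of the five positions carry nearly all of the expected error, which is the kind of concentration the abstract describes.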
19
Bork P. Hundreds of ankyrin-like repeats in functionally diverse proteins: mobile modules that cross phyla horizontally? Proteins 1993; 17:363-74. [PMID: 8108379 DOI: 10.1002/prot.340170405] [Citation(s) in RCA: 393] [Impact Index Per Article: 12.3] [Indexed: 01/28/2023]
Abstract
Based on pattern searches and systematic database screening, almost 650 different ankyrin-like (ANK) repeats from nearly all phyla have been identified; more than 150 of them are reported here for the first time. Their presence in functionally diverse proteins such as enzymes, toxins, and transcription factors strongly suggests domain shuffling, but their occurrence in prokaryotes and yeast excludes exon shuffling. The spreading mechanism remains unknown, but in at least three cases horizontal gene transfer appears to be involved. ANK repeats occur in at least four consecutive copies. The terminal repeats are more variable in sequence. One feature of the internal repeats is a predicted central hydrophobic alpha-helix, which is likely to interact with other repeats. The functions of the ankyrin-like repeats are compatible with a role in protein-protein interactions.
Affiliation(s)
- P Bork
- Max-Delbrück-Centre of Molecular Medicine, Berlin, Germany
20
Abstract
Recent developments in the statistical analysis of DNA sequences are reviewed. The pace with which sequence data are being generated and analysed has increased with the growth of the human genome project. Two areas of activity are emphasized: attention to error rates in recorded sequences, and heterogeneity in structure of sequences. There is now empirical evidence suggesting error rates in the range 0.1%-1%, and such rates will affect evolutionary studies since these are about the rates at which DNA sequences from different individuals are expected to differ. Heterogeneity for such quantities as base composition, or lengths between successive subsequences of specified types, may be sufficient to account for observed long-range correlations between bases. The need for statistical models and analyses of DNA sequence data will continue, and will offer interesting challenges.
Affiliation(s)
- B S Weir
- Department of Statistics, North Carolina State University, Raleigh 27695-8203
21
Beck S. Accuracy of DNA sequencing: should the sequence quality be monitored? DNA Sequence: The Journal of DNA Sequencing and Mapping 1993; 4:215-7. [PMID: 8161825 DOI: 10.3109/10425179309015635] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.1] [Indexed: 01/29/2023]