51
52
Mrázek J, Bhaya D, Grossman AR, Karlin S. Highly expressed and alien genes of the Synechocystis genome. Nucleic Acids Res 2001; 29:1590-601. [PMID: 11266562 PMCID: PMC31270 DOI: 10.1093/nar/29.7.1590] [Citation(s) in RCA: 52] [Impact Index Per Article: 2.2] [Indexed: 11/13/2022]
Abstract
Comparisons of codon frequencies of genes to several gene classes are used to characterize highly expressed and alien genes on the Synechocystis PCC6803 genome. The primary gene classes include the ensemble of all genes (average gene), ribosomal protein (RP) genes, translation processing factors (TF) and genes encoding chaperone/degradation proteins (CH). A gene is predicted highly expressed (PHX) if its codon usage is close to that of the RP/TF/CH standards but strongly deviant from the average gene. Putative alien (PA) genes are those for which codon usage is significantly different from all four classes of gene standards. In Synechocystis, 380 genes were identified as PHX. The genes with the highest predicted expression levels include many that encode proteins vital for photosynthesis. Nearly all of the genes of the RP/TF/CH gene classes are PHX. The principal glycolysis enzymes, which may also function in CO2 fixation, are PHX, while none of the genes encoding TCA cycle enzymes are PHX. The PA genes are mostly of unknown function or encode transposases. Several PA genes encode polypeptides that function in lipopolysaccharide biosynthesis. Both PHX and PA genes often form significant clusters (operons). The proteins encoded by PHX and PA genes are described with respect to functional classifications, their organization in the genome and their stoichiometry in multi-subunit complexes.
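The PHX/PA rule described in this abstract can be illustrated with a minimal Python sketch. The abstract does not state the codon usage measure or any thresholds, so the plain L1 distance between codon frequency tables, the function names, and the cutoffs `phx_ratio` and `alien_cutoff` are all assumptions made for illustration, not the authors' method.

```python
from collections import Counter


def codon_frequencies(cds):
    """Relative frequency of each codon in an in-frame coding sequence."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # drop a trailing partial codon
    counts = Counter(cds[i:i + 3] for i in range(0, usable, 3))
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}


def usage_distance(f, g):
    """Assumed measure: L1 distance between two codon frequency tables (0 to 2)."""
    return sum(abs(f.get(c, 0.0) - g.get(c, 0.0)) for c in set(f) | set(g))


def classify(gene, average, standards, phx_ratio=1.5, alien_cutoff=1.0):
    """Label a gene 'PHX', 'PA' or 'neither' using hypothetical cutoffs.

    gene, average and each entry of standards are codon frequency tables;
    standards stands in for the RP/TF/CH gene classes.
    """
    d_avg = usage_distance(gene, average)
    d_std = min(usage_distance(gene, s) for s in standards)
    if d_avg > alien_cutoff and d_std > alien_cutoff:
        return "PA"  # far from the average gene and from every standard
    if d_avg > phx_ratio * d_std:
        return "PHX"  # close to a standard, deviant from the average gene
    return "neither"


average = codon_frequencies("GCTGCCGCAGCG")
standard = codon_frequencies("GCTGCTGCT")
print(classify(codon_frequencies("GCTGCTGCT"), average, [standard]))
```

The ratio test keeps the two conditions of the PHX definition in one comparison; a real analysis would need a measure that accounts for amino acid composition, which this sketch deliberately ignores.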
Affiliation(s)
- J Mrázek
- Department of Mathematics, Stanford University, Stanford, CA 94305-2125, USA
53
Iyer LM, Aravind L, Bork P, Hofmann K, Mushegian AR, Zhulin IB, Koonin EV. Quod erat demonstrandum? The mystery of experimental validation of apparently erroneous computational analyses of protein sequences. Genome Biol 2001; 2:RESEARCH0051. [PMID: 11790254 PMCID: PMC64836 DOI: 10.1186/gb-2001-2-12-research0051] [Citation(s) in RCA: 33] [Impact Index Per Article: 1.4] [Received: 07/03/2001] [Revised: 09/07/2001] [Accepted: 10/04/2001] [Indexed: 11/17/2022]
Abstract
BACKGROUND: Computational predictions are critical for directing the experimental study of protein functions. Therefore it is paradoxical when an apparently erroneous computational prediction seems to be supported by experiment. RESULTS: We analyzed six cases where application of novel or conventional computational methods for protein sequence and structure analysis led to non-trivial predictions that were subsequently supported by direct experiments. We show that, on all six occasions, the original prediction was unjustified, and in at least three cases, an alternative, well-supported computational prediction, incompatible with the original one, could be derived. The most unusual cases involved the identification of an archaeal cysteinyl-tRNA synthetase, a dihydropteroate synthase and a thymidylate synthase, for which experimental verifications of apparently erroneous computational predictions were reported. Using sequence-profile analysis, multiple alignment and secondary-structure prediction, we have identified the unique archaeal 'cysteinyl-tRNA synthetase' as a homolog of extracellular polygalactosaminidases, and the 'dihydropteroate synthase' as a member of the beta-lactamase-like superfamily of metal-dependent hydrolases. CONCLUSIONS: In each of the analyzed cases, the original computational predictions could be refuted and, in some instances, alternative strongly supported predictions were obtained. The nature of the experimental evidence that appears to support these predictions remains an open question. Some of these experiments might signify discovery of extremely unusual forms of the respective enzymes, whereas the results of others could be due to artifacts.
Affiliation(s)
- Lakshminarayan M Iyer
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
- L Aravind
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
- Peer Bork
- EMBL, Biocomputing, Meyerhofstrasse 1, 69117 Heidelberg, Germany
- Arcady R Mushegian
- Stowers Institute for Medical Research, 1000 E 50th Street, Kansas City, MO 64410, USA
- Igor B Zhulin
- School of Biology, Georgia Institute of Technology, Atlanta, GA 30332, USA
- Eugene V Koonin
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
54
Poisson approximations for conditional r-scan lengths of multiple renewal processes and application to marker arrays in biomolecular sequences. J Appl Probab 2000. [DOI: 10.1017/s0021900200016053] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.0] [Indexed: 11/07/2022]
Abstract
This study is motivated by problems of molecular sequence comparison for multiple marker arrays with correlated distributions. In this paper, the model assumes two (or more) kinds of markers, say Markers A and B, distributed along the DNA sequence. The two primary conditions of interest are (i) many of Marker B (say ≥ m) occur, and (ii) few of Marker B (say ≤ l) occur. We title these the conditional r-scan models, and inquire on the extent to which Marker A clusters or is over-dispersed in regions satisfying condition (i) or (ii). Limiting distributions for the extremal r-scan statistics from the A array satisfying conditions (i) and (ii) are derived by extending the Chen-Stein Poisson approximation method.
55
Chen C, Karlin S. Poisson approximations for conditional r-scan lengths of multiple renewal processes and application to marker arrays in biomolecular sequences. J Appl Probab 2000. [DOI: 10.1239/jap/1014842842] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 11/18/2022]
Abstract
This study is motivated by problems of molecular sequence comparison for multiple marker arrays with correlated distributions. In this paper, the model assumes two (or more) kinds of markers, say Markers A and B, distributed along the DNA sequence. The two primary conditions of interest are (i) many of Marker B (say ≥ m) occur, and (ii) few of Marker B (say ≤ l) occur. We title these the conditional r-scan models, and inquire on the extent to which Marker A clusters or is over-dispersed in regions satisfying condition (i) or (ii). Limiting distributions for the extremal r-scan statistics from the A array satisfying conditions (i) and (ii) are derived by extending the Chen-Stein Poisson approximation method.
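As a rough illustration of the conditional r-scan idea, the Python sketch below computes the r-scan lengths of a Marker A array (the distance spanned by r consecutive gaps between A positions) and keeps only those scans whose interval contains many or few B markers. The exact conditioning used in the paper is not given in the abstract, so counting B markers inside the closed interval of each A scan is an assumption, as are the function names.

```python
from bisect import bisect_left, bisect_right


def r_scans(positions, r):
    """Intervals (start, end) spanned by r consecutive gaps between sorted markers."""
    pos = sorted(positions)
    return [(pos[i], pos[i + r]) for i in range(len(pos) - r)]


def conditional_r_scan_lengths(a_positions, b_positions, r, keep):
    """Lengths of A r-scans whose interval holds a B count accepted by keep().

    keep is a predicate on the number of B markers in [start, end], for
    example `lambda n: n >= m` (condition i) or `lambda n: n <= l` (condition ii).
    """
    b = sorted(b_positions)
    lengths = []
    for start, end in r_scans(a_positions, r):
        b_count = bisect_right(b, end) - bisect_left(b, start)
        if keep(b_count):
            lengths.append(end - start)
    return lengths


a = [10, 20, 25, 60, 100]
b = [12, 15, 22, 90]
many_b = conditional_r_scan_lengths(a, b, r=2, keep=lambda n: n >= 2)
few_b = conditional_r_scan_lengths(a, b, r=2, keep=lambda n: n <= 1)
print(min(many_b), max(few_b))
```

The minimum of the retained lengths points at clustering of Marker A and the maximum at over-dispersion; judging whether either extreme is significant is what the limiting distributions in the paper are for, and is not attempted here.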
56
Abstract
Scan statistics are applied to combine information on multiple contiguous genetic markers used in a genome screen for susceptibility loci. This information may be, for example, allele sharing proportions for sib pairs or logarithm of odds (lod) scores in general small families. We focus on a dichotomous outcome variable, for example, case and control individuals or affected-affected versus affected-unaffected siblings, and suitable single-marker statistics. A significant scan statistic based on the single-marker statistics represents evidence of the presence of a susceptibility gene. For a given length of the scan statistic, we assess its significance by Monte Carlo permutation tests. Comparing P values for varying lengths of scan statistics, we treat the smallest observed P value as our statistic of interest and determine its overall significance level. We applied this method to a genome screen with autism families. The result was informative and surprising: A susceptibility region was found (genome-wide significance level, P = 0.038), which is missed with conventional approaches.
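The procedure outlined in this abstract (scan statistics of several lengths, permutation P values for each length, then the smallest P value treated as the statistic of interest) might be sketched as follows. The single-marker statistic used here, the absolute difference in mean marker score between cases and controls, is a stand-in chosen for brevity, and all function names and the default `n_perm` are assumptions.

```python
import random


def marker_stats(scores, labels):
    """Per-marker |mean(cases) - mean(controls)|; scores is individuals x markers."""
    cases = [row for row, y in zip(scores, labels) if y == 1]
    controls = [row for row, y in zip(scores, labels) if y == 0]
    return [
        abs(sum(r[j] for r in cases) / len(cases)
            - sum(r[j] for r in controls) / len(controls))
        for j in range(len(scores[0]))
    ]


def scan_stat(stats, length):
    """Largest sum of `length` contiguous single-marker statistics."""
    return max(sum(stats[i:i + length]) for i in range(len(stats) - length + 1))


def min_p_scan_test(scores, labels, lengths, n_perm=200, seed=0):
    """Return (P value per scan length, overall P value of the smallest one)."""
    rng = random.Random(seed)

    def all_scans(lab):
        stats = marker_stats(scores, lab)
        return [scan_stat(stats, length) for length in lengths]

    observed = all_scans(labels)
    shuffled = list(labels)
    permuted = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break the link between outcome and markers
        permuted.append(all_scans(shuffled))

    def p_values(values):
        return [
            (1 + sum(p[k] >= values[k] for p in permuted)) / (n_perm + 1)
            for k in range(len(lengths))
        ]

    observed_p = p_values(observed)
    smallest = min(observed_p)
    # Null distribution of the smallest P value, reusing the same permutations.
    null_smallest = [min(p_values(p)) for p in permuted]
    overall = (1 + sum(m <= smallest for m in null_smallest)) / (n_perm + 1)
    return observed_p, overall
```

Reusing one set of permutations for both levels keeps the cost at roughly `n_perm` squared comparisons per scan length; each permuted replicate is compared against a set that includes itself, a small bias that a more careful implementation would remove.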
Affiliation(s)
- J Hoh
- Laboratory of Statistical Genetics, Rockefeller University, New York, NY 10021, USA
57
Chen C, Karlin S. $r$-scan statistics of a marker array in multiple sequences derived from a common progenitor. Ann Appl Probab 2000. [DOI: 10.1214/aoap/1019487507] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 11/19/2022]
58
Ponomarenko JV, Orlova GV, Ponomarenko MP, Lavryushev SV, Frolov AS, Zybova SV, Kolchanov NA. SELEX_DB: an activated database on selected randomized DNA/RNA sequences addressed to genomic sequence annotation. Nucleic Acids Res 2000; 28:205-8. [PMID: 10592226 PMCID: PMC102392 DOI: 10.1093/nar/28.1.205] [Citation(s) in RCA: 16] [Impact Index Per Article: 0.6] [Received: 09/01/1999] [Revised: 09/10/1999] [Accepted: 09/30/1999] [Indexed: 11/13/2022]
Abstract
SELEX_DB is a novel curated database on selected randomized DNA/RNA sequences designed for accumulation of experimental data on functional site sequences obtained by using SELEX and SELEX-like technologies from the pools of random sequences. This database also contains the programs for DNA/RNA functional site recognition within arbitrary nucleotide sequences. The first release of SELEX_DB has been installed under SRS and is available through the WWW at http://wwwmgs.bionet.nsc.ru/mgs/systems/selex/
Affiliation(s)
- J V Ponomarenko
- Laboratory of Theoretical Genetics, Institute of Cytology, 10 Lavrentyev Avenue, Novosibirsk 630090, Russia.
59
Chopra S, Brendel V, Zhang J, Axtell JD, Peterson T. Molecular characterization of a mutable pigmentation phenotype and isolation of the first active transposable element from Sorghum bicolor. Proc Natl Acad Sci U S A 1999; 96:15330-5. [PMID: 10611384 PMCID: PMC24819 DOI: 10.1073/pnas.96.26.15330] [Citation(s) in RCA: 87] [Impact Index Per Article: 3.3] [Indexed: 11/18/2022]
Abstract
Accumulation of red phlobaphene pigments in sorghum grain pericarp is under the control of the Y gene. A mutable allele of Y, designated as y-cs (y-candystripe), produces a variegated pericarp phenotype. Using probes from the maize p1 gene that cross-hybridize with the sorghum Y gene, we isolated the y-cs allele containing a large insertion element. Our results show that the Y gene is a member of the MYB-transcription factor family. The insertion element, named Candystripe1 (Cs1), is present in the second intron of the Y gene and shares features of the CACTA superfamily of transposons. Cs1 is 23,018 bp in size and is bordered by 20-bp terminal inverted repeat sequences. It generated a 3-bp target site duplication upon insertion within the Y gene and excised from y-cs, leaving a 2-bp footprint in two cases analyzed. Reinsertion of the excised copy of Cs1 was identified by Southern hybridization in the genome of each of seven red pericarp revertant lines tested. Cs1 is the first active transposable element isolated from sorghum. Our analysis suggests that Cs1-homologous sequences are present in low copy number in sorghum and other grasses, including sudangrass, maize, rice, teosinte, and sugarcane. The low copy number and high transposition frequency of Cs1 imply that this transposon could prove to be an efficient gene isolation tool in sorghum.
Affiliation(s)
- S Chopra
- Department of Zoology, Iowa State University, Ames, IA 50011, USA
60
Abstract
This paper presents a survey of currently available mathematical models and algorithmic methods for identifying promoter sequences. The methods concern both searching in a genome for a previously defined consensus and extracting a consensus from a set of sequences. Such methods were often tailored for either eukaryotes or prokaryotes, although this does not preclude use of the same method for both types of organisms. The survey therefore covers all methods; however, emphasis is placed on prokaryotic promoter sequence identification. Illustrative applications of the main extracting algorithms are given for three bacteria.
Affiliation(s)
- A Vanet
- Institut de biologie physico-chimique, Paris, France
61
62
Abstract
We review concepts and methods for comparative analysis of complete genomes including assessments of genomic compositional contrasts based on dinucleotide and tetranucleotide relative abundance values, identifications of rare and frequent oligonucleotides, evaluations and interpretations of codon biases in several large prokaryotic genomes, and characterizations of compositional asymmetry between the two DNA strands in certain bacterial genomes. The discussion also covers means for identifying alien (e.g. laterally transferred) genes and detecting potential specialization islands in bacterial genomes.
Affiliation(s)
- S Karlin
- Department of Mathematics, Stanford University, California 94305-2125, USA
63
Rocha EP, Viari A, Danchin A. Oligonucleotide bias in Bacillus subtilis: general trends and taxonomic comparisons. Nucleic Acids Res 1998; 26:2971-80. [PMID: 9611243 PMCID: PMC147636 DOI: 10.1093/nar/26.12.2971] [Citation(s) in RCA: 67] [Impact Index Per Article: 2.5] [Indexed: 02/07/2023]
Abstract
We present a general analysis of oligonucleotide usage in the complete genome of Bacillus subtilis. Several datasets were built in order to assign various biological contexts to the biased use of words and to reveal local asymmetries in word usage that may be coupled with replication, the control of gene expression and the restriction/modification system. This analysis was complemented by cross-comparisons with the complete genomes of Escherichia coli, Haemophilus influenzae and Methanococcus jannaschii. We have observed a large number of biased oligonucleotides for words of size up to 8, throughout the datasets and species, indicating that such long strict words play an important role as biological signals. We speculate that some of them are involved in interactions with DNA and/or RNA polymerases. An extensive analysis of palindrome abundances and distributions provides the surprising result that prophage-like elements embedded in the genome exhibit a smaller avoidance of restriction sites. This may reinforce a recently proposed hypothesis of a selfish gene phenomenon in the transfer of restriction/modification systems in bacteria.
Affiliation(s)
- E P Rocha
- Atelier de BioInformatique, Université Paris VI, 12 Rue Cuvier, 75005 Paris, France.
64
Daeyaert F, Moereels H, Lewi PJ. Classification and identification of proteins by means of common and specific amino acid n-tuples in unaligned sequences. Comput Methods Programs Biomed 1998; 56:221-233. [PMID: 9725648 DOI: 10.1016/s0169-2607(98)00031-5] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.2] [Indexed: 05/22/2023]
Abstract
Unaligned amino acid sequences can be characterized by their composition of amino acid n-tuples (i.e. doublets, triplets, quadruplets, etc.). In this study we investigated the performance of two statistics, termed commonality and specificity, that are derived from n-tuple counts using a set of G-protein coupled receptor (GPCR) sequences. The commonality of a tuple is defined as its relative occurrence in the sequences that belong to a given GPCR subtype. The specificity of a tuple is derived from its relative occurrence in the sequences of a given GPCR subtype and from its relative non-occurrence in the sequences that do not belong to this subtype. A graphical presentation, termed 'polygram', is described for the visualization of common and specific tuples. The method can be applied to the classification of unknown GPCR sequences. It can also be applied to the identification of fragments of GPCRs, such as may occur in chimeric receptors. The method is generally applicable to other protein families and other types of coding.
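The two statistics can be illustrated with a small Python sketch. The abstract defines commonality as relative occurrence within a subtype but does not give the formula for specificity, so the product used below (occurrence inside the subtype times non-occurrence outside it) is only one plausible reading, and the function names are hypothetical.

```python
def ntuples(sequence, n):
    """Set of all contiguous n-tuples (n-mers) found in a sequence."""
    return {sequence[i:i + n] for i in range(len(sequence) - n + 1)}


def commonality(tup, sequences):
    """Fraction of the given sequences that contain the tuple at least once."""
    if not sequences:
        return 0.0
    return sum(tup in seq for seq in sequences) / len(sequences)


def specificity(tup, members, non_members):
    """Assumed form: occurrence in the subtype times non-occurrence elsewhere."""
    return commonality(tup, members) * (1.0 - commonality(tup, non_members))


def most_specific(members, non_members, n, top=3):
    """Tuples seen in the subtype, ranked by the assumed specificity score."""
    candidates = set().union(*(ntuples(seq, n) for seq in members))
    ranked = sorted(candidates,
                    key=lambda t: (-specificity(t, members, non_members), t))
    return ranked[:top]


subtype = ["MKLV", "AKLV"]
others = ["MKAA", "GGGG"]
print(most_specific(subtype, others, n=2))
```

An unknown sequence or fragment could then be scored against each subtype by how many of that subtype's high-specificity tuples it contains, which is the spirit of the classification use described above.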
Affiliation(s)
- F Daeyaert
- Center for Molecular Design, Janssen Research Foundation, Vosselaar, Belgium
65
66
Khazak V, Estojak J, Cho H, Majors J, Sonoda G, Testa JR, Golemis EA. Analysis of the interaction of the novel RNA polymerase II (pol II) subunit hsRPB4 with its partner hsRPB7 and with pol II. Mol Cell Biol 1998; 18:1935-45. [PMID: 9528765 PMCID: PMC121423 DOI: 10.1128/mcb.18.4.1935] [Citation(s) in RCA: 49] [Impact Index Per Article: 1.8] [Received: 08/25/1997] [Accepted: 01/26/1998] [Indexed: 02/07/2023]
Abstract
Under conditions of environmental stress, prokaryotes and lower eukaryotes such as the yeast Saccharomyces cerevisiae selectively utilize particular subunits of RNA polymerase II (pol II) to alter transcription to patterns favoring survival. In S. cerevisiae, a complex of two such subunits, RPB4 and RPB7, preferentially associates with pol II during stationary phase; of these two subunits, RPB4 is specifically required for survival under nonoptimal growth conditions. Previously, we have shown that RPB7 possesses an evolutionarily conserved human homolog, hsRPB7, which was capable of partially interacting with RPB4 and the yeast transcriptional apparatus. Using this as a probe in a two-hybrid screen, we have now established that hsRPB4 is also conserved in higher eukaryotes. In contrast to hsRPB7, hsRPB4 has diverged so that it no longer interacts with yeast RPB7, although it partially complements rpb4- phenotypes in yeast. However, hsRPB4 associates strongly and specifically with hsRPB7 when expressed in yeast or in mammalian cells and copurifies with intact pol II. hsRPB4 expression in humans parallels that of hsRPB7, supporting the idea that the two proteins may possess associated functions. Structure-function studies of hsRPB4-hsRPB7 are used to establish the interaction interface between the two proteins. This identification completes the set of human homologs for RNA pol II subunits defined in yeast and should provide the basis for subsequent structural and functional characterization of the pol II holoenzyme.
Affiliation(s)
- V Khazak
- Division of Basic Sciences, Fox Chase Cancer Center, Philadelphia, Pennsylvania 19111, USA
67
Zhang C, Cornette JL, Berzofsky JA, DeLisi C. The organization of human leucocyte antigen class I epitopes in HIV genome products: implications for HIV evolution and vaccine design. Vaccine 1997; 15:1291-302. [PMID: 9302734 DOI: 10.1016/s0264-410x(97)00040-6] [Citation(s) in RCA: 17] [Impact Index Per Article: 0.6] [Indexed: 02/05/2023]
Abstract
Knowledge of human leucocyte antigen (HLA) peptide binding motifs permits rapid selection of candidate viral protein fragments for induction of T cell-mediated immunity. A search for HLA class I peptide binding motifs in structural proteins of human immunodeficiency virus (HIV) of different genetic lineages provides a map of the genetic organization of potential T cell antigenic sites, and at the same time identifies all motifs in highly conserved regions of HIV-1 env, gag and pol. The density of motifs is anomalous at both the high and low end of the spectrum: local organization is characterized by clustering in relatively short regions, while large scale organization is characterized by anomalously long runs between motifs. The former is expected simply due to the fact that motifs often have overlapping anchor residue sets. A detailed statistical analysis of the latter, however, shows that the length of the runs cannot be accounted for by chance alone. Although motif clusters show no preference to be in either conserved or variable regions, low motif density stretches occur preferentially in variable portions of the protein sequence, which suggests that the virus may be mutating to evade the cellular arm of the immune system.
Affiliation(s)
- C Zhang
- Department of Biomedical Engineering, Boston University, MA 02215, USA
68
Karlin S, Mrázek J, Campbell AM. Compositional biases of bacterial genomes and evolutionary implications. J Bacteriol 1997; 179:3899-913. [PMID: 9190805 PMCID: PMC179198 DOI: 10.1128/jb.179.12.3899-3913.1997] [Citation(s) in RCA: 329] [Impact Index Per Article: 11.8] [Indexed: 02/04/2023]
Abstract
We compare and contrast genome-wide compositional biases and distributions of short oligonucleotides across 15 diverse prokaryotes that have substantial genomic sequence collections. These include seven complete genomes (Escherichia coli, Haemophilus influenzae, Mycoplasma genitalium, Mycoplasma pneumoniae, Synechocystis sp. strain PCC6803, Methanococcus jannaschii, and Pyrobaculum aerophilum). A key observation concerns the constancy of the dinucleotide relative abundance profiles over multiple 50-kb disjoint contigs within the same genome. (The profile is $\rho^*_{XY} = f^*_{XY} / (f^*_X f^*_Y)$ for all XY, where $f^*_X$ denotes the frequency of the nucleotide X and $f^*_{XY}$ denotes the frequency of the dinucleotide XY, both computed from the sequence concatenated with its inverted complementary sequence.) On the basis of this constancy, we refer to the collection $\{\rho^*_{XY}\}$ as the genome signature. We establish that the differences between $\{\rho^*_{XY}\}$ vectors of 50-kb sample contigs of different genomes virtually always exceed the differences between those of the same genomes. Various di- and tetranucleotide biases are identified. In particular, we find that the dinucleotide CpG (CG) is underrepresented in many thermophiles (e.g., M. jannaschii, Sulfolobus sp., and M. thermoautotrophicum) but overrepresented in halobacteria. TA is broadly underrepresented in prokaryotes and eukaryotes, but normal counts appear in Sulfolobus and P. aerophilum sequences. More than for any other bacterial genome, palindromic tetranucleotides are underrepresented in H. influenzae. The M. jannaschii sequence is unprecedented in its extreme underrepresentation of CTAG tetranucleotides and in the anomalous distribution of CTAG sites around the genome. Comparative analysis of numbers of long tetranucleotide microsatellites distinguishes H. influenzae. Dinucleotide relative abundance differences between bacterial sequences are compared. For example, in these assessments of differences, the cyanobacteria Synechocystis, Synechococcus, and Anabaena do not form a coherent group and are as far from each other as general gram-negative sequences are from general gram-positive sequences. The difference of M. jannaschii from low-G+C gram-positive proteobacteria is one-half of the difference from gram-negative proteobacteria. Interpretations and hypotheses center on the role of the genome signature in highlighting similarities and dissimilarities across different classes of prokaryotic species, possible mechanisms underlying the genome signature, the form and level of genome compositional flux, the use of the genome signature as a chronometer of molecular phylogeny, and implications with respect to the three putative eubacterial, archaeal, and eukaryote domains of life and to the origin and early evolution of eukaryotes.
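The dinucleotide relative abundance profile defined in this abstract (the observed dinucleotide frequency divided by the product of its two nucleotide frequencies, with both strands pooled) can be computed with a short Python sketch. Two choices below are assumptions rather than details from the abstract: dinucleotides are counted on each strand separately instead of across the junction of a literal concatenation, and the difference between two signatures is taken as the mean absolute difference over the 16 dinucleotides. The sketch also assumes the input contains only A, C, G and T.

```python
from collections import Counter
from itertools import product

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Inverted complementary sequence of an A/C/G/T string."""
    return seq.translate(_COMPLEMENT)[::-1]


def genome_signature(seq):
    """Map each dinucleotide XY to rho*_XY = f*_XY / (f*_X * f*_Y)."""
    seq = seq.upper()
    mono, di = Counter(), Counter()
    for strand in (seq, reverse_complement(seq)):  # pool both strands
        mono.update(strand)
        di.update(strand[i:i + 2] for i in range(len(strand) - 1))
    n_mono, n_di = sum(mono.values()), sum(di.values())
    rho = {}
    for x, y in product("ACGT", repeat=2):
        f_x, f_y = mono[x] / n_mono, mono[y] / n_mono
        f_xy = di[x + y] / n_di
        # A nucleotide missing from the sequence leaves rho undefined.
        rho[x + y] = f_xy / (f_x * f_y) if f_x and f_y else float("nan")
    return rho


def signature_difference(rho_f, rho_g):
    """Assumed distance: mean absolute difference over the 16 dinucleotides."""
    return sum(abs(rho_f[k] - rho_g[k]) for k in rho_f) / len(rho_f)


rho = genome_signature("AACGTGCATTGACCGGTA")
print(round(rho["CG"], 3), round(rho["TA"], 3))
```

Because both strands are pooled, each dinucleotide receives the same value as its reverse complement (for example AC and GT), so only 10 of the 16 values are distinct. Comparing 50-kb windows within and between genomes with `signature_difference` would follow the kind of assessment described above.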
Affiliation(s)
- S Karlin
- Department of Mathematics, Stanford University, California 94305-2125, USA
69
Graul RC, Sadée W. Sequence alignments of the H(+)-dependent oligopeptide transporter family PTR: inferences on structure and function of the intestinal PET1 transporter. Pharm Res 1997; 14:388-400. [PMID: 9144720 DOI: 10.1023/a:1012070726480] [Citation(s) in RCA: 22] [Impact Index Per Article: 0.8] [Indexed: 02/04/2023]
Abstract
PURPOSE: To study the structure and function of the intestinal H+/peptide transporter PET1, we compared its amino acid sequence with those of related transporters belonging to the oligopeptide transporter family PTR, and with more distant transporter families. METHODS: We have developed a new approach to the sequence analysis of proteins with multiple transmembrane domains (TMDs) which takes into account the repeated TMD-loop topology. In addition to conventional analyses of the entire sequence, each TMD and its adjacent loop residues (= TMD segments) were analyzed separately as independent structural units. In combination with hydropathy analysis, this approach reveals any changes in the order of the TMD segments in the primary structure and permits TMD alignments among divergent structures even if rearrangements of the order of TMD segments have occurred in the course of evolution. RESULTS: Alignments of TMD segments indicate that the TMD order in PTR transporters may have changed in the process of evolution. Consideration of such changes permits the alignment of homologous TMD segments from PTR transporters belonging to distant akaryotic and eukaryotic phyla. Multiple alignments of TMDs reveal several highly conserved regions that may play a role in transporter function. In comparing the PTR transporters with other transporter gene families, alignment scores using the entire primary structure are too low to support a finding of probable homology. However, statistically significant alignments were observed among individual TMD segments if one disregards the order in which they occur in the primary structure. CONCLUSIONS: Our results support the hypothesis that the PTR transporters may have evolved by rearrangement, duplication, or insertions and deletions of TMD segments as independent modules. This modular structure suggests new alignment strategies for determining functional domains and testing relationships among distant transporter families.
Affiliation(s)
- R C Graul
- Department of Biopharmaceutical Sciences, University of California, San Francisco 94143-0446, USA
70
Tyler KD, Wang G, Tyler SD, Johnson WM. Factors affecting reliability and reproducibility of amplification-based DNA fingerprinting of representative bacterial pathogens. J Clin Microbiol 1997; 35:339-46. [PMID: 9003592 PMCID: PMC229576 DOI: 10.1128/jcm.35.2.339-346.1997] [Citation(s) in RCA: 203] [Impact Index Per Article: 7.3] [Indexed: 02/03/2023]
Affiliation(s)
- K D Tyler
- Bureau of Microbiology, Laboratory Centre for Disease Control, Health Canada, Ottawa, Ontario
71
Abstract
Strategies are described for constructing pharmacophoric 3D database queries, based on a series of active and inactive analogs. The results are highly selective database queries, which are consistent with the generally accepted pharmacophore for a number of systems. The foundation of these strategies is the method of Mayer, Naylor, Motoc and Marshall [J. Comput.-Aided Mol. Design, 1 (1987) 3] for inferring a unique binding geometry for angiotensin-converting enzyme (ACE) inhibitors. The strategies described here generalize their approach to cases where the chemical features responsible for binding are not a priori apparent, and to cases where the binding geometry deduced by that method is not unique. The key new insight, the selectivity principle, is to rank the multiple solutions produced by the method of Mayer et al. by their selectivity, a value that is related to the proportion of a database that is returned as a database hit list. Retrospective analyses are described for D2-antagonists, ACE inhibitors, fibrinogen antagonists, and beta 2-antagonists.
Affiliation(s)
- J H Van Drie
- Pharmacia & Upjohn Inc., Kalamazoo, MI 49001, USA
72
Caetano-Anollés G. Scanning of nucleic acids by in vitro amplification: new developments and applications. Nat Biotechnol 1996; 14:1668-74. [PMID: 9634849 DOI: 10.1038/nbt1296-1668] [Citation(s) in RCA: 24] [Impact Index Per Article: 0.8] [Indexed: 02/07/2023]
Abstract
Nucleic acids can be characterized using a variety of "fingerprinting" techniques usually based on nucleic acid hybridization or enzymatic amplification. The scanning of nucleic acids by amplification with arbitrary oligonucleotide primers has become popular because it can generate simple-to-complex patterns from anonymous DNA or RNA templates without requiring prior knowledge of nucleic acid sequence or cloned or characterized probes. Discrete loci are amplified within genomic DNA, DNA complementary to mRNA populations (cDNA), cloned DNA fragments, and even PCR products. The potential and limitations of the various genome scanning techniques, novel improvements, and their recent use in comparative and experimental biology applications, including the analysis of plant and bacterial genomes, are discussed.
Affiliation(s)
- G Caetano-Anollés
- Department of Ornamental Horticulture and Landscape Design, University of Tennessee, Knoxville 37901-1071, USA
73
Karlin S, Zhu ZY. Characterizations of diverse residue clusters in protein three-dimensional structures. Proc Natl Acad Sci U S A 1996; 93:8344-9. [PMID: 8710873 PMCID: PMC38673 DOI: 10.1073/pnas.93.16.8344] [Citation(s) in RCA: 43] [Impact Index Per Article: 1.5] [Indexed: 02/01/2023]
Abstract
We present new methods for identifying and analyzing statistically significant residue clusters that occur in three-dimensional (3D) protein structures. Residue clusters of different kinds occur in many contexts. They often feature the active site (e.g., in substrate binding), the interface between polypeptide units of protein complexes, regions of protein-protein and protein-nucleic acid interactions, or regions of metal ion coordination. The methods are illustrated with 3D clusters centering on four themes. (i) Acidic or histidine-acidic clusters associated with metal ions. (ii) Cysteine clusters including coordination of metals such as zinc or iron-sulfur structures, cysteine knots prominent in growth factors, multiple sets of buried disulfide pairings that putatively nucleate the hydrophobic core, or cysteine clusters of mostly exposed disulfide bridges. (iii) Iron-sulfur proteins and charge clusters. (iv) 3D environments of multiple histidine residues. Study of diverse 3D residue clusters offers a new perspective on protein structure and function. The algorithms can aid in rapid identification of distinctive sites, suggest correlations among protein structures, and serve as a tool in the analysis of new structures.
Affiliation(s)
- S Karlin
- Department of Mathematics, Stanford University, CA 94305-2125, USA
|
74
|
Facchiano A. Coding in noncoding frames. Trends Genet 1996; 12:168-9. [PMID: 8984730 DOI: 10.1016/0168-9525(96)20004-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/03/2023]
|
75
|
Thiem SM, Du X, Quentin ME, Berner MM. Identification of baculovirus gene that promotes Autographa californica nuclear polyhedrosis virus replication in a nonpermissive insect cell line. J Virol 1996; 70:2221-9. [PMID: 8642646 PMCID: PMC190062 DOI: 10.1128/jvi.70.4.2221-2229.1996] [Citation(s) in RCA: 58] [Impact Index Per Article: 2.0] [Indexed: 02/01/2023] Open
Abstract
A gene that promotes Autographa californica M nuclear polyhedrosis virus (AcMNPV) replication in IPLB-Ld652Y cells, a cell line that is nonpermissive for AcMNPV, was identified in Lymantria dispar M nuclear polyhedrosis virus (LdMNPV). Cotransfection of AcMNPV DNA and a plasmid carrying the LdMNPV gene into IPLB-Ld652Y cells results in AcMNPV replication. The gene maps between 43.3 and 43.8 map units on the 162-kbp genome of LdMNPV. It comprises a 218-codon open reading frame and encodes a polypeptide with a predicted molecular mass of 25.7 kDa. The predicted polypeptide is glutamic acid and valine rich and negatively charged, with a pI of 4.61. No protein sequence motifs were identified, and no matches with known nucleotide or peptide sequences were found in the AcMNPV genome or database searches that suggest how this gene might function. A recombinant AcMNPV bearing the LdMNPV gene overcomes a block in protein synthesis observed in AcMNPV-infected IPLB-Ld652Y cells. Using Southern blotting techniques, we were unable to identify a homolog in Orgyia pseudotsugata M nuclear polyhedrosis virus, a baculovirus that is routinely propagated in IPLB-Ld652Y cells. This suggests that the LdMNPV host range is unique among the baculoviruses studied to date. We named this gene hrf-1 (for host range factor 1).
Affiliation(s)
- S M Thiem
- Department of Entomology, Michigan State University, East Lansing, MI 48824-1115, USA
|
76
|
Affiliation(s)
- L Patthy
- Institute of Enzymology, Hungarian Academy of Sciences, Budapest, Hungary
|
77
|
Forsdyke DR. Reciprocal relationship between stem-loop potential and substitution density in retroviral quasispecies under positive Darwinian selection. J Mol Evol 1995; 41:1022-37. [PMID: 8587101 DOI: 10.1007/bf00173184] [Citation(s) in RCA: 18] [Impact Index Per Article: 0.6] [Indexed: 01/31/2023]
Abstract
Nucleic acids have the potential to form intrastrand stem-loops if complementary bases are suitably located. Computer analyses of poliovirus and retroviral RNAs have revealed a reciprocal relationship between "statistically significant" stem-loop potential and "sequence variability." The statistically significant stem-loop potential of a nucleic acid segment has been defined as a function of the difference between the folding energy of the natural segment (FONS) and the mean folding energy of a set of randomized (shuffled) versions of the natural segment (FORS-M). Since FONS is dependent on both base composition and base order, whereas FORS-M is solely dependent on base composition (a genomic characteristic), it follows that statistically significant stem-loop potential (FORS-D) is a function of base order (a local characteristic). In retroviral genomes, as in all DNA genomes studied, positive FORS-D values are widely distributed. Thus there have been pressures on base order both to encode specific functions and to encode stem-loops. As in the case of DNA genomes under positive Darwinian selection pressure, in HIV-1 specific function appears to dominate in rapidly evolving regions. Here high sequence variability, expressed as substitution density (not indel density), is associated with negative FORS-D values (impaired base-order-dependent stem-loop potential). This suggests that in these regions HIV-1 genomes are under positive selection pressure by host defenses. The general function of stem-loops is recombination. This is a vital process if, from among members of viral "quasispecies," functional genomes are to be salvaged. Thus, for rapidly evolving RNA genomes, it is as important to conserve base-order-dependent stem-loop potential as to conserve other functions.
Affiliation(s)
- D R Forsdyke
- Department of Biochemistry, Queen's University, Kingston, Ontario, Canada
|
78
|
Sathe SS, Harte PJ. The extra sex combs protein is highly conserved between Drosophila virilis and Drosophila melanogaster. Mech Dev 1995; 52:225-32. [PMID: 8541211 DOI: 10.1016/0925-4773(95)00403-n] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.1] [Indexed: 01/31/2023]
Abstract
Extra sex combs (esc) is one of the Polycomb Group genes, whose products are required for long term maintenance of the spatially restricted domains of homeotic gene expression initially established by the products of the segmentation genes. We recently showed that the esc protein contains five copies of the WD motif, which in other proteins has been directly implicated in protein-protein interactions. Mutations affecting the WD repeats of the esc protein indicate that they are essential for its function as a repressor of the homeotic genes. We proposed that they may mediate interactions between esc and other Polycomb Group proteins, recruiting them to their target genes, perhaps by additional interactions with transiently expressed repressors such as hunchback. To further investigate the functional importance of the WD motifs and identify other functionally important regions of the esc protein, we have begun to determine its evolutionary conservation by characterizing the esc gene from Drosophila virilis, a distantly related Drosophila species. We show that the esc protein is highly conserved between these species, particularly its WD motifs. Their high degree of conservation, particularly at positions which are not conserved in the WD consensus derived from alignment of all known WD motifs, suggests that each of the WD repeats in the esc protein is functionally specialized and that this specialization has been highly conserved during evolution. Its highly charged N-terminus exhibits the greatest divergence, but even these differences are conservative of its predicted physical properties. These observations suggest that the esc protein is functionally compact, nearly every residue making an important contribution to its function.
Affiliation(s)
- S S Sathe
- Department of Genetics, School of Medicine, Case Western Reserve University, Cleveland, OH 44106-4955, USA
|
79
|
Abstract
Hypothetical Products from Noncoding Frames (i.e., HyPNoFs) are hypothetical, not-coded proteins, translated from alternate reading frames (i.e., coding + 1 and coding + 2) of cDNAs. HyPNoFs of CD4, PKC, oncostatin, bcl-2 proto-oncogene, tumor suppressor p53, cystic fibrosis transmembrane regulator (CFTR), and tumor necrosis factors alpha and beta were searched as query sequences vs the SWISS-PROT data bank. The homology searches revealed that hypothetical products (i.e., HyPNoFs) may share high similarity with real protein products actually coded. Sequence similarity of hypothetical products to real proteins is sometimes very high, suggesting common conformational features, according to the Sander and Schneider cutoff value. This finding supports the hypothesis that eukaryotic DNA, currently considered to be monocistronic, might occasionally have polycistronic regions, carrying different protein messages on overlapping frames. As yet, polycistronic genes have been observed in viral genomes only. The presence of polycistronic regions in eukaryotic genes is likely reminiscent of an ancient strategy, rather than a present feature of the genome in eukaryotes. These data suggest that thorough investigation of HyPNoFs is likely to improve our ability to trace genes' evolution and to investigate structure-function relationships of protein and DNA sequences.
Affiliation(s)
- A Facchiano
- Raggio Italgene S.p.A., Pomezia, Roma, Italy
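The alternate-frame translation described in this abstract can be illustrated with a minimal Python sketch. It assumes the standard genetic code and an in-frame cDNA coding sequence as input; the function names `translate` and `alternate_frame_products` are hypothetical and not taken from the paper, and the database search step is not shown.

```python
from itertools import product

# Standard genetic code, built from the conventional T, C, A, G codon ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(seq, offset=0):
    """Translate seq from the given offset; unknown codons become 'X'."""
    seq = seq.upper().replace("U", "T")
    # Only complete codons are translated; a trailing partial codon is ignored.
    codons = (seq[i:i + 3] for i in range(offset, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def alternate_frame_products(cds):
    """Return the hypothetical products of the coding+1 and coding+2 frames."""
    return {"coding+1": translate(cds, 1), "coding+2": translate(cds, 2)}


if __name__ == "__main__":
    print(alternate_frame_products("ATGGCCAAATGA"))
```

Stop codons are kept as `*` in the output so that a caller can decide whether to split the hypothetical product at stops before searching a protein database.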
|
80
|
Abstract
I discuss three recent developments in sequence analysis by the statistical method of scores. First is the identification of segments of high aggregate score in a single protein sequence. Charge clusters and hyper-charge runs are prime examples. Proteins containing hyper-charge runs are principally associated with DNA and RNA processing, chromatin structure, ion storage and exchange, and protein complex assembly. Second is the protein sequence comparisons identifying common segments having high total similarity scores. These are illustrated by comparisons within the family of prokaryotic heat shock 70 kDa proteins. Third is the scoring protocols applied to the inverse folding problem.
Affiliation(s)
- S Karlin
- Department of Mathematics, Stanford University, CA 94305-2125, USA
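The first development named in this abstract, finding a segment of high aggregate score in a single sequence, can be sketched as a single linear scan. The scoring scheme below (+2 for charged residues D, E, K, R and -1 otherwise) is an illustrative assumption, not the scoring used in the paper, and `max_scoring_segment` is a hypothetical name. The sketch assumes scores that are negative on average, so that high-scoring segments stay localized.

```python
def max_scoring_segment(seq, scores, default=-1):
    """Return (score, start, end) of the highest-scoring contiguous segment.

    `end` is exclusive, so the segment is seq[start:end].
    Returns (0, 0, 0) when no residue has a positive score.
    """
    best = (0, 0, 0)
    run_score, run_start = 0, 0
    for i, residue in enumerate(seq):
        # A run whose total has dropped to zero or below cannot help
        # any later segment, so restart the run at this position.
        if run_score <= 0:
            run_score, run_start = 0, i
        run_score += scores.get(residue, default)
        if run_score > best[0]:
            best = (run_score, run_start, i + 1)
    return best


# Illustrative (assumed) scores for locating a charge cluster.
CHARGE_SCORES = {"D": 2, "E": 2, "K": 2, "R": 2}

if __name__ == "__main__":
    protein = "GASGKEDKRGLLA"
    score, start, end = max_scoring_segment(protein, CHARGE_SCORES)
    print(score, protein[start:end])
```

Assessing whether such a segment is statistically significant needs the distributional theory the abstract refers to, which this sketch does not attempt.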
|
81
|
Miller WJ, Paricio N, Hagemann S, Martínez-Sebastián MJ, Pinsker W, de Frutos R. Structure and expression of clustered P element homologues in Drosophila subobscura and Drosophila guanche. Gene 1995; 156:167-74. [PMID: 7758953 DOI: 10.1016/0378-1119(95)00013-v] [Citation(s) in RCA: 24] [Impact Index Per Article: 0.8] [Indexed: 01/27/2023]
Abstract
Sequence relationships and functional aspects were analysed in the P element homologues of Drosophila subobscura (Ds) and D. guanche (Dg). In both species, the P homologues are clustered at a single genomic position. They lack the characteristic terminal structures of actively transposing P elements, but they have the coding capacity for a 66-kDa 'repressor-like' protein. Two different types of cluster units (G-type and A-type) can be distinguished. The A-type unit, which is present in multiple copies, is transcribed in adult flies. In contrast, the G-type unit has a much lower copy number and is apparently not expressed. In Dg, the isolated G-type sequence carries a 420-bp insertion in the promoter region, which is probably responsible for inactivation. Sequence comparisons of different cluster units show that differentiation of the two types precedes the lineage split of these species. Substitution rates of the deduced proteins reveal two distinct subregions: high variability at the N terminus and strong sequence conservation in the rest of the protein. The variable region contains motifs characteristic of DNA-binding proteins. Adaptive diversification of the cluster units towards specific binding properties might be a plausible explanation for variability in the N-termini. Both unit types have lost the weak promoter region characteristic of P transposons. In the A-type unit, a new promoter has been formed which is apparently composed of parts of insertion sequences derived from two different mobile elements.(ABSTRACT TRUNCATED AT 250 WORDS)
Affiliation(s)
- W J Miller
- Institut für Allgemeine Biologie, Abteilung Genetik, Medizinische Fakultät der Universität Wien, Austria
|
82
|
Abstract
The identification of protein sequences that fold into certain known three-dimensional (3D) structures, or motifs, is evaluated through a probabilistic analysis of their one-dimensional (1D) sequences. We present a correlation method that runs in linear time and incorporates pairwise dependencies between amino acid residues at multiple distances to assess the conditional probability that a given residue is part of a given 3D structure. This method is generalized to multiple motifs, where a dynamic programming approach leads to an efficient algorithm that runs in linear time for practical problems. By this approach, we were able to distinguish (2-stranded) coiled-coil from non-coiled-coil domains and globins from nonglobins. When tested on the Brookhaven X-ray crystal structure database, the method does not produce any false-positive or false-negative predictions of coiled coils.
Affiliation(s)
- B Berger
- Mathematics Department, Massachusetts Institute of Technology, Cambridge 02139, USA
|
83
|
Abstract
We determined the amino acid composition of proteins of Sp2 hybridoma cells by a procedure which assembles the information on the polypeptides upon two-dimensional gel electrophoresis, such that biosynthetic labeling with 20 different 3H amino acids provides the data (spot intensities) on the relative representation of the detected polypeptides. The gels were impregnated with 2,5-diphenyloxazole (PPO) and suitably exposed radiofluorographs were selected for analysis. The images originating from the 12 cultures labeled with amino acids R, A, H, I, L, K, M, F, P, S, T and Y were analysed with the Kepler image analysis system. The spot volume data of the 12 analysed patterns were corrected for the unequal labeling efficiencies of the 3H amino acids and for the various exposure times. This correction is performed by applying calibration factors based on the amino acid determination of a hydrolysate of the analysed cells. After the calibration step was applied to the data files we used the amino acid compositions of nine proteins taken from a database to establish for each of these proteins the correlation coefficients with the image analysis derived amino acid compositions of all spots. The correlation coefficients allow us to tentatively identify polypeptide spots on two-dimensional gels, while the amino acid composition of 350 investigated two-dimensional gel spots is usable as an identification tag in the gene retrieval from our cDNA libraries.
Affiliation(s)
- J R Frey
- Basel Institute for Immunology, Switzerland
|
84
|
Abstract
There is no clear picture to date of the mechanisms determining nucleosome positioning. Generally, local DNA sequence signals (sequence-dependent positioning) or non-local signals (e.g. boundary effects) are possible. We have analyzed the DNA sequences of a series of positioned and mapped nucleosome cores in a systematic search for local sequence signals. The data set consists of 113 mapped nucleosome cores, mapped in vivo, in situ, or in reconstituted chromatin. The analysis focuses on the periodic distribution of sequence elements implied by each of six different published DNA structural models. We have also investigated the periodic distribution of all mono-, di-, and trinucleotides. An identical analysis was performed on a set of isolated chicken nucleosome cores (nucleosome data from the literature) that are presumably positioned due to local sequence signals. The results show that the sequences of the isolated nucleosome cores have a number of characteristic features that distinguish them clearly from randomly chosen reference DNA. This confirms that the positioning of these nucleosomes is mainly sequence-dependent (i.e., dependent on local octamer-DNA interactions) and that our algorithms are able to detect these patterns. Using the same algorithms, the sequences of the mapped nucleosome cores, however, are on average very similar to randomly chosen reference DNA. This suggests that the position of the majority of these nucleosomes can not be attributed to the sequence patterns implemented in our algorithms. The arrangement of positioned nucleosomes seems to be the result of a dynamic interplay of octamer-DNA interactions, nucleosome-nucleosome interactions and other positioning signals with varying relative contributions along the DNA.
Affiliation(s)
- H Staffelbach
- Theoretical Biology and Biophysics Group, Los Alamos National Laboratory, NM 87545
|
85
|
Wootton JC. Non-globular domains in protein sequences: automated segmentation using complexity measures. Comput Chem 1994; 18:269-85. [PMID: 7952898 DOI: 10.1016/0097-8485(94)85023-2] [Citation(s) in RCA: 372] [Impact Index Per Article: 12.0] [Indexed: 01/28/2023]
Abstract
Computational methods based on mathematically-defined measures of compositional complexity have been developed to distinguish globular and non-globular regions of protein sequences. Compact globular structures in protein molecules are shown to be determined by amino acid sequences of high informational complexity. Sequences of known crystal structure in the Brookhaven Protein Data Bank differ only slightly from randomly shuffled sequences in the distribution of statistical properties such as local compositional complexity. In contrast, in the much larger body of deduced sequences in the SWISS-PROT database, approximately one quarter of the residues occur in segments of non-randomly low complexity and approximately half of the entries contain at least one such segment. Sequences of proteins with known, physicochemically-defined non-globular regions have been analyzed, including collagens, different classes of coiled-coil proteins, elastins, histones, non-histone proteins, mucins, proteoglycan core proteins and proteins containing long single solvent-exposed alpha-helices. The SEG algorithm provides an effective general method for partitioning the globular and non-globular regions of these sequences fully automatically. This method is also facilitating the discovery of new classes of long, non-globular sequence segments, as illustrated by the example of the human CAN gene product involved in tumor induction.
Affiliation(s)
- J C Wootton
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894
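The idea of a local compositional complexity measure can be illustrated with a simplified Python sketch. It flags sliding windows whose Shannon entropy falls at or below a cutoff. This is not the SEG algorithm itself, which defines its own complexity measures and a multi-stage segmentation; the window length, the threshold, and the function names are illustrative assumptions.

```python
import math
from collections import Counter


def shannon_entropy(segment):
    """Shannon entropy (bits per residue) of the composition of a segment."""
    counts = Counter(segment)
    n = len(segment)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def low_complexity_windows(seq, window=12, threshold=2.2):
    """Return start positions of windows with entropy <= threshold.

    `window` and `threshold` are assumed example values.
    """
    return [
        i
        for i in range(len(seq) - window + 1)
        if shannon_entropy(seq[i:i + window]) <= threshold
    ]


if __name__ == "__main__":
    print(low_complexity_windows("MKTAYIAKQRQQQQQQQQQQQQGHLVDE"))
```

Overlapping flagged windows would normally be merged into segments before masking; that step is left out to keep the sketch short.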
|
86
|
Brendel V, Karlin S. Applications of statistical criteria in protein sequence analysis: case study of yeast RNA polymerase II subunits. Comput Chem 1994; 18:251-3. [PMID: 7952895 DOI: 10.1016/0097-8485(94)85020-8] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.2] [Indexed: 01/28/2023]
Abstract
We have recently proposed statistical techniques to identify unusual protein sequence features. Extensive mapping of these features to particular groups of proteins may afford new ways of protein classification. Here we present a case study of such analysis by discussing special features of the amino acid sequences of yeast RNA polymerase II, the first eukaryotic RNA polymerase for which all subunits have been sequenced. Specific new suggestions derived from this analysis include: (i) based on unusual charge configurations in some of the sequences, electrostatic forces may play a significant role in subunit interactions; (ii) RPB4, on account of similar charge distribution, may well be grouped together with RNA polymerase II transcription initiation factors.
Affiliation(s)
- V Brendel
- Department of Mathematics, Stanford University, CA 94305
|
87
|
Caetano-Anollés G. MAAP: a versatile and universal tool for genome analysis. Plant Mol Biol 1994; 25:1011-26. [PMID: 7919212 DOI: 10.1007/bf00014674] [Citation(s) in RCA: 22] [Impact Index Per Article: 0.7] [Indexed: 05/22/2023]
Abstract
Multiple arbitrary amplicon profiling (MAAP) uses one or more oligonucleotide primers (≥ 5 nt) of arbitrary sequence to initiate DNA amplification and generate characteristic fingerprints from anonymous genomes or DNA templates. MAAP markers can be used in general fingerprinting as well as in mapping applications, either directly or as sequence-characterized amplified regions (SCARs). MAAP profiles can be tailored in the number of monomorphic and/or polymorphic products. For example, multiple endonuclease digestion of template DNA or the use of mini-hairpin primers can enhance detection of polymorphic DNA. Comparison of the expected and actual number of amplification products produced with primers differing in length, sequence and GC content from templates of varying complexity reveals severe departures from theoretical formulations with interesting implications in primer-template interaction. Extensive primer-template mismatching can occur when using templates of low complexity or long primers. Primer annealing and extension appear to be directed by an 8 nt 3'-terminal primer domain, require sites with perfect homology to the first 5-6 nt from the 3' terminus, and involve direct physical interaction between amplicon annealing sites.
Affiliation(s)
- G Caetano-Anollés
- Institute of Agriculture, University of Tennessee, Knoxville 37901-1071
|
88
|
Scherer S, McPeek MS, Speed TP. Atypical regions in large genomic DNA sequences. Proc Natl Acad Sci U S A 1994; 91:7134-8. [PMID: 8041759 PMCID: PMC44353 DOI: 10.1073/pnas.91.15.7134] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.4] [Indexed: 01/28/2023] Open
Abstract
Large genomic DNA sequences contain regions with distinctive patterns of sequence organization. We describe a method using logarithms of probabilities based on seventh-order Markov chains to rapidly identify genomic sequences that do not resemble models of genome organization built from compilations of octanucleotide usage. Data bases have been constructed from Escherichia coli and Saccharomyces cerevisiae DNA sequences of > 1000 nt and human sequences of > 10,000 nt. Atypical genes and clusters of genes have been located in bacteriophage, yeast, and primate DNA sequences. We consider criteria for statistical significance of the results, offer possible explanations for the observed variation in genome organization, and give additional applications of these methods in DNA sequence analysis.
Affiliation(s)
- S Scherer
- Human Genome Center, Lawrence Berkeley Laboratory, Berkeley, CA 94720
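The scoring idea in this abstract, a log-probability under a Markov chain estimated from oligonucleotide counts, can be sketched in Python as follows. With `order=7` the model is built from octanucleotide (8-mer) counts, as the abstract describes. The pseudocount smoothing, the function names, and the omission of the probability of the initial context are assumptions of this sketch rather than details from the paper.

```python
import math
from collections import Counter


def train_counts(sequences, order):
    """Count (order+1)-mers and their order-length contexts in training data."""
    kmer_counts, context_counts = Counter(), Counter()
    for seq in sequences:
        for i in range(len(seq) - order):
            kmer = seq[i:i + order + 1]
            kmer_counts[kmer] += 1
            context_counts[kmer[:-1]] += 1
    return kmer_counts, context_counts


def log_probability(seq, kmer_counts, context_counts, order,
                    pseudocount=1.0, alphabet="ACGT"):
    """Sum of log conditional probabilities of each base given its context.

    The pseudocount (an assumed smoothing choice) keeps unseen k-mers finite.
    """
    total = 0.0
    for i in range(len(seq) - order):
        kmer = seq[i:i + order + 1]
        numerator = kmer_counts[kmer] + pseudocount
        denominator = context_counts[kmer[:-1]] + pseudocount * len(alphabet)
        total += math.log(numerator / denominator)
    return total


if __name__ == "__main__":
    # A first-order model on toy data; order=7 would be used on real genomes.
    kmers, contexts = train_counts(["ACACACACAC"], order=1)
    print(log_probability("ACACAC", kmers, contexts, order=1))
    print(log_probability("AAAAAA", kmers, contexts, order=1))
```

Scoring fixed-length windows along a genome and looking for unusually low values is one way such a function could be used to flag atypical regions; comparing windows of unequal length would require normalizing by the number of scored positions.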
|
89
|
Abstract
For the functional interpretation of genomic sequences, effective algorithms have to be developed that will recognize regions of specific function and thus will suggest experiments for their verification. As a first step, relevant data have to be collected in an appropriate database from which suitable training sets can be extracted. In this paper, I discuss the requirements for a database that collects information about regulatory DNA sequences and describe the structure and contents of such a database (TRANSFAC). This compiled information will serve as a basis for comprehensive analysis of sites that regulate transcription, e.g., by statistical methods. It will thus facilitate the recognition of regulatory genomic sequence information and the assignment of the corresponding regulators. Moreover, it will provide all relevant data about the regulating proteins, which will make it possible to trace transcriptional control cascades back to their origin.
Affiliation(s)
- E Wingender
- Department of Genetics, Gesellschaft für Biotechnologische Forschung, Braunschweig, Germany
|
90
|
Abstract
Nucleotide and amino acid sequences can be analyzed and compared by their oligomer compositions. Such methods are fundamentally different from comparison methods based on sequence alignment. They are analogous to the linguistic analysis of human texts. The methods have a wide range of sensitivity and can identify homologous as well as functionally and taxonomically related sequences. Significant sequence dissimilarity can also be identified enabling detection of foreign DNA sequences in genomes, genetic libraries and databases. The simplicity and speed of linguistic methods make them very suitable for database searching and maintenance and as a preliminary step to more specific and time-consuming analysis methods.
Affiliation(s)
- S Pietrokovski
- Department of Structural Biology, Weizmann Institute of Science, Rehovot, Israel
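An alignment-free comparison by oligomer composition, as described in this abstract, can be sketched in Python. Each sequence is reduced to a profile of k-mer frequencies and two profiles are compared directly. The use of cosine similarity and the names `kmer_profile` and `cosine_similarity` are assumptions made for illustration; the abstract does not specify the comparison measure.

```python
import math
from collections import Counter


def kmer_profile(seq, k):
    """Relative frequency of every overlapping k-mer in seq."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


def cosine_similarity(p, q):
    """Cosine similarity of two k-mer profiles (0.0 if either is empty)."""
    dot = sum(value * q.get(kmer, 0.0) for kmer, value in p.items())
    norm = math.sqrt(sum(v * v for v in p.values())) * \
        math.sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    a = kmer_profile("ACACACACGT", 2)
    b = kmer_profile("ACACACACAC", 2)
    c = kmer_profile("GTGTGTGTGT", 2)
    print(cosine_similarity(a, b), cosine_similarity(a, c))
```

Because no alignment is computed, the cost grows linearly with sequence length, which is consistent with the speed advantage the abstract attributes to linguistic methods.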
|
91
|
Karlin S. Statistical studies of biomolecular sequences: score-based methods. Philos Trans R Soc Lond B Biol Sci 1994; 344:391-402. [PMID: 7800709 DOI: 10.1098/rstb.1994.0078] [Citation(s) in RCA: 16] [Impact Index Per Article: 0.5] [Indexed: 01/27/2023] Open
Abstract
The massive accumulation of DNA and protein sequence data poses challenges and opportunities in terms of interpretation and analysis. This presentation reviews the method of score-based sequence analysis with the objectives of discerning distinctive segments in single sequences and identifying significant common segments in sequence comparisons. A number of new results are described here for both the theory and its applications. These include distributional theory involving several high scoring segments in single sequences, distribution formulas for general scoring regimes in multiple sequence comparisons, bounds for periodic scoring assignments, sensitivity analysis of genome composition and refinements on predicting exons and genes in DNA sequences.
Affiliation(s)
- S Karlin
- Department of Mathematics, Stanford University, California 94305-2125
|
92
|
|
93
|
|
94
|
The gadd and MyD genes define a novel set of mammalian genes encoding acidic proteins that synergistically suppress cell growth. Mol Cell Biol 1994. [PMID: 8139541 DOI: 10.1128/mcb.14.4.2361] [Citation(s) in RCA: 295] [Impact Index Per Article: 9.5] [Indexed: 11/20/2022] Open
|
95
|
Zhan Q, Lord KA, Alamo I, Hollander MC, Carrier F, Ron D, Kohn KW, Hoffman B, Liebermann DA, Fornace AJ. The gadd and MyD genes define a novel set of mammalian genes encoding acidic proteins that synergistically suppress cell growth. Mol Cell Biol 1994; 14:2361-71. [PMID: 8139541 PMCID: PMC358603 DOI: 10.1128/mcb.14.4.2361-2371.1994] [Citation(s) in RCA: 141] [Impact Index Per Article: 4.5] [Indexed: 01/29/2023] Open
Abstract
A remarkable overlap was observed between the gadd genes, a group of often coordinately expressed genes that are induced by genotoxic stress and certain other growth arrest signals, and the MyD genes, a set of myeloid differentiation primary response genes. The MyD116 gene was found to be the murine homolog of the hamster gadd34 gene, whereas MyD118 and gadd45 were found to represent two separate but closely related genes. Furthermore, gadd34/MyD116, gadd45, MyD118, and gadd153 encode acidic proteins with very similar and unusual charge characteristics; both this property and a similar pattern of induction are shared with mdm2, which, like gadd45, has been shown previously to be regulated by the tumor suppressor p53. Expression analysis revealed that they are distinguished from other growth arrest genes in that they are DNA damage inducible and suggest a role for these genes in growth arrest and apoptosis either coupled with or uncoupled from terminal differentiation. Evidence is also presented for coordinate induction in vivo by stress. The use of a short-term transfection assay, in which expression vectors for one or a combination of these gadd/MyD genes were transfected with a selectable marker into several different human tumor cell lines, provided direct evidence for the growth-inhibitory functions of the products of these genes and their ability to synergistically suppress growth. Taken together, these observations indicate that these genes define a novel class of mammalian genes encoding acidic proteins involved in the control of cellular growth.
Affiliation(s)
- Q Zhan
- Laboratory of Molecular Pharmacology, National Cancer Institute, Bethesda, Maryland 20892
|
96
|
Abstract
Sequence similarity search programs are versatile tools for the molecular biologist, frequently able to identify possible DNA coding regions and to provide clues to gene and protein structure and function. While much attention had been paid to the precise algorithms these programs employ and to their relative speeds, there is a constellation of associated issues that are equally important to realize the full potential of these methods. Here, we consider a number of these issues, including the choice of scoring systems, the statistical significance of alignments, the masking of uninformative or potentially confounding sequence regions, the nature and extent of sequence redundancy in the databases and network access to similarity search services.
Affiliation(s)
- S F Altschul
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894
|
97
|
DNA amplification fingerprinting: A general tool with applications in breeding, identification and phylogenetic analysis of plants. 1994. [DOI: 10.1007/978-3-0348-7527-1_2] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.1]
|
98
|
Sapolsky RJ, Brendel V, Karlin S. A comparative analysis of distinctive features of yeast protein sequences. Yeast 1993; 9:1287-98. [PMID: 8154180 DOI: 10.1002/yea.320091202] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.3] [Indexed: 01/29/2023] Open
Abstract
The recently published sequence of yeast chromosome III (YCIII) provides the longest continuous stretch of a eukaryotic DNA molecule sequenced to date (315 kb). The sequence contains 116 distinct AUG-initiated open reading frames of at least 200 codons in length, more than 50 of which had not been described previously nor bear significant similarity to known proteins. We have analysed the YCIII known and putative protein sequences with respect to significant statistical features which might reflect on structural and functional characteristics. The YCIII proteins have striking similarities and differences in their sequence attribute distributions compared to the corresponding distributions for all available yeast sequences and other protein collections. Nine examples of YCIII proteins with distinctive sequence features are discussed in detail.
Affiliation(s)
- R J Sapolsky
- Department of Mathematics, Stanford University, CA 94305-2125
|
99
|
Abstract
Recent developments in the statistical analysis of DNA sequences are reviewed. The pace with which sequence data are being generated and analysed has increased with the growth of the human genome project. Two areas of activity are emphasized: attention to error rates in recorded sequences, and heterogeneity in structure of sequences. There is now empirical evidence suggesting error rates in the range 0.1%-1%, and such rates will affect evolutionary studies since these are about the rates at which DNA sequences from different individuals are expected to differ. Heterogeneity for such quantities as base composition, or lengths between successive subsequences of specified types, may be sufficient to account for observed long-range correlations between bases. The need for statistical models and analyses of DNA sequence data will continue, and will offer interesting challenges.
Affiliation(s)
- B S Weir
- Department of Statistics, North Carolina State University, Raleigh 27695-8203
|
100
|
Caetano-Anollés G. Amplifying DNA with arbitrary oligonucleotide primers. PCR METHODS AND APPLICATIONS 1993; 3:85-94. [PMID: 8268791 DOI: 10.1101/gr.3.2.85] [Citation(s) in RCA: 215] [Impact Index Per Article: 6.7] [Indexed: 01/29/2023]
Affiliation(s)
- G Caetano-Anollés
- Institute of Agriculture, University of Tennessee, Knoxville 37901-1071
|