Reference Citation Analysis

For: Salerno W, Havlak P, Miller J. Scale-invariant structure of strongly conserved sequence in genomic intersections and alignments. Proc Natl Acad Sci U S A 2006;103:13121-5. [PMID: 16924100 PMCID: PMC1559763 DOI: 10.1073/pnas.0605735103] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.0] [Indexed: 12/17/2022]

Cited by Other Article(s)

Sheinman M, Arndt PF, Massip F. Modeling the mosaic structure of bacterial genomes to infer their evolutionary history. Proc Natl Acad Sci U S A 2024;121:e2313367121. [PMID: 38517978 PMCID: PMC10990148 DOI: 10.1073/pnas.2313367121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/03/2023] [Accepted: 01/30/2024] [Indexed: 03/24/2024]

Gao K, Miller J. Primary orthologs from local sequence context. BMC Bioinformatics 2020;21:48. [PMID: 32028880 PMCID: PMC7006074 DOI: 10.1186/s12859-020-3384-2] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 02/13/2019] [Accepted: 01/22/2020] [Indexed: 02/05/2023]

Abstract

BACKGROUND

The evolutionary history of genes serves as a cornerstone of contemporary biology. Most conserved sequences in mammalian genomes don't code for proteins, yielding a need to infer evolutionary history of sequences irrespective of what kind of functional element they may encode. Thus, sequence-, as opposed to gene-, centric modes of inferring paths of sequence evolution are increasingly relevant. Customarily, homologous sequences derived from the same direct ancestor, whose ancestral position in two genomes is usually conserved, are termed "primary" (or "positional") orthologs. Methods based solely on similarity don't reliably distinguish primary orthologs from other homologs; for this, genomic context is often essential. Context-dependent identification of orthologs traditionally relies on genomic context over length scales characteristic of conserved gene order or whole-genome sequence alignment, and can be computationally intensive.

RESULTS

We demonstrate that short-range sequence context, as short as a single "maximal" match, distinguishes primary orthologs from other homologs across whole genomes. On mammalian whole genomes not preprocessed by repeat-masker, potential orthologs are extracted by genome intersection as "non-nested maximal matches": maximal matches that are not nested into other maximal matches. It emerges that on both nucleotide and gene scales, non-nested maximal matches recapitulate primary or positional orthologs with high precision and high recall, while the corresponding computation consumes less than one thirtieth of the computation time required by commonly applied whole-genome alignment methods. In regions of genomes that would be masked by repeat-masker, non-nested maximal matches recover orthologs that are inaccessible to Lastz net alignment, for which repeat-masking is a prerequisite. mmRBHs, reciprocal best hits of genes containing non-nested maximal matches, yield novel putative orthologs, e.g. around 1000 pairs of genes for human-chimpanzee.

CONCLUSIONS

We describe an intersection-based method that requires neither repeat-masking nor alignment to infer evolutionary history of sequences based on short-range genomic sequence context. Ortholog identification based on non-nested maximal matches is parameter-free, and less computationally intensive than many alignment-based methods. It is especially suitable for genome-wide identification of orthologs, and may be applicable to unassembled genomes. We are agnostic as to the reasons for its effectiveness, which may reflect local variation of mean mutation rate.
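As an illustration only, the idea of "non-nested maximal matches" can be sketched on toy strings. The sketch below is not the authors' implementation: it scans the forward strand only, uses a quadratic dynamic-programming pass instead of a genome-scale index, and assumes that a match counts as "nested" when its interval in either sequence lies inside the interval of a longer maximal match. The function names and the `min_len` parameter are hypothetical.

```python
from typing import List, Tuple

Match = Tuple[int, int, int]  # (start in a, start in b, length)


def maximal_matches(a: str, b: str, min_len: int = 2) -> List[Match]:
    """Exact matches between a and b that cannot be extended left or right."""
    prev = [0] * (len(b) + 1)
    matches: List[Match] = []
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # cur[j] is the longest common suffix of a[:i] and b[:j],
                # so the match is already fully extended to the left.
                cur[j] = prev[j - 1] + 1
                at_right_end = i == len(a) or j == len(b) or a[i] != b[j]
                if at_right_end and cur[j] >= min_len:
                    matches.append((i - cur[j], j - cur[j], cur[j]))
        prev = cur
    return matches


def _inside(outer_start: int, outer_len: int, start: int, length: int) -> bool:
    """True if [start, start+length) lies within a strictly longer interval."""
    return (outer_len > length
            and outer_start <= start
            and start + length <= outer_start + outer_len)


def non_nested(matches: List[Match]) -> List[Match]:
    """Drop matches nested in a longer match in either sequence (assumed rule)."""
    kept = []
    for m in matches:
        nested = any(
            _inside(o[0], o[2], m[0], m[2]) or _inside(o[1], o[2], m[1], m[2])
            for o in matches
        )
        if not nested:
            kept.append(m)
    return kept


if __name__ == "__main__":
    a, b = "AACGTT", "AACGTTGGCGTC"
    found = maximal_matches(a, b)
    print(sorted(found))
    print(non_nested(found))
```

Genome-scale computations typically rely on suffix-array or suffix-tree indexes rather than a quadratic scan like this one, and would also need to handle the reverse-complement strand.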

Polychronopoulos D, King JWD, Nash AJ, Tan G, Lenhard B. Conserved non-coding elements: developmental gene regulation meets genome organization. Nucleic Acids Res 2018;45:12611-12624. [PMID: 29121339 PMCID: PMC5728398 DOI: 10.1093/nar/gkx1074] [Citation(s) in RCA: 57] [Impact Index Per Article: 9.5] [Received: 07/26/2017] [Accepted: 10/24/2017] [Indexed: 12/20/2022]

Comparing the Statistical Fate of Paralogous and Orthologous Sequences. Genetics 2016;204:475-482. [PMID: 27474728 DOI: 10.1534/genetics.116.193912] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 06/01/2016] [Accepted: 07/26/2016] [Indexed: 02/01/2023]

Gao K, Miller J. Human-chimpanzee alignment: ortholog exponentials and paralog power laws. Comput Biol Chem 2014;53 Pt A:59-70. [PMID: 25443749 DOI: 10.1016/j.compbiolchem.2014.08.010] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.9] [Accepted: 07/11/2014] [Indexed: 11/27/2022]

Almirantis Y, Arndt P, Li W, Provata A. Editorial: Complexity in genomes. Comput Biol Chem 2014;53 Pt A:1-4. [DOI: 10.1016/j.compbiolchem.2014.08.003] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Indexed: 01/01/2023]

Massip F, Sheinman M, Schbath S, Arndt PF. How evolution of genomes is reflected in exact DNA sequence match statistics. Mol Biol Evol 2014;32:524-35. [PMID: 25398628 PMCID: PMC4298173 DOI: 10.1093/molbev/msu313] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.8] [Indexed: 12/27/2022]

Provata A, Nicolis C, Nicolis G. Complexity measures for the evolutionary categorization of organisms. Comput Biol Chem 2014;53 Pt A:5-14. [PMID: 25216557 DOI: 10.1016/j.compbiolchem.2014.08.004] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.1] [Accepted: 07/11/2014] [Indexed: 01/17/2023]

Abstract

Complexity measures are used to compare the genomic characteristics of five organisms belonging to distinct classes spanning the evolutionary tree: higher eukaryotes, amoebae, unicellular eukaryotes and bacteria. The comparisons are undertaken using the full four-letter alphabet and the coarse grained two-letter alphabets AG-CT and AT-CG. We show that the conditional probability matrix for the four-letter and AT-CG alphabet is markedly asymmetric in eukaryotes while it is nearly symmetric in bacterial genomes. Spatial asymmetry is revealed in the four-letter alphabet, signifying that the probability fluxes are nonvanishing and thus the reading sense of a sequence is irreversible for all organisms. Calculations of the block entropy and excess entropy demonstrate that the human genome accommodates better all possible block configurations, especially for long blocks. With respect to point-to-point details and to spatial arrangement of blocks the exit distance distributions from a particular letter demonstrate long distance characteristics in the eukaryotic sequences for all three alphabets, while the bacterial (prokaryotic) genomes deviate indicating short range characteristics. Overall, the conditional probability, the fluxes, the block entropy content and the exit distance distributions can be used as markers, discriminating between eukaryotic and prokaryotic DNA, allowing in many cases to discern details related to finer classes. In all cases the reduction from four letters to two masks some important statistical and spatial properties, with the AT-CG alphabet having higher ability of discrimination than the AG-CT one. In particular, the AT-CG alphabet reduction accentuates the CpG related properties (conditional probabilities w32, long ranged exit distance distribution for A and T nucleotides), but masks sequence asymmetry and irreversibility in all examined organisms.
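To make the alphabet reduction and block entropy mentioned in this abstract concrete, here is a minimal sketch. It assumes the common Shannon definition H_n = -Σ p(b) log2 p(b) over overlapping blocks of length n; the paper's exact estimator, logarithm base, and block sampling are not given in the abstract. The names `AG_CT`, `AT_CG`, `coarse_grain`, and `block_entropy`, and the output letters R/Y and W/S, are illustrative choices.

```python
import math
from collections import Counter

# Two-letter reductions named in the abstract. The output symbols are
# illustrative: R/Y for AG-CT and W/S for AT-CG.
AG_CT = {"A": "R", "G": "R", "C": "Y", "T": "Y"}
AT_CG = {"A": "W", "T": "W", "C": "S", "G": "S"}


def coarse_grain(sequence: str, mapping: dict) -> str:
    """Rewrite a four-letter DNA string in a two-letter alphabet."""
    return "".join(mapping[base] for base in sequence)


def block_entropy(sequence: str, n: int) -> float:
    """Shannon entropy (bits) of overlapping length-n blocks (assumed estimator)."""
    total = len(sequence) - n + 1
    if total <= 0:
        raise ValueError("block length must not exceed the sequence length")
    counts = Counter(sequence[i:i + n] for i in range(total))
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


if __name__ == "__main__":
    toy = "ACGTTGCAACGGTCAT"
    for name, seq in (("ACGT", toy),
                      ("AG-CT", coarse_grain(toy, AG_CT)),
                      ("AT-CG", coarse_grain(toy, AT_CG))):
        print(name, [round(block_entropy(seq, n), 3) for n in (1, 2, 3)])
```

On real genomes the block counts for long blocks are sparse, so a naive frequency estimate like this one is biased for large n; the sketch is meant only to show what quantity is being compared across alphabets.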

Li W, Freudenberg J. Characterizing regions in the human genome unmappable by next-generation-sequencing at the read length of 1000 bases. Comput Biol Chem 2014;53 Pt A:108-17. [PMID: 25241312 DOI: 10.1016/j.compbiolchem.2014.08.015] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.5] [Accepted: 07/11/2014] [Indexed: 12/31/2022]

Polychronopoulos D, Sellis D, Almirantis Y. Conserved noncoding elements follow power-law-like distributions in several genomes as a result of genome dynamics. PLoS One 2014;9:e95437. [PMID: 24787386 PMCID: PMC4008492 DOI: 10.1371/journal.pone.0095437] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.0] [Received: 11/20/2013] [Accepted: 03/26/2014] [Indexed: 12/31/2022]

Carbone A. Information measure for long-range correlated sequences: the case of the 24 human chromosomes. Sci Rep 2014;3:2721. [PMID: 24056670 PMCID: PMC3779848 DOI: 10.1038/srep02721] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.9] [Received: 03/25/2013] [Accepted: 09/04/2013] [Indexed: 01/14/2023]

Li W, Freudenberg J, Miramontes P. Diminishing return for increased Mappability with longer sequencing reads: implications of the k-mer distributions in the human genome. BMC Bioinformatics 2014;15:2. [PMID: 24386976 PMCID: PMC3927684 DOI: 10.1186/1471-2105-15-2] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.5] [Received: 08/25/2013] [Accepted: 12/17/2013] [Indexed: 11/10/2022]

Abstract

Background

The amount of non-unique sequence (non-singletons) in a genome directly affects the difficulty of read alignment to a reference assembly for high-throughput sequencing data. Although a longer read is more likely to be uniquely mapped to the reference genome, a quantitative analysis of the influence of read lengths on mappability has been lacking. To address this question, we evaluate the k-mer distribution of the human reference genome. The k-mer frequency is determined for k ranging from 20 bp to 1000 bp.

Results

We observe that the proportion of non-singleton k-mers decreases slowly with increasing k, and can be fitted by piecewise power-law functions with different exponents at different ranges of k. A slower decay at greater values of k indicates more limited gains in mappability for read lengths between 200 bp and 1000 bp. The frequency distributions of k-mers exhibit long tails with a power-law-like trend, and rank frequency plots exhibit a concave Zipf’s curve. The most frequent 1000-mers comprise 172 regions, which include four large stretches on chromosomes 1 and X, containing genes of biomedical relevance. Comparison with other databases indicates that the 172 regions can be broadly classified into two types: those containing LINE transposable elements and those containing segmental duplications.

Conclusion

Read mappability, as measured by the proportion of singletons, increases steadily up to a length scale of around 200 bp. When read length increases above 200 bp, smaller gains in mappability are expected. Moreover, the proportion of non-singletons decreases with read length much more slowly than linearly. Even a read length of 1000 bp would not allow the unique alignment of reads for many coding regions of human genes. A mix of techniques will be needed for efficiently producing high-quality data that cover the complete human genome.
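The quantity discussed in this abstract, the proportion of non-singleton k-mers as a function of k, can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' pipeline: it counts k-mers on the forward strand only, measures the proportion over k-mer positions rather than over distinct k-mers, and holds all counts in memory, which would not scale to the human genome. The function name `non_singleton_fraction` is hypothetical.

```python
from collections import Counter


def non_singleton_fraction(sequence: str, k: int) -> float:
    """Fraction of k-mer positions whose k-mer occurs more than once."""
    n_positions = len(sequence) - k + 1
    if n_positions <= 0:
        raise ValueError("k must not exceed the sequence length")
    counts = Counter(sequence[i:i + k] for i in range(n_positions))
    # Every occurrence of a repeated k-mer is a position that a read of
    # length k could not be placed at uniquely.
    repeated_positions = sum(c for c in counts.values() if c > 1)
    return repeated_positions / n_positions


if __name__ == "__main__":
    toy = "ACGTACGTTTGACCAACGTAGG"
    for k in (2, 4, 6, 8):
        print(k, round(non_singleton_fraction(toy, k), 3))
```

Plotting this fraction against k on log-log axes is one way to look for the piecewise power-law decay the abstract describes.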

Taillefer E, Miller J. Exhaustive computation of exact duplications via super and non-nested local maximal repeats. J Bioinform Comput Biol 2013;12:1350018. [PMID: 24467757 DOI: 10.1142/s0219720013500182] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Indexed: 01/05/2023]

Koroteev MV, Miller J. Scale-free duplication dynamics: a model for ultraduplication. Phys Rev E Stat Nonlin Soft Matter Phys 2011;84:061919. [PMID: 22304128 DOI: 10.1103/physreve.84.061919] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Received: 08/17/2010] [Revised: 07/04/2011] [Indexed: 05/31/2023]

Algebraic distribution of segmental duplication lengths in whole-genome sequence self-alignments. PLoS One 2011;6:e18464. [PMID: 21779315 PMCID: PMC3136455 DOI: 10.1371/journal.pone.0018464] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.0] [Received: 09/17/2010] [Accepted: 03/08/2011] [Indexed: 01/25/2023]

Gu P, Reid JG, Gao X, Shaw CA, Creighton C, Tran PL, Zhou X, Drabek RB, Steffen DL, Hoang DM, Weiss MK, Naghavi AO, El-daye J, Khan MF, Legge GB, Wheeler DA, Gibbs RA, Miller JN, Cooney AJ, Gunaratne PH. Novel microRNA candidates and miRNA-mRNA pairs in embryonic stem (ES) cells. PLoS One 2008;3:e2548. [PMID: 18648548 PMCID: PMC2481296 DOI: 10.1371/journal.pone.0002548] [Citation(s) in RCA: 42] [Impact Index Per Article: 2.6] [Received: 10/03/2007] [Accepted: 05/22/2008] [Indexed: 01/24/2023]

Affiliation(s)

Peili Gu: Department of Molecular & Cellular Biology, Baylor College of Medicine, Houston, Texas, United States of America; Department of Cancer Genetics, M.D. Anderson Cancer Center, University of Texas, Houston, Texas, United States of America
Jeffrey G. Reid: Department of Chemistry, University of Houston, Houston, Texas, United States of America; Department of Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas, United States of America; W. M. Keck Center for Interdisciplinary Bioscience Training, Houston, Texas, United States of America
Xiaolian Gao: Department of Biology & Biochemistry, University of Houston, Houston, Texas, United States of America; Department of Chemistry, University of Houston, Houston, Texas, United States of America
Chad A. Shaw: Department of Molecular & Human Genetics, Baylor College of Medicine, Houston, Texas, United States of America
Chad Creighton: Duncan Cancer Center, Baylor College of Medicine, Houston, Texas, United States of America
Peter L. Tran: Department of Biology & Biochemistry, University of Houston, Houston, Texas, United States of America
Xiaochuan Zhou: LC Sciences, Houston, Texas, United States of America
Rafal B. Drabek: Department of Biology & Biochemistry, University of Houston, Houston, Texas, United States of America; Department of Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas, United States of America
David L. Steffen: Department of Molecular & Human Genetics, Baylor College of Medicine, Houston, Texas, United States of America; Bioinformatics Research Center, Baylor College of Medicine, Houston, Texas, United States of America
David M. Hoang: Department of Biology & Biochemistry, University of Houston, Houston, Texas, United States of America
Michelle K. Weiss: Department of Biology & Biochemistry, University of Houston, Houston, Texas, United States of America
Arash O. Naghavi: Department of Biology & Biochemistry, University of Houston, Houston, Texas, United States of America
Jad El-daye: Department of Biology & Biochemistry, University of Houston, Houston, Texas, United States of America
Mahjabeen F. Khan: Department of Biology & Biochemistry, University of Houston, Houston, Texas, United States of America
Glen B. Legge: Department of Biology & Biochemistry, University of Houston, Houston, Texas, United States of America
David A. Wheeler: Department of Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas, United States of America
Richard A. Gibbs: Department of Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas, United States of America
Jonathan N. Miller: Department of Biochemistry & Molecular Biology, Baylor College of Medicine, Houston, Texas, United States of America; Department of Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas, United States of America
Austin J. Cooney: Department of Molecular & Cellular Biology, Baylor College of Medicine, Houston, Texas, United States of America
Preethi H. Gunaratne: Department of Biology & Biochemistry, University of Houston, Houston, Texas, United States of America; Department of Pathology, Baylor College of Medicine, Houston, Texas, United States of America; Department of Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas, United States of America

Virtanen C, Takahashi M. Muscling in on microarrays. Appl Physiol Nutr Metab 2008;33:124-9. [PMID: 18347662 DOI: 10.1139/h07-150] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Indexed: 12/20/2022]

Winkler DA. Network models in drug discovery and regenerative medicine. Biotechnol Annu Rev 2008;14:143-70. [PMID: 18606362 DOI: 10.1016/s1387-2656(08)00005-7] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Indexed: 01/02/2023]