1
|
Chantzi N, Mareboina M, Konnaris MA, Montgomery A, Patsakis M, Mouratidis I, Georgakopoulos-Soares I. The determinants of the rarity of nucleic and peptide short sequences in nature. NAR Genom Bioinform 2024; 6:lqae029. [PMID: 38584871 PMCID: PMC10993293 DOI: 10.1093/nargab/lqae029] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/12/2023] [Revised: 02/21/2024] [Accepted: 03/18/2024] [Indexed: 04/09/2024] Open
Abstract
The prevalence of nucleic and peptide short sequences across organismal genomes and proteomes has not been thoroughly investigated. We examined 45 785 reference genomes and 21 871 reference proteomes, spanning archaea, bacteria, eukaryotes and viruses to calculate the rarity of short sequences in them. To capture this, we developed a metric of the rarity of each sequence in nature, the rarity index. We find that the frequency of certain dipeptides in rare oligopeptide sequences is hundreds of times lower than expected, which is not the case for any dinucleotides. We also generate predictive regression models that infer the rarity of nucleic and proteomic sequences across nature or within each domain of life and viruses separately. When examining each of the three domains of life and viruses separately, the R² performance of the model predicting rarity for 5-mer peptides from mono- and dipeptides ranged between 0.814 and 0.932. A separate model predicting rarity for 10-mer oligonucleotides from mono- and dinucleotides achieved R² performance between 0.408 and 0.606. Our results indicate that the mono- and dinucleotide composition of nucleic sequences and the mono- and dipeptide composition of peptide sequences can explain a significant proportion of the variance in their frequencies in nature.
Affiliation(s)
- Nikol Chantzi
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, 17033, USA
| | - Manvita Mareboina
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, 17033, USA
| | - Maxwell A Konnaris
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, 17033, USA
- Department of Statistics, Penn State University, University Park, PA, 16802, USA
- Huck Institutes of the Life Sciences, Penn State University, University Park, PA, 16802, USA
| | - Austin Montgomery
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, 17033, USA
| | - Michail Patsakis
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, 17033, USA
| | - Ioannis Mouratidis
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, 17033, USA
- Huck Institutes of the Life Sciences, Penn State University, University Park, PA, 16802, USA
| | - Ilias Georgakopoulos-Soares
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, 17033, USA
|
2
|
Roberts M, Josephs EB. Previously unmeasured genetic diversity explains part of Lewontin's paradox in a k-mer-based meta-analysis of 112 plant species. bioRxiv 2024:2024.05.17.594778. [PMID: 38798362 PMCID: PMC11118579 DOI: 10.1101/2024.05.17.594778] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/29/2024]
Abstract
At the molecular level, most evolution is expected to be neutral. A key prediction of this expectation is that the level of genetic diversity in a population should scale with population size. However, as was noted by Richard Lewontin in 1974 and reaffirmed by later studies, the slope of the population size-diversity relationship in nature is much weaker than expected under neutral theory. We hypothesize that one contributor to this paradox is that current methods relying on single nucleotide polymorphisms (SNPs) called from aligning short reads to a reference genome underestimate levels of genetic diversity in many species. To test this idea, we calculated nucleotide diversity (π) and k-mer-based metrics of genetic diversity across 112 plant species, amounting to over 205 terabases of DNA sequencing data from 27,488 individual plants. We then compared how these different metrics correlated with proxies of population size that account for both range size and population density variation across species. We found that our population size proxies scaled anywhere from about 3 to over 20 times faster with k-mer diversity than nucleotide diversity after adjusting for evolutionary history, mating system, life cycle habit, cultivation status, and invasiveness. The relationship between k-mer diversity and population size proxies also remains significant after correcting for genome size, whereas the analogous relationship for nucleotide diversity does not. These results suggest that variation not captured by common SNP-based analyses explains part of Lewontin's paradox in plants.
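For orientation, the SNP-based quantity discussed in this abstract can be sketched in a few lines. This is the textbook definition of nucleotide diversity (mean pairwise differences per site over aligned sequences), not the authors' pipeline. The function name and the toy alignment are assumptions made for illustration.

```python
from itertools import combinations


def nucleotide_diversity(seqs):
    """Mean number of pairwise differences per site (textbook pi).

    `seqs` is a list of equal-length, already aligned sequences.
    Illustrative helper only; not taken from the cited study.
    """
    if len(seqs) < 2:
        raise ValueError("need at least two sequences")
    length = len(seqs[0])
    if length == 0 or any(len(s) != length for s in seqs):
        raise ValueError("sequences must be aligned, non-empty, equal length")
    pairs = list(combinations(seqs, 2))
    # Count mismatching sites for every pair of sequences.
    total_diffs = sum(
        sum(a != b for a, b in zip(s1, s2)) for s1, s2 in pairs
    )
    return total_diffs / (len(pairs) * length)


# Toy alignment (hypothetical data): one variable site out of four.
pi = nucleotide_diversity(["ACGT", "ACGA", "ACGT"])
```

Because this estimator only sees sites that survive alignment to a reference, variation in unaligned or absent sequence contributes nothing to it, which is the gap the k-mer-based metrics are meant to address.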
|
3
|
Ponsero AJ, Miller M, Hurwitz BL. Comparison of k-mer-based de novo comparative metagenomic tools and approaches. Microbiome Res Rep 2023; 2:27. [PMID: 38058765 PMCID: PMC10696585 DOI: 10.20517/mrr.2023.26] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/04/2023] [Revised: 06/28/2023] [Accepted: 07/12/2023] [Indexed: 12/08/2023]
Abstract
Aim: Comparative metagenomic analysis requires measuring pairwise similarity between the metagenomes in a dataset. Reference-based methods that compute a beta-diversity distance between two metagenomes are highly dependent on the quality and completeness of the reference database, and their application to less studied microbiota can be challenging. On the other hand, de novo comparative metagenomic methods rely only on the sequence composition of metagenomes to compare datasets. While each of these approaches has its strengths and limitations, their comparison is currently limited. Methods: We developed sets of simulated short-read metagenomes to (1) compare k-mer-based and taxonomy-based distances and evaluate the impact of technical and biological variables on these metrics and (2) evaluate the effect of k-mer sketching and filtering. We used a real-world metagenomic dataset to provide an overview of the currently available tools for de novo metagenomic comparative analysis. Results: Using simulated metagenomes of known composition and controlled error rate, we showed that k-mer-based distance metrics were well correlated with the taxonomic distance metric for quantitative beta-diversity metrics, but the correlation was low for presence/absence distances. The community complexity in terms of taxa richness and the sequencing depth significantly affected the quality of the k-mer-based distances, while the impact of low amounts of sequence contamination and sequencing error was limited. Finally, we benchmarked currently available de novo comparative metagenomic tools and compared their output on two datasets of fecal metagenomes and showed that most k-mer-based tools were able to recapitulate the data structure observed using taxonomic approaches.
Conclusion: This study expands our understanding of the strengths and limitations of k-mer-based de novo comparative metagenomic approaches and aims to provide concrete guidelines for researchers interested in applying these approaches to their metagenomic datasets.
Affiliation(s)
- Alise Jany Ponsero
- Human Microbiome Research Program, University of Helsinki, Helsinki 00290, Finland
- Department of Biosystems Engineering, The University of Arizona, Tucson, AZ 85721, USA
- BIO5 Institute, The University of Arizona, Tucson, AZ 85721, USA
| | - Matthew Miller
- Department of Biosystems Engineering, The University of Arizona, Tucson, AZ 85721, USA
| | - Bonnie Louise Hurwitz
- Department of Biosystems Engineering, The University of Arizona, Tucson, AZ 85721, USA
- BIO5 Institute, The University of Arizona, Tucson, AZ 85721, USA
|
4
|
Xu X, Yin Z, Yan L, Zhang H, Xu B, Wei Y, Niu B, Schmidt B, Liu W. RabbitTClust: enabling fast clustering analysis of millions of bacteria genomes with MinHash sketches. Genome Biol 2023; 24:121. [PMID: 37198663 PMCID: PMC10190105 DOI: 10.1186/s13059-023-02961-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/14/2022] [Accepted: 05/05/2023] [Indexed: 05/19/2023] Open
Abstract
We present RabbitTClust, a fast and memory-efficient genome clustering tool based on sketch-based distance estimation. Our approach enables efficient processing of large-scale datasets by combining dimensionality reduction techniques with streaming and parallelization on modern multi-core platforms. 113,674 complete bacterial genome sequences from RefSeq, 455 GB in FASTA format, can be clustered within less than 6 min and 1,009,738 GenBank assembled bacterial genomes, 4.0 TB in FASTA format, within only 34 min on a 128-core workstation. Our results further identify 1269 redundant genomes, with identical nucleotide content, in the RefSeq bacterial genomes database.
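The sketch-based distance estimation mentioned in this abstract can be illustrated with a generic bottom-s MinHash sketch. This is a minimal sketch under stated assumptions, not RabbitTClust's implementation: the hash function (BLAKE2b truncated to 64 bits), the function names, and the parameters are choices made here for illustration, and canonical (strand-independent) k-mers are omitted for brevity.

```python
import hashlib


def kmers(seq, k):
    """Set of distinct k-mers in `seq` (forward strand only)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hash64(kmer):
    """64-bit hash of a k-mer; the hash choice is an assumption."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def sketch(seq, k, size):
    """Bottom-s sketch: the `size` smallest k-mer hashes, sorted."""
    return sorted(hash64(km) for km in kmers(seq, k))[:size]


def jaccard_estimate(sk_a, sk_b, size):
    """Estimate Jaccard similarity from two bottom-s sketches."""
    merged = sorted(set(sk_a) | set(sk_b))[:size]
    if not merged:
        raise ValueError("both sketches are empty")
    shared = set(sk_a) & set(sk_b)
    # Fraction of the smallest union hashes present in both sketches.
    return sum(1 for h in merged if h in shared) / len(merged)


a = sketch("ACGTACGT", k=4, size=100)
b = sketch("ACGTTTTT", k=4, size=100)
similarity = jaccard_estimate(a, b, size=100)
```

Each genome is reduced to a fixed-size list of hashes, so comparing two genomes costs time proportional to the sketch size rather than the genome length. When the sketch size exceeds the number of distinct k-mers, as in the toy call above, the estimate equals the exact Jaccard similarity.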
Affiliation(s)
- Xiaoming Xu
- School of Software, Shandong University, Jinan, China
| | - Zekun Yin
- School of Software, Shandong University, Jinan, China
- Shenzhen Research Institute of Shandong University, Shandong University, Shenzhen, China
| | - Lifeng Yan
- School of Software, Shandong University, Jinan, China
- Shenzhen Research Institute of Shandong University, Shandong University, Shenzhen, China
| | - Hao Zhang
- School of Software, Shandong University, Jinan, China
- Shenzhen Research Institute of Shandong University, Shandong University, Shenzhen, China
| | - Borui Xu
- School of Software, Shandong University, Jinan, China
| | - Yanjie Wei
- Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
| | - Beifang Niu
- Computer Network Information Center, Chinese Academy of Sciences, Beijing, China
| | - Bertil Schmidt
- Institute for Computer Science, Johannes Gutenberg University, Mainz, Germany
| | - Weiguo Liu
- School of Software, Shandong University, Jinan, China
|
5
|
Zheng Y, Shi J, Chen Q, Deng C, Yang F, Wang Y. Identifying individual-specific microbial DNA fingerprints from skin microbiomes. Front Microbiol 2022; 13:960043. [PMID: 36274714 PMCID: PMC9583911 DOI: 10.3389/fmicb.2022.960043] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/02/2022] [Accepted: 09/16/2022] [Indexed: 11/22/2022] Open
Abstract
Skin is an important ecosystem that links the human body and the external environment. Previous studies have shown that the skin microbial community can remain stable even after long-term exposure to the external environment. In this study, we explore two questions: Are there strains or genetic variants in skin microorganisms that are individual-specific, temporally stable, and body site-independent? And if so, can such microbial genetic variants be used as markers, called “fingerprints” in our study, to identify donors? We propose a framework to capture individual-specific microbial DNA fingerprints from skin metagenomic sequencing data. The fingerprints are identified from the frequency of 31-mers, without reference genomes or sequence alignments. We used 616 metagenomic samples from 17 skin sites at three time points from 12 healthy individuals in the Integrative Human Microbiome Project. Ultimately, one contig per individual is assembled as a fingerprint. The results show that 89.78% of the skin samples, regardless of body site, identify their donors correctly. Ten of the 12 individual-specific fingerprints could be aligned to Cutibacterium acnes. Our study proves that the identified fingerprints are temporally stable, body site-independent, and individual-specific, and can identify their donors with sufficient accuracy. The source code of the genetic identification framework is freely available at https://github.com/Ying-Lab/skin_fingerprint.
Affiliation(s)
- Yiluan Zheng
- Department of Automation, Xiamen University, Xiamen, China
- National Institute for Data Science in Health and Medicine, Xiamen University, Xiamen, China
| | - Jianlu Shi
- Stomatological Hospital of Xiamen Medical College, Xiamen, China
- Xiamen Key Laboratory of Stomatological Disease Diagnosis and Treatment, Xiamen, China
| | - Qi Chen
- Department of Automation, Xiamen University, Xiamen, China
| | - Chao Deng
- Department of Automation, Xiamen University, Xiamen, China
| | - Fan Yang
- Department of Automation, Xiamen University, Xiamen, China
| | - Ying Wang
- Department of Automation, Xiamen University, Xiamen, China
- Xiamen Key Laboratory of Big Data Intelligent Analysis and Decision, Xiamen, China
- Fujian Key Laboratory of Genetics and Breeding of Marine Organisms, Xiamen, China
- *Correspondence: Ying Wang
|
6
|
Mc Cartney AM, Shafin K, Alonge M, Bzikadze AV, Formenti G, Fungtammasan A, Howe K, Jain C, Koren S, Logsdon GA, Miga KH, Mikheenko A, Paten B, Shumate A, Soto DC, Sović I, Wood JMD, Zook JM, Phillippy AM, Rhie A. Chasing perfection: validation and polishing strategies for telomere-to-telomere genome assemblies. Nat Methods 2022; 19:687-695. [PMID: 35361931 PMCID: PMC9812399 DOI: 10.1038/s41592-022-01440-3] [Citation(s) in RCA: 40] [Impact Index Per Article: 20.0] [Received: 07/13/2021] [Accepted: 03/04/2022] [Indexed: 01/07/2023]
Abstract
Advances in long-read sequencing technologies and genome assembly methods have enabled the recent completion of the first telomere-to-telomere human genome assembly, which resolves complex segmental duplications and large tandem repeats, including centromeric satellite arrays in a complete hydatidiform mole (CHM13). Although derived from highly accurate sequences, evaluation revealed evidence of small errors and structural misassemblies in the initial draft assembly. To correct these errors, we designed a new repeat-aware polishing strategy that made accurate assembly corrections in large repeats without overcorrection, ultimately fixing 51% of the existing errors and improving the assembly quality value from 70.2 to 73.9 measured from PacBio high-fidelity and Illumina k-mers. By comparing our results to standard automated polishing tools, we outline common polishing errors and offer practical suggestions for genome projects with limited resources. We also show how sequencing biases in both high-fidelity and Oxford Nanopore Technologies reads cause signature assembly errors that can be corrected with a diverse panel of sequencing technologies.
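To make the "quality value measured from k-mers" concrete, the following sketch uses the k-mer survival estimator popularized by Merqury: k-mers found in the assembly but absent from the reads are treated as evidence of base errors. Whether the cited study used exactly this formula is an assumption here, and the function name and the counts in the example are hypothetical.

```python
import math


def kmer_qv(asm_only_kmers, total_asm_kmers, k):
    """Phred-scaled consensus quality from k-mer counts.

    asm_only_kmers : assembly k-mers never seen in the read set
    total_asm_kmers: all k-mers in the assembly
    Assumes every erroneous base corrupts the k-mers covering it.
    """
    if total_asm_kmers <= 0 or not 0 <= asm_only_kmers <= total_asm_kmers:
        raise ValueError("invalid k-mer counts")
    if asm_only_kmers == 0:
        return math.inf  # no unsupported k-mers observed
    # Probability that a single base is correct, given that a k-mer
    # survives only if all k of its bases are correct.
    p_base_correct = (1 - asm_only_kmers / total_asm_kmers) ** (1 / k)
    return -10 * math.log10(1 - p_base_correct)


# Hypothetical counts: 21 unsupported 21-mers out of ten million.
qv = kmer_qv(21, 10**7, 21)
```

On the Phred scale, each 10-point increase corresponds to a tenfold lower per-base error rate, so a quality value of 70 means roughly one error per ten million bases.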
Affiliation(s)
- Ann M. Mc Cartney
- Genome Informatics Section, Computational and Statistical Genomics Branch, NHGRI, NIH
| | - Kishwar Shafin
- UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, CA, USA
| | - Michael Alonge
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
| | - Andrey V. Bzikadze
- Graduate Program in Bioinformatics and Systems Biology, University of California San Diego, La Jolla, CA, USA
| | - Giulio Formenti
- Laboratory of Neurogenetics of Language and The Vertebrate Genome Lab, The Rockefeller University, New York, NY, USA
| | | | | | - Chirag Jain
- Genome Informatics Section, Computational and Statistical Genomics Branch, NHGRI, NIH
- Department of Computational and Data Sciences, Indian Institute of Science, Bangalore KA, India
| | - Sergey Koren
- Genome Informatics Section, Computational and Statistical Genomics Branch, NHGRI, NIH
| | - Glennis A. Logsdon
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
| | - Karen H. Miga
- UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, CA, USA
| | - Alla Mikheenko
- Center for Algorithmic Biotechnology, Institute of Translational Biomedicine, Saint Petersburg State University, Saint Petersburg, Russia
| | - Benedict Paten
- UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, CA, USA
| | - Alaina Shumate
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
| | - Daniela C. Soto
- Genome Center, MIND Institute, Department of Biochemistry and Molecular Medicine, University of California, Davis, CA, USA
| | - Ivan Sović
- Pacific Biosciences, Menlo Park, CA, USA
- Digital BioLogic d.o.o., Ivanić-Grad, Croatia
| | | | - Justin M. Zook
- Biosystems and Biomaterials Division, National Institute of Standards and Technology, Gaithersburg, MD, USA
| | - Adam M. Phillippy
- Genome Informatics Section, Computational and Statistical Genomics Branch, NHGRI, NIH
- *Correspondence: Adam M. Phillippy
| | - Arang Rhie
- Genome Informatics Section, Computational and Statistical Genomics Branch, NHGRI, NIH
- *Correspondence: Arang Rhie
|
7
|
Merfin: improved variant filtering, assembly evaluation and polishing via k-mer validation. Nat Methods 2022; 19:696-704. [PMID: 35361932 PMCID: PMC9745813 DOI: 10.1038/s41592-022-01445-y] [Citation(s) in RCA: 24] [Impact Index Per Article: 12.0] [Received: 07/13/2021] [Accepted: 03/07/2022] [Indexed: 12/15/2022]
Abstract
Variant calling has been widely used for genotyping and for improving the consensus accuracy of long-read assemblies. Variant calls are commonly hard-filtered with user-defined cutoffs. However, it is impossible to define a single set of optimal cutoffs, as the calls heavily depend on the quality of the reads, the variant caller of choice and the quality of the unpolished assembly. Here, we introduce Merfin, a k-mer based variant-filtering algorithm for improved accuracy in genotyping and genome assembly polishing. Merfin evaluates each variant based on the expected k-mer multiplicity in the reads, independently of the quality of the read alignment and variant caller's internal score. Merfin increased the precision of genotyped calls in several benchmarks, improved consensus accuracy and reduced frameshift errors when applied to human and nonhuman assemblies built from Pacific Biosciences HiFi and continuous long reads or Oxford Nanopore reads, including the first complete human genome. Moreover, we introduce assembly quality and completeness metrics that account for the expected genomic copy numbers.
|
8
|
Hoarfrost A, Aptekmann A, Farfañuk G, Bromberg Y. Deep learning of a bacterial and archaeal universal language of life enables transfer learning and illuminates microbial dark matter. Nat Commun 2022; 13:2606. [PMID: 35545619 PMCID: PMC9095714 DOI: 10.1038/s41467-022-30070-8] [Citation(s) in RCA: 12] [Impact Index Per Article: 6.0] [Received: 01/20/2021] [Accepted: 03/30/2022] [Indexed: 12/22/2022] Open
Abstract
The majority of microbial genomes have yet to be cultured, and most proteins identified in microbial genomes or environmental sequences cannot be functionally annotated. As a result, current computational approaches to describe microbial systems rely on incomplete reference databases that cannot adequately capture the functional diversity of the microbial tree of life, limiting our ability to model high-level features of biological sequences. Here we present LookingGlass, a deep learning model encoding contextually-aware, functionally and evolutionarily relevant representations of short DNA reads, that distinguishes reads of disparate function, homology, and environmental origin. We demonstrate the ability of LookingGlass to be fine-tuned via transfer learning to perform a range of diverse tasks: to identify novel oxidoreductases, to predict enzyme optimal temperature, and to recognize the reading frames of DNA sequence fragments. LookingGlass enables functionally relevant representations of otherwise unknown and unannotated sequences, shedding light on the microbial dark matter that dominates life on Earth. Computational methods to analyse microbial systems rely on reference databases which do not capture their full functional diversity. Here the authors develop a deep learning model and apply it using transfer learning, creating biologically useful models for multiple different tasks.
Affiliation(s)
- A Hoarfrost
- Department of Marine and Coastal Sciences, Rutgers University, 71 Dudley Road, New Brunswick, NJ, 08873, USA
- NASA Ames Research Center, Moffett Field, CA, 94035, USA
| | - A Aptekmann
- Department of Biochemistry and Microbiology, Rutgers University, 76 Lipman Dr, New Brunswick, NJ, 08901, USA
| | - G Farfañuk
- Department of Biological Chemistry, Facultad de Ciencias Exactas y Naturales, Universidad de Buenos Aires, Buenos Aires, Argentina
| | - Y Bromberg
- Department of Biochemistry and Microbiology, Rutgers University, 76 Lipman Dr, New Brunswick, NJ, 08901, USA.
|
9
|
Hoyt SJ, Storer JM, Hartley GA, Grady PGS, Gershman A, de Lima LG, Limouse C, Halabian R, Wojenski L, Rodriguez M, Altemose N, Rhie A, Core LJ, Gerton JL, Makalowski W, Olson D, Rosen J, Smit AFA, Straight AF, Vollger MR, Wheeler TJ, Schatz MC, Eichler EE, Phillippy AM, Timp W, Miga KH, O’Neill RJ. From telomere to telomere: The transcriptional and epigenetic state of human repeat elements. Science 2022; 376:eabk3112. [PMID: 35357925 PMCID: PMC9301658 DOI: 10.1126/science.abk3112] [Citation(s) in RCA: 114] [Impact Index Per Article: 57.0] [Indexed: 12/12/2022]
Abstract
Mobile elements and repetitive genomic regions are sources of lineage-specific genomic innovation and uniquely fingerprint individual genomes. Comprehensive analyses of such repeat elements, including those found in more complex regions of the genome, require a complete, linear genome assembly. We present a de novo repeat discovery and annotation of the T2T-CHM13 human reference genome. We identified previously unknown satellite arrays, expanded the catalog of variants and families for repeats and mobile elements, characterized classes of complex composite repeats, and located retroelement transduction events. We detected nascent transcription and delineated CpG methylation profiles to define the structure of transcriptionally active retroelements in humans, including those in centromeres. These data expand our insight into the diversity, distribution, and evolution of repetitive regions that have shaped the human genome.
Affiliation(s)
- Savannah J. Hoyt
- Department of Molecular and Cell Biology, University of Connecticut, Storrs, CT, USA
| | | | - Gabrielle A. Hartley
- Department of Molecular and Cell Biology, University of Connecticut, Storrs, CT, USA
| | - Patrick G. S. Grady
- Department of Molecular and Cell Biology, University of Connecticut, Storrs, CT, USA
| | - Ariel Gershman
- Department of Molecular Biology and Genetics, Johns Hopkins University, Baltimore, MD, USA
| | | | - Charles Limouse
- Department of Biochemistry, Stanford University, Stanford, CA, USA
| | - Reza Halabian
- Institute of Bioinformatics, Faculty of Medicine, University of Münster, Münster, Germany
| | - Luke Wojenski
- Department of Molecular and Cell Biology, University of Connecticut, Storrs, CT, USA
| | - Matias Rodriguez
- Institute of Bioinformatics, Faculty of Medicine, University of Münster, Münster, Germany
| | - Nicolas Altemose
- Department of Bioengineering, University of California, Berkeley, Berkeley, CA, USA
| | - Arang Rhie
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | - Leighton J. Core
- Department of Molecular and Cell Biology, University of Connecticut, Storrs, CT, USA
- Institute for Systems Genomics, University of Connecticut, Storrs, CT, USA
| | | | - Wojciech Makalowski
- Institute of Bioinformatics, Faculty of Medicine, University of Münster, Münster, Germany
| | - Daniel Olson
- Department of Computer Science, University of Montana, Missoula, MT, USA
| | - Jeb Rosen
- Institute for Systems Biology, Seattle, WA, USA
| | | | | | - Mitchell R. Vollger
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
| | - Travis J. Wheeler
- Department of Computer Science, University of Montana, Missoula, MT, USA
| | - Michael C. Schatz
- Department of Computer Science and Department of Biology, Johns Hopkins University, Baltimore, MD, USA
| | - Evan E. Eichler
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
- Howard Hughes Medical Institute, University of Washington, Seattle, WA, USA
| | - Adam M. Phillippy
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | - Winston Timp
- Department of Molecular Biology and Genetics, Johns Hopkins University, Baltimore, MD, USA
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
| | - Karen H. Miga
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA, USA
| | - Rachel J. O’Neill
- Department of Molecular and Cell Biology, University of Connecticut, Storrs, CT, USA
- Institute for Systems Genomics, University of Connecticut, Storrs, CT, USA
- Department of Genetics and Genome Sciences, UConn Health, Farmington, CT, USA
|
10
|
Katz KS, Shutov O, Lapoint R, Kimelman M, Brister JR, O’Sullivan C. STAT: a fast, scalable, MinHash-based k-mer tool to assess Sequence Read Archive next-generation sequence submissions. Genome Biol 2021; 22:270. [PMID: 34544477 PMCID: PMC8450716 DOI: 10.1186/s13059-021-02490-0] [Citation(s) in RCA: 28] [Impact Index Per Article: 9.3] [Received: 04/26/2021] [Accepted: 09/08/2021] [Indexed: 02/03/2024] Open
Abstract
Sequence Read Archive submissions to the National Center for Biotechnology Information often lack useful metadata, which limits the utility of these submissions. We describe the Sequence Taxonomic Analysis Tool (STAT), a scalable k-mer-based tool for fast assessment of taxonomic diversity intrinsic to submissions, independent of metadata. We show that our MinHash-based k-mer tool is accurate and scalable, offering reliable criteria for efficient selection of data for further analysis by the scientific community, at once validating submissions while also augmenting sample metadata with reliable, searchable, taxonomic terms.
Affiliation(s)
- Kenneth S. Katz
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894 USA
| | - Oleg Shutov
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894 USA
| | - Richard Lapoint
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894 USA
| | - Michael Kimelman
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894 USA
| | - J. Rodney Brister
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894 USA
| | - Christopher O’Sullivan
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894 USA
|
11
|
Xu M, Guo L, Du X, Li L, Peters BA, Deng L, Wang O, Chen F, Wang J, Jiang Z, Han J, Ni M, Yang H, Xu X, Liu X, Huang J, Fan G. Accurate haplotype-resolved assembly reveals the origin of structural variants for human trios. Bioinformatics 2021; 37:2095-2102. [PMID: 33538292 PMCID: PMC8613828 DOI: 10.1093/bioinformatics/btab068] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 08/27/2020] [Revised: 12/07/2020] [Accepted: 01/28/2021] [Indexed: 11/13/2022] Open
Abstract
Motivation: Achieving a near-complete understanding of how the genome of an individual affects the phenotypes of that individual requires deciphering the order of variations along homologous chromosomes in species with diploid genomes. However, true diploid assembly of long-range haplotypes remains challenging. Results: To address this, we have developed Haplotype-resolved Assembly for Synthetic long reads using a Trio-binning strategy, or HAST, which uses parental information to classify reads as maternal or paternal. Once sorted, these reads are used to independently de novo assemble the parent-specific haplotypes. We applied HAST to co-barcoded second-generation sequencing data from an Asian individual, resulting in a haplotype assembly covering 94.7% of the reference genome with a scaffold N50 longer than 11 Mb. The high haplotyping precision (∼99.7%) and recall (∼95.9%) represent a substantial improvement over the commonly used tool for assembling co-barcoded reads (Supernova), and are comparable to a trio-binning-based third-generation long-read assembly method (TrioCanu) but with a significantly higher single-base accuracy (up to 99.99997% (Q65)). This makes HAST a superior tool for accurate haplotyping and future haplotype-based studies. Availability: The code of the analysis is available at https://github.com/BGI-Qingdao/HAST. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Mengyang Xu
- BGI-QingDao, Qingdao, 266555, China.,BGI-Shenzhen, Shenzhen, 518083, China
| | - Lidong Guo
- BGI-QingDao, Qingdao, 266555, China.,BGI Education Center, University of Chinese Academy of Sciences, Shenzhen, 518083, China
| | - Xiao Du
- BGI-QingDao, Qingdao, 266555, China.,BGI-Shenzhen, Shenzhen, 518083, China
| | - Lei Li
- BGI-QingDao, Qingdao, 266555, China.,School of Future Technology, University of Chinese Academy of Sciences, Beijing, 101408, China
| | - Brock A Peters
- BGI-Shenzhen, Shenzhen, 518083, China.,Complete Genomics Inc, 2904 Orchard Pkwy, San Jose, California, 95134, USA
| | - Li Deng
- BGI-QingDao, Qingdao, 266555, China
| | - Ou Wang
- BGI-Shenzhen, Shenzhen, 518083, China
| | - Fang Chen
- MGI, BGI-Shenzhen, Shenzhen, 518083, China
| | - Jun Wang
- BGI-QingDao, Qingdao, 266555, China
| | | | | | - Ming Ni
- BGI-QingDao, Qingdao, 266555, China.,BGI-Shenzhen, Shenzhen, 518083, China
| | | | - Xun Xu
- BGI-Shenzhen, Shenzhen, 518083, China
| | - Xin Liu
- BGI-QingDao, Qingdao, 266555, China.,BGI-Shenzhen, Shenzhen, 518083, China.,State Key Laboratory of Agricultural Genomics, BGI-Shenzhen, Shenzhen, 518083, China
| | - Jie Huang
- National Institutes for Food and Drug Control (NIFDC), No.2 Tiantan Xili, Dongcheng District, Beijing, 10050, China
| | - Guangyi Fan
- BGI-QingDao, Qingdao, 266555, China.,BGI-Shenzhen, Shenzhen, 518083, China.,State Key Laboratory of Agricultural Genomics, BGI-Shenzhen, Shenzhen, 518083, China
|
12
|
Criscuolo A. On the transformation of MinHash-based uncorrected distances into proper evolutionary distances for phylogenetic inference. F1000Res 2020; 9:1309. [PMID: 33335719 PMCID: PMC7713896 DOI: 10.12688/f1000research.26930.1] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Accepted: 10/12/2020] [Indexed: 12/29/2022] Open
Abstract
Recently developed MinHash-based techniques were proven successful in quickly estimating the level of similarity between large nucleotide sequences. This article discusses their usage and limitations in practice to approximating uncorrected distances between genomes, and transforming these pairwise dissimilarities into proper evolutionary distances. It is notably shown that complex distance measures can be easily approximated using simple transformation formulae based on few parameters. MinHash-based techniques can therefore be very useful for implementing fast yet accurate alignment-free phylogenetic reconstruction procedures from large sets of genomes. This last point of view is assessed with a simulation study using a dedicated bioinformatics tool.
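The two steps named in this abstract (estimating an uncorrected distance from MinHash sketches, then transforming it into an evolutionary distance) can be illustrated with a minimal sketch. It is not the article's procedure: the bottom-s sketch, the Mash-style distance formula, and the Jukes-Cantor correction are stand-ins chosen for illustration, and all function names are hypothetical.

```python
import hashlib
import math


def kmer_set(seq, k):
    """Return the set of k-mers in seq (no reverse-complement handling, for brevity)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def bottom_sketch(seq, k, s):
    """Bottom-s MinHash sketch: the s smallest 64-bit hashes of the k-mers."""
    hashes = {
        int.from_bytes(hashlib.sha1(km.encode()).digest()[:8], "big")
        for km in kmer_set(seq, k)
    }
    return sorted(hashes)[:s]


def jaccard_estimate(sketch_a, sketch_b, s):
    """Estimate the Jaccard index from two bottom-s sketches."""
    merged = sorted(set(sketch_a) | set(sketch_b))[:s]
    if not merged:
        return 0.0
    shared = set(sketch_a) & set(sketch_b)
    return sum(1 for h in merged if h in shared) / len(merged)


def mash_distance(j, k):
    """Uncorrected distance from a Jaccard index j: -ln(2j / (1 + j)) / k."""
    if j <= 0.0:
        return math.inf
    return max(0.0, -math.log(2.0 * j / (1.0 + j)) / k)


def jukes_cantor(p):
    """Assumed correction (Jukes-Cantor): substitutions per site from a p-distance."""
    if p >= 0.75:
        return math.inf
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)
```

The corrected distance always exceeds the uncorrected one for p > 0, which is the reason a transformation step is needed before distance-based tree inference.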
Affiliation(s)
- Alexis Criscuolo
- Hub de Bioinformatique et Biostatistique - Département Biologie Computationnelle, Institut Pasteur, USR 3756, CNRS, 75015 Paris, France
|
13
|
Rhie A, Walenz BP, Koren S, Phillippy AM. Merqury: reference-free quality, completeness, and phasing assessment for genome assemblies. Genome Biol 2020; 21:245. [PMID: 32928274 PMCID: PMC7488777 DOI: 10.1186/s13059-020-02134-9] [Citation(s) in RCA: 617] [Impact Index Per Article: 154.3] [Received: 03/31/2020] [Accepted: 08/06/2020] [Indexed: 01/26/2023] Open
Abstract
Recent long-read assemblies often exceed the quality and completeness of available reference genomes, making validation challenging. Here we present Merqury, a novel tool for reference-free assembly evaluation based on efficient k-mer set operations. By comparing k-mers in a de novo assembly to those found in unassembled high-accuracy reads, Merqury estimates base-level accuracy and completeness. For trios, Merqury can also evaluate haplotype-specific accuracy, completeness, phase block continuity, and switch errors. Multiple visualizations, such as k-mer spectrum plots, can be generated for evaluation. We demonstrate on both human and plant genomes that Merqury is a fast and robust method for assembly validation.
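The idea of estimating base-level accuracy by comparing assembly k-mers with read k-mers can be sketched as follows. This is a simplified illustration, not Merqury's implementation: it ignores reverse complements and read-error filtering, the function names are hypothetical, and the k-mer survival formula used here should be checked against the Merqury paper before relying on it.

```python
import math


def kmer_list(seq, k):
    """All k-mers of seq, in order, with repeats."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def qv_from_counts(asm_only, total, k):
    """Phred-scaled consensus quality from k-mer counts.

    asm_only: assembly k-mers never seen in the reads (likely errors).
    total:    all assembly k-mers.
    Assumes a base is correct with probability (1 - asm_only/total)^(1/k).
    """
    if total <= 0:
        raise ValueError("assembly has no k-mers")
    if asm_only == 0:
        return math.inf
    p_correct = (1.0 - asm_only / total) ** (1.0 / k)
    return -10.0 * math.log10(1.0 - p_correct)


def kmer_qv(assembly, reads, k):
    """QV estimate for an assembly given a collection of accurate reads."""
    read_kmers = set()
    for read in reads:
        read_kmers.update(kmer_list(read, k))
    asm_kmers = kmer_list(assembly, k)
    asm_only = sum(1 for km in asm_kmers if km not in read_kmers)
    return qv_from_counts(asm_only, len(asm_kmers), k)


def kmer_completeness(assembly, reads, k):
    """Fraction of distinct read k-mers that also occur in the assembly."""
    read_kmers = set()
    for read in reads:
        read_kmers.update(kmer_list(read, k))
    if not read_kmers:
        return 0.0
    return len(read_kmers & set(kmer_list(assembly, k))) / len(read_kmers)
```

A single substitution in the assembly creates up to k erroneous k-mers, which is why the k-th root appears in the per-base estimate.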
Affiliation(s)
- Arang Rhie
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD USA
| | - Brian P. Walenz
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD USA
| | - Sergey Koren
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD USA
| | - Adam M. Phillippy
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD USA
|
14
|
Nurk S, Walenz BP, Rhie A, Vollger MR, Logsdon GA, Grothe R, Miga KH, Eichler EE, Phillippy AM, Koren S. HiCanu: accurate assembly of segmental duplications, satellites, and allelic variants from high-fidelity long reads. Genome Res 2020; 30:1291-1305. [PMID: 32801147 PMCID: PMC7545148 DOI: 10.1101/gr.263566.120] [Citation(s) in RCA: 315] [Impact Index Per Article: 78.8] [Received: 03/14/2020] [Accepted: 08/04/2020] [Indexed: 12/14/2022]
Abstract
Complete and accurate genome assemblies form the basis of most downstream genomic analyses and are of critical importance. Recent genome assembly projects have relied on a combination of noisy long-read sequencing and accurate short-read sequencing, with the former offering greater assembly continuity and the latter providing higher consensus accuracy. The recently introduced Pacific Biosciences (PacBio) HiFi sequencing technology bridges this divide by delivering long reads (>10 kbp) with high per-base accuracy (>99.9%). Here we present HiCanu, a modification of the Canu assembler designed to leverage the full potential of HiFi reads via homopolymer compression, overlap-based error correction, and aggressive false overlap filtering. We benchmark HiCanu with a focus on the recovery of haplotype diversity, major histocompatibility complex (MHC) variants, satellite DNAs, and segmental duplications. For diploid human genomes sequenced to 30× HiFi coverage, HiCanu achieved superior accuracy and allele recovery compared to the current state of the art. On the effectively haploid CHM13 human cell line, HiCanu achieved an NG50 contig size of 77 Mbp with a per-base consensus accuracy of 99.999% (QV50), surpassing recent assemblies of high-coverage, ultralong Oxford Nanopore Technologies (ONT) reads in terms of both accuracy and continuity. This HiCanu assembly correctly resolves 337 out of 341 validation BACs sampled from known segmental duplications and provides the first preliminary assemblies of nine complete human centromeric regions. Although gaps and errors still remain within the most challenging regions of the genome, these results represent a significant advance toward the complete assembly of human genomes.
Affiliation(s)
- Sergey Nurk
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland 20894, USA
| | - Brian P Walenz
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland 20894, USA
| | - Arang Rhie
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland 20894, USA
| | - Mitchell R Vollger
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, Washington 98195, USA
| | - Glennis A Logsdon
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, Washington 98195, USA
| | - Robert Grothe
- Pacific Biosciences, Menlo Park, California 94025, USA
| | - Karen H Miga
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, California 95064, USA
| | - Evan E Eichler
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, Washington 98195, USA
- Howard Hughes Medical Institute, University of Washington, Seattle, Washington 98195, USA
| | - Adam M Phillippy
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland 20894, USA
| | - Sergey Koren
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland 20894, USA
|
15
|
Bermúdez-Barrientos JR, Ramírez-Sánchez O, Chow FWN, Buck AH, Abreu-Goodger C. Disentangling sRNA-Seq data to study RNA communication between species. Nucleic Acids Res 2020; 48:e21. [PMID: 31879784 PMCID: PMC7038986 DOI: 10.1093/nar/gkz1198] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 04/03/2019] [Revised: 11/23/2019] [Accepted: 12/18/2019] [Indexed: 12/28/2022] Open
Abstract
Many organisms exchange small RNAs (sRNAs) during their interactions, that can target or bolster defense strategies in host-pathogen systems. Current sRNA-Seq technology can determine the sRNAs present in any symbiotic system, but there are very few bioinformatic tools available to interpret the results. We show that one of the biggest challenges comes from sequences that map equally well to the genomes of both interacting organisms. This arises due to the small size of the sRNAs compared to large genomes, and because a large portion of sequenced sRNAs come from genomic regions that encode highly conserved miRNAs, rRNAs or tRNAs. Here, we present strategies to disentangle sRNA-Seq data from samples of communicating organisms, developed using diverse plant and animal species that are known to receive or exchange RNA with their symbionts. We show that sequence assembly, both de novo and genome-guided, can be used for these sRNA-Seq data, greatly reducing the ambiguity of mapping reads. Even confidently mapped sequences can be misleading, so we further demonstrate the use of differential expression strategies to determine true parasite-derived sRNAs within host cells. We validate our methods on new experiments designed to probe the nature of the extracellular vesicle sRNAs from the parasitic nematode Heligmosomoides bakeri that get into mouse intestinal epithelial cells.
Affiliation(s)
- José Roberto Bermúdez-Barrientos
- Unidad de Genómica Avanzada (Langebio), Centro de Investigación y de Estudios Avanzados del IPN, Irapuato, Guanajuato 36824, México
| | - Obed Ramírez-Sánchez
- Unidad de Genómica Avanzada (Langebio), Centro de Investigación y de Estudios Avanzados del IPN, Irapuato, Guanajuato 36824, México
| | - Franklin Wang-Ngai Chow
- Institute of Immunology and Infection Research and Centre for Immunity, Infection & Evolution, School of Biological Sciences, The University of Edinburgh, Edinburgh EH9 3JT, UK
| | - Amy H Buck
- Institute of Immunology and Infection Research and Centre for Immunity, Infection & Evolution, School of Biological Sciences, The University of Edinburgh, Edinburgh EH9 3JT, UK
| | - Cei Abreu-Goodger
- Unidad de Genómica Avanzada (Langebio), Centro de Investigación y de Estudios Avanzados del IPN, Irapuato, Guanajuato 36824, México
|
16
|
Borrayo E, May-Canche I, Paredes O, Morales JA, Romo-Vázquez R, Vélez-Pérez H. Whole-Genome k-mer Topic Modeling Associates Bacterial Families. Genes (Basel) 2020; 11:genes11020197. [PMID: 32075081 PMCID: PMC7074292 DOI: 10.3390/genes11020197] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 01/07/2020] [Revised: 02/07/2020] [Accepted: 02/09/2020] [Indexed: 11/16/2022] Open
Abstract
Alignment-free k-mer-based algorithms in whole genome sequence comparisons remain an ongoing challenge. Here, we explore the possibility of using Topic Modeling for organism whole-genome comparisons. We analyzed 30 complete genomes from three bacterial families by topic modeling. For this, each genome was considered as a document and 13-mer nucleotide representations as words. Latent Dirichlet allocation was used as the probabilistic modeling of the corpus. We were able to identify the topic distribution among analyzed genomes, which is highly consistent with traditional hierarchical classification. It is possible that topic modeling may be applied to establish relationships between genome composition and biological phenomena.
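The "genome as document, 13-mers as words" representation described in this abstract can be sketched as a document-term matrix, which is the usual input to a Latent Dirichlet allocation implementation. The sketch below covers only that preprocessing step, not the LDA fit itself; the function names and the choice to skip k-mers containing ambiguous bases are assumptions made for illustration.

```python
from collections import Counter


def genome_to_word_counts(genome, k=13):
    """Count overlapping k-mer 'words' in one genome, skipping ambiguous bases."""
    genome = genome.upper()
    valid = set("ACGT")
    counts = Counter()
    for i in range(len(genome) - k + 1):
        word = genome[i:i + k]
        if set(word) <= valid:
            counts[word] += 1
    return counts


def document_term_matrix(genomes, k=13):
    """Build (vocabulary, matrix) from a dict mapping genome name to sequence.

    Row order follows the iteration order of `genomes`; columns follow the
    sorted vocabulary of all observed k-mers.
    """
    per_genome = {name: genome_to_word_counts(seq, k) for name, seq in genomes.items()}
    vocabulary = sorted(set().union(*per_genome.values()))
    matrix = [[per_genome[name][word] for word in vocabulary] for name in genomes]
    return vocabulary, matrix
```

With k = 13 the vocabulary can reach 4^13 (about 67 million) words, so a real analysis would likely need a sparse matrix rather than the dense lists used here.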
Affiliation(s)
- Ernesto Borrayo
- Electronics Department, CUCEI, Universidad de Guadalajara, Jalisco 44100, Mexico
| | - Isaias May-Canche
- Computer Sciences Department, CUCEI, Universidad de Guadalajara, Jalisco 44100, Mexico
- Instituto Tecnológico de Chetumal, Quintana Roo 77000, Mexico
| | - Omar Paredes
- Computer Sciences Department, CUCEI, Universidad de Guadalajara, Jalisco 44100, Mexico
| | - J. Alejandro Morales
- Computer Sciences Department, CUCEI, Universidad de Guadalajara, Jalisco 44100, Mexico
| | - Rebeca Romo-Vázquez
- Computer Sciences Department, CUCEI, Universidad de Guadalajara, Jalisco 44100, Mexico
| | - Hugo Vélez-Pérez
- Computer Sciences Department, CUCEI, Universidad de Guadalajara, Jalisco 44100, Mexico
|
17
|
Criscuolo A. A fast alignment-free bioinformatics procedure to infer accurate distance-based phylogenetic trees from genome assemblies. Research Ideas and Outcomes 2019. [DOI: 10.3897/rio.5.e36178] [Citation(s) in RCA: 36] [Impact Index Per Article: 7.2] [Indexed: 12/26/2022] Open
Abstract
This paper describes a novel alignment-free distance-based procedure for inferring phylogenetic trees from genome contig sequences using publicly available bioinformatics tools. For each pair of genomes, a dissimilarity measure is first computed and next transformed to obtain an estimation of the number of substitution events that have occurred during their evolution. These pairwise evolutionary distances are then used to infer a phylogenetic tree and assess a confidence support for each internal branch. Analyses of both simulated and real genome datasets show that this bioinformatics procedure allows accurate phylogenetic trees to be reconstructed with fast running times, especially when launched on multiple threads. Implemented in a publicly available script, named JolyTree, this procedure is a useful approach for quickly inferring species trees without the burden and potential biases of multiple sequence alignments.
|
18
|
Sarmashghi S, Bohmann K, Gilbert MTP, Bafna V, Mirarab S. Skmer: assembly-free and alignment-free sample identification using genome skims. Genome Biol 2019; 20:34. [PMID: 30760303 PMCID: PMC6374904 DOI: 10.1186/s13059-019-1632-4] [Citation(s) in RCA: 54] [Impact Index Per Article: 10.8] [Received: 03/04/2018] [Accepted: 01/16/2019] [Indexed: 01/10/2023] Open
Abstract
The ability to inexpensively describe taxonomic diversity is critical in this era of rapid climate and biodiversity changes. The recent genome-skimming approach extends current barcoding practices beyond short markers by applying low-pass sequencing and recovering whole organelle genomes computationally. This approach discards the nuclear DNA, which constitutes the vast majority of the data. In contrast, we suggest using all unassembled reads. We introduce an assembly-free and alignment-free tool, Skmer, to compute genomic distances between the query and reference genome skims. Skmer shows excellent accuracy in estimating distances and identifying the closest match in reference datasets.
Affiliation(s)
- Shahab Sarmashghi
- Department of Electrical & Computer Engineering, University of California, San Diego, La Jolla, 92093 CA USA
| | - Kristine Bohmann
- Evolutionary Genomics, Natural History Museum of Denmark, University of Copenhagen, Copenhagen, Denmark
- School of Biological Sciences, University of East Anglia, Norwich, Norfolk UK
| | - M. Thomas P. Gilbert
- Evolutionary Genomics, Natural History Museum of Denmark, University of Copenhagen, Copenhagen, Denmark
- Norwegian University of Science and Technology, Trondheim, 7491 Norway
| | - Vineet Bafna
- Department of Computer Science & Engineering, University of California, San Diego, La Jolla, 92093 CA USA
| | - Siavash Mirarab
- Department of Electrical & Computer Engineering, University of California, San Diego, La Jolla, 92093 CA USA
|
19
|
Choi I, Ponsero AJ, Bomhoff M, Youens-Clark K, Hartman JH, Hurwitz BL. Libra: scalable k-mer-based tool for massive all-vs-all metagenome comparisons. Gigascience 2019; 8:5266304. [PMID: 30597002 PMCID: PMC6354030 DOI: 10.1093/gigascience/giy165] [Citation(s) in RCA: 24] [Impact Index Per Article: 4.8] [Received: 08/24/2018] [Accepted: 12/17/2018] [Indexed: 11/23/2022] Open
Abstract
Background Shotgun metagenomics provides powerful insights into microbial community biodiversity and function. Yet, inferences from metagenomic studies are often limited by dataset size and complexity and are restricted by the availability and completeness of existing databases. De novo comparative metagenomics enables the comparison of metagenomes based on their total genetic content. Results We developed a tool called Libra that performs an all-vs-all comparison of metagenomes for precise clustering based on their k-mer content. Libra uses a scalable Hadoop framework for massive metagenome comparisons, Cosine Similarity for calculating the distance using sequence composition and abundance while normalizing for sequencing depth, and a web-based implementation in iMicrobe (http://imicrobe.us) that uses the CyVerse advanced cyberinfrastructure to promote broad use of the tool by the scientific community. Conclusions A comparison of Libra to equivalent tools using both simulated and real metagenomic datasets, ranging from 80 million to 4.2 billion reads, reveals that methods commonly implemented to reduce compute time for large datasets, such as data reduction, read count normalization, and presence/absence distance metrics, greatly diminish the resolution of large-scale comparative analyses. In contrast, Libra uses all of the reads to calculate k-mer abundance in a Hadoop architecture that can scale to any size dataset to enable global-scale analyses and link microbial signatures to biological processes.
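The cosine-similarity comparison of k-mer abundance profiles mentioned in this abstract can be sketched on a single machine as follows. This is only an illustration of the metric: it does not reproduce Libra's Hadoop implementation or any weighting scheme it may apply, and the function names are hypothetical.

```python
import math
from collections import Counter


def kmer_profile(reads, k):
    """k-mer abundance profile (k-mer -> count) over all reads of one sample."""
    profile = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            profile[read[i:i + k]] += 1
    return profile


def cosine_similarity(profile_a, profile_b):
    """Cosine of the angle between two abundance vectors (0.0 if either is empty)."""
    dot = sum(count * profile_b.get(kmer, 0) for kmer, count in profile_a.items())
    norm_a = math.sqrt(sum(c * c for c in profile_a.values()))
    norm_b = math.sqrt(sum(c * c for c in profile_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Because the cosine depends only on the direction of the abundance vector, sequencing the same community three times as deeply leaves the similarity unchanged, which is one way such a metric normalizes for sequencing depth.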
Affiliation(s)
- Illyoung Choi
- Department of Computer Science, University of Arizona, 1040 E. 4th Street, Tucson, Arizona, 85721, USA
| | - Alise J Ponsero
- Department of Biosystems Engineering, University of Arizona, 1177 E. 4th Street, Tucson, Arizona, 85721, USA
| | - Matthew Bomhoff
- Department of Biosystems Engineering, University of Arizona, 1177 E. 4th Street, Tucson, Arizona, 85721, USA
| | - Ken Youens-Clark
- Department of Biosystems Engineering, University of Arizona, 1177 E. 4th Street, Tucson, Arizona, 85721, USA
| | - John H Hartman
- Department of Computer Science, University of Arizona, 1040 E. 4th Street, Tucson, Arizona, 85721, USA
| | - Bonnie L Hurwitz
- Department of Biosystems Engineering, University of Arizona, 1177 E. 4th Street, Tucson, Arizona, 85721, USA.,BIO5 Institute, University of Arizona, 1657 E. Helen Street, Tucson, Arizona, 85719, USA
|
20
|
Koren S, Rhie A, Walenz BP, Dilthey AT, Bickhart DM, Kingan SB, Hiendleder S, Williams JL, Smith TPL, Phillippy AM. De novo assembly of haplotype-resolved genomes with trio binning. Nat Biotechnol 2018; 36:nbt.4277. [PMID: 30346939 PMCID: PMC6476705 DOI: 10.1038/nbt.4277] [Citation(s) in RCA: 248] [Impact Index Per Article: 41.3] [Received: 02/25/2018] [Accepted: 09/10/2018] [Indexed: 12/20/2022]
Abstract
Complex allelic variation hampers the assembly of haplotype-resolved sequences from diploid genomes. We developed trio binning, an approach that simplifies haplotype assembly by resolving allelic variation before assembly. In contrast with prior approaches, the effectiveness of our method improved with increasing heterozygosity. Trio binning uses short reads from two parental genomes to first partition long reads from an offspring into haplotype-specific sets. Each haplotype is then assembled independently, resulting in a complete diploid reconstruction. We used trio binning to recover both haplotypes of a diploid human genome and identified complex structural variants missed by alternative approaches. We sequenced an F1 cross between the cattle subspecies Bos taurus taurus and Bos taurus indicus and completely assembled both parental haplotypes with NG50 haplotig sizes of >20 Mb and 99.998% accuracy, surpassing the quality of current cattle reference genomes. We suggest that trio binning improves diploid genome assembly and will facilitate new studies of haplotype variation and inheritance.
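The partitioning step described in this abstract (using parental short reads to sort offspring long reads into haplotype-specific sets) can be sketched with parent-specific k-mers. This is a simplified illustration under stated assumptions: it ignores sequencing errors, reverse complements, and any normalization by k-mer set size, and the function names are hypothetical.

```python
def kmer_set(seq, k):
    """Set of k-mers in one sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def haplotype_specific_kmers(maternal_reads, paternal_reads, k):
    """Return (maternal-only, paternal-only) k-mer sets from parental short reads."""
    maternal = set()
    for read in maternal_reads:
        maternal |= kmer_set(read, k)
    paternal = set()
    for read in paternal_reads:
        paternal |= kmer_set(read, k)
    return maternal - paternal, paternal - maternal


def bin_read(read, maternal_only, paternal_only, k):
    """Assign an offspring read to the parent whose specific k-mers it carries more of."""
    kmers = kmer_set(read, k)
    maternal_hits = len(kmers & maternal_only)
    paternal_hits = len(kmers & paternal_only)
    if maternal_hits > paternal_hits:
        return "maternal"
    if paternal_hits > maternal_hits:
        return "paternal"
    return "unassigned"
```

Reads from homozygous regions carry no parent-specific k-mers and stay unassigned in this sketch; more heterozygosity means more informative k-mers, consistent with the abstract's remark that the method improves with increasing heterozygosity.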
Affiliation(s)
- Sergey Koren
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, Bethesda, Maryland, USA
| | - Arang Rhie
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, Bethesda, Maryland, USA
| | - Brian P. Walenz
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, Bethesda, Maryland, USA
| | - Alexander T. Dilthey
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, Bethesda, Maryland, USA
- Institute of Medical Microbiology, Heinrich-Heine-University Düsseldorf, Düsseldorf, North Rhine-Westphalia, Germany
| | - Derek M. Bickhart
- Cell Wall Biology and Utilization Laboratory, ARS USDA, Madison, Wisconsin, USA
| | | | - Stefan Hiendleder
- Davies Research Centre, School of Animal and Veterinary Sciences, The University of Adelaide, Roseworthy SA, Australia
- Robinson Research Institute, The University of Adelaide, Adelaide SA, Australia
| | - John L. Williams
- Davies Research Centre, School of Animal and Veterinary Sciences, The University of Adelaide, Roseworthy SA, Australia
| | | | - Adam M. Phillippy
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, Bethesda, Maryland, USA
|
21
|
Wang Z, Lou H, Wang Y, Shamir R, Jiang R, Chen T. GePMI: A statistical model for personal intestinal microbiome identification. NPJ Biofilms Microbiomes 2018; 4:20. [PMID: 30210803 PMCID: PMC6123480 DOI: 10.1038/s41522-018-0065-2] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Received: 01/26/2018] [Revised: 07/19/2018] [Accepted: 08/02/2018] [Indexed: 02/07/2023] Open
Abstract
Human gut microbiomes consist of a large number of microbial genomes, which vary by diet and health conditions and from individual to individual. In the present work, we asked whether such variation or similarity could be measured and, if so, whether the results could be used for personal microbiome identification (PMI). To address this question, we herein propose a method to estimate the significance of similarity among human gut metagenomic samples based on reference-free, long k-mer features. Using these features, we find that pairwise similarities between the metagenomes of any two individuals obey a beta distribution and that a p value derived accordingly well characterizes whether two samples are from the same individual or not. We develop a computational framework called GePMI (Generating inter-individual similarity distribution for Personal Microbiome Identification) and apply it to several human gut metagenomic datasets (>300 individuals and >600 samples in total). From the results of GePMI, most of the human gut microbiomes can be identified (auROC = 0.9470, auPRC = 0.8702). Even after antibiotic treatment or fecal microbiota transplantation, the individual k-mer signature still maintains a certain specificity.
Affiliation(s)
- Zicheng Wang
- MOE Key Laboratory of Bioinformatics and Bioinformatics Division, BNLIST and Department of Automation, Tsinghua University, 100084 Beijing, China
| | - Huazhe Lou
- Bioinformatics Division, BNLIST and Department of Computer Science and Technology, Tsinghua University, 100084 Beijing, China
| | - Ying Wang
- Department of Automation, Xiamen University, 361005 Fujian, China
| | - Ron Shamir
- Blavatnik School of Computer Science, Tel-Aviv University, Tel Aviv, Israel
| | - Rui Jiang
- MOE Key Laboratory of Bioinformatics and Bioinformatics Division, BNLIST and Department of Automation, Tsinghua University, 100084 Beijing, China
| | - Ting Chen
- Bioinformatics Division, BNLIST and Department of Computer Science and Technology, Tsinghua University, 100084 Beijing, China
|
22
|
Wang Y, Fu L, Ren J, Yu Z, Chen T, Sun F. Identifying Group-Specific Sequences for Microbial Communities Using Long k-mer Sequence Signatures. Front Microbiol 2018; 9:872. [PMID: 29774017 PMCID: PMC5943621 DOI: 10.3389/fmicb.2018.00872] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.3] [Received: 11/15/2017] [Accepted: 04/16/2018] [Indexed: 12/19/2022] Open
Abstract
Comparing metagenomic samples is crucial for understanding microbial communities. For different groups of microbial communities, such as human gut metagenomic samples from patients with a certain disease and healthy controls, identifying group-specific sequences offers essential information for potential biomarker discovery. A sequence that is present, or rich, in one group, but absent, or scarce, in another group is considered "group-specific" in our study. Our main purpose is to discover group-specific sequence regions between control and case groups as disease-associated markers. We developed a long k-mer (k ≥ 30 bps)-based computational pipeline to detect group-specific sequences at strain resolution free from reference sequences, sequence alignments, and metagenome-wide de novo assembly. We called our method MetaGO: Group-specific oligonucleotide analysis for metagenomic samples. An open-source pipeline on Apache Spark was developed with parallel computing. We applied MetaGO to one simulated and three real metagenomic datasets to evaluate the discriminative capability of identified group-specific markers. In the simulated dataset, 99.11% of group-specific logical 40-mers covered 98.89% disease-specific regions from the disease-associated strain. In addition, 97.90% of group-specific numerical 40-mers covered 99.61 and 96.39% of differentially abundant genome and regions between two groups, respectively. For a large-scale metagenomic liver cirrhosis (LC)-associated dataset, we identified 37,647 group-specific 40-mer features. Any one of the features can predict disease status of the training samples with the average of sensitivity and specificity higher than 0.8. The random forests classification using the top 10 group-specific features yielded a higher AUC (from ∼0.8 to ∼0.9) than that of previous studies. All group-specific 40-mers were present in LC patients, but not healthy controls. 
All the assembled 11 LC-specific sequences can be mapped to two strains of Veillonella parvula: UTDB1-3 and DSM2008. The experiments on the other two real datasets related to Inflammatory Bowel Disease and Type 2 Diabetes in Women consistently demonstrated that MetaGO achieved better prediction accuracy with fewer features compared to previous studies. The experiments showed that MetaGO is a powerful tool for identifying group-specific k-mers, which would be clinically applicable for disease prediction. MetaGO is available at https://github.com/VVsmileyx/MetaGO.
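The notion of a "group-specific" k-mer used in this abstract (present in one group, absent in the other) can be sketched as a presence/absence filter. This is not the MetaGO pipeline: reading "logical" features as presence/absence is an interpretation, the `min_case_fraction` threshold is a hypothetical parameter, and the function names are invented for illustration.

```python
def sample_kmers(reads, k):
    """Set of k-mers observed anywhere in one sample's reads."""
    kmers = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmers.add(read[i:i + k])
    return kmers


def group_specific_kmers(case_samples, control_samples, k, min_case_fraction=1.0):
    """k-mers present in at least min_case_fraction of case samples and in no control.

    case_samples / control_samples: lists of samples, each a list of reads.
    """
    if not case_samples:
        return set()
    case_sets = [sample_kmers(reads, k) for reads in case_samples]
    control_union = set()
    for reads in control_samples:
        control_union |= sample_kmers(reads, k)
    specific = set()
    for kmer in set().union(*case_sets):
        if kmer in control_union:
            continue
        fraction = sum(1 for s in case_sets if kmer in s) / len(case_sets)
        if fraction >= min_case_fraction:
            specific.add(kmer)
    return specific
```

For example, `group_specific_kmers(case, control, k=40)` would mirror the 40-mer length quoted in the abstract, although holding all 40-mers of a real metagenome in memory this way would not scale.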
Affiliation(s)
- Ying Wang
- Department of Automation, Xiamen University, Xiamen, China
| | - Lei Fu
- Department of Automation, Xiamen University, Xiamen, China
| | - Jie Ren
- Molecular and Computational Biology Program, University of Southern California, Los Angeles, CA, United States
| | - Zhaoxia Yu
- Department of Statistics, University of California, Irvine, Irvine, CA, United States
| | - Ting Chen
- Molecular and Computational Biology Program, University of Southern California, Los Angeles, CA, United States
- Bioinformatics Division, Tsinghua National Laboratory of Information Science and Technology, Tsinghua University, Beijing, China
- Department of Computer Science and Technology, Tsinghua University, Beijing, China
| | - Fengzhu Sun
- Molecular and Computational Biology Program, University of Southern California, Los Angeles, CA, United States
- Center for Computational Systems Biology, Fudan University, Shanghai, China
|
23
|
Jia Y, Li H, Wang J, Meng H, Yang Z. Spectrum structures and biological functions of 8-mers in the human genome. Genomics 2018. [PMID: 29522801 DOI: 10.1016/j.ygeno.2018.03.006] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Indexed: 11/30/2022]
Abstract
The spectra of k-mer frequencies can reveal the structures and evolution of genome sequences. We confirmed that the trimodal spectrum of 8-mers in human genome sequences is distinguished only by the CG2, CG1 and CG0 8-mer sets, containing 2, 1 or 0 CpG, respectively. This phenomenon is called the independent selection law. The three types of CG 8-mers were considered as different functional elements. We conjectured that (1) nucleosome binding motifs are mainly characterized by CG1 8-mers and (2) the core structural units of CpG island sequences are predominantly characterized by CG2 8-mers. To validate our conjectures, nucleosome-occupied sequences and CGI sequences were extracted, and the sequence parameters were then constructed from the information of each of the three CG 8-mer sets. ROC analysis showed that CG1 8-mers are preferred in nucleosome-occupied segments (AUC > 0.7) and CG2 8-mers are preferred in CGI sequences (AUC > 0.99). This validates our conjectures in principle.
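The CG0/CG1/CG2 partition of 8-mers described in this abstract can be sketched by counting CpG dinucleotides in each 8-mer. The sketch is illustrative only: placing 8-mers with more than two CpGs in the CG2 class is an assumption, and the function names are hypothetical.

```python
from collections import Counter


def cg_class(kmer):
    """Classify a k-mer as CG0, CG1 or CG2 by its number of CpG dinucleotides.

    k-mers with more than two CpGs are placed in CG2 (an assumption here).
    """
    n_cpg = kmer.upper().count("CG")
    return "CG2" if n_cpg >= 2 else f"CG{n_cpg}"


def class_counts(genome, k=8):
    """Per-class k-mer occurrence counts for one sequence."""
    counts = {"CG0": Counter(), "CG1": Counter(), "CG2": Counter()}
    genome = genome.upper()
    for i in range(len(genome) - k + 1):
        kmer = genome[i:i + k]
        counts[cg_class(kmer)][kmer] += 1
    return counts


def class_spectrum(counter):
    """Frequency spectrum: how many distinct k-mers occur exactly n times."""
    return Counter(counter.values())
```

Computing `class_spectrum` separately for each of the three classes gives the three component distributions that, per the abstract, together make up the trimodal 8-mer spectrum.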
Affiliation(s)
- Yun Jia
- Laboratory of Theoretical Biophysics, School of Physical Science and Technology, Inner Mongolia University, Hohhot 010021, China; College of Science, Inner Mongolia University of Technology, Hohhot 010051, China
| | - Hong Li
- Laboratory of Theoretical Biophysics, School of Physical Science and Technology, Inner Mongolia University, Hohhot 010021, China.
| | - Jingfeng Wang
- College of Science, Inner Mongolia University of Technology, Hohhot 010051, China
| | - Hu Meng
- Laboratory of Theoretical Biophysics, School of Physical Science and Technology, Inner Mongolia University, Hohhot 010021, China
| | - Zhenhua Yang
- Laboratory of Theoretical Biophysics, School of Physical Science and Technology, Inner Mongolia University, Hohhot 010021, China
|
24
|
Zheng Y, Li H, Wang Y, Meng H, Zhang Q, Zhao X. Evolutionary mechanism and biological functions of 8-mers containing CG dinucleotide in yeast. Chromosome Res 2017; 25:173-189. [PMID: 28181048 DOI: 10.1007/s10577-017-9554-z] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 10/31/2016] [Revised: 12/27/2016] [Accepted: 01/27/2017] [Indexed: 01/01/2023]
Abstract
The rules of non-random k-mer usage and their biological functions are worthy of special attention. First, we studied human 8-mer spectra and found that, when the 8-mers were classified into three subsets under each of the 16 dinucleotide classifications, only the cytosine-guanine (CG) classification produced spectra that formed independent unimodal distributions. Second, the distribution rules were reproduced in seven other species, including yeast, which showed that the phenomenon is universal across species. We therefore proposed two theoretical conjectures: (1) CG1 motifs (8-mers including one CG) are the nucleosome-binding motifs; (2) CG2 motifs (8-mers including two or more CG) are the modular units of CpG islands. Our conjectures were confirmed in yeast by the following results: the maximum average area under the receiver operating characteristic curve (AUC) resulted from CG1 information when nucleosome core sequences and linker sequences were distinguished by the three CG subsets; there was a one-to-one relationship between abundant CG1 signal regions and histone positions; the sequence changes of squeezed nucleosomes were related to the strength of CG1 signals; and an AUC value of 0.986 was obtained from CG2 information when CpG islands and non-CpG islands were distinguished by the three CG subsets.
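The CG0/CG1/CG2 classification described in this abstract can be illustrated with a minimal sketch. The function names below (`count_kmers`, `cg_class`, `split_by_cg`) are hypothetical and not taken from the paper; the sketch only assumes that an 8-mer is assigned to a subset by the number of CG dinucleotides it contains, with two or more counted as CG2.

```python
from collections import Counter


def count_kmers(seq, k=8):
    """Count overlapping k-mers that consist only of A, C, G and T."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):  # skip windows containing N or other symbols
            counts[kmer] += 1
    return counts


def cg_class(kmer):
    """Return 'CG0', 'CG1' or 'CG2' (two or more CG dinucleotides)."""
    # "CG" cannot overlap itself, so str.count gives the exact number of occurrences.
    return f"CG{min(kmer.count('CG'), 2)}"


def split_by_cg(counts):
    """Partition a k-mer count table into the three CG subsets."""
    subsets = {"CG0": {}, "CG1": {}, "CG2": {}}
    for kmer, c in counts.items():
        subsets[cg_class(kmer)][kmer] = c
    return subsets
```

Plotting a histogram of the count values in each subset separately would give the three per-subset spectra that the abstract refers to.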
Affiliation(s)
- Yan Zheng
- Laboratory of Theoretical Biophysics, School of Physical Science and Technology, Inner Mongolia University, Hohhot, 010021, China
- Hong Li
- Laboratory of Theoretical Biophysics, School of Physical Science and Technology, Inner Mongolia University, Hohhot, 010021, China; No. 235, West University Street, Hohhot, Inner Mongolia, China
- Yue Wang
- Laboratory of Theoretical Biophysics, School of Physical Science and Technology, Inner Mongolia University, Hohhot, 010021, China
- Hu Meng
- Laboratory of Theoretical Biophysics, School of Physical Science and Technology, Inner Mongolia University, Hohhot, 010021, China
- Qiang Zhang
- College of Science, Inner Mongolia Agricultural University, Hohhot, 010018, China
- Xiaoqing Zhao
- Biotechnology Research Centre, Inner Mongolia Academy of Agricultural and Animal Husbandry Science, Hohhot, 010021, China
|
25
|
Bonnici V, Manca V. Informational laws of genome structures. Sci Rep 2016; 6:28840. [PMID: 27354155 PMCID: PMC4937431 DOI: 10.1038/srep28840] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.1] [Received: 02/01/2016] [Accepted: 06/09/2016] [Indexed: 01/06/2023] Open
Abstract
In recent years, the analysis of genomes by means of strings of length k occurring in the genomes, called k-mers, has provided important insights into the basic mechanisms and design principles of genome structures. In the present study, we focus on the proper choice of the value of k for applying information theoretic concepts that express intrinsic aspects of genomes. The value k = log2(n), where n is the genome length, is determined to be the best choice in the definition of some genomic informational indexes that are studied and computed for seventy genomes. These indexes, which are based on information entropies and on suitable comparisons with random genomes, suggest five informational laws, which all of the considered genomes obey. Moreover, an informational genome complexity measure is proposed, which is a generalized logistic map that balances entropic and anti-entropic components of genomes and is related to their evolutionary dynamics. Finally, applications to computational synthetic biology are briefly outlined.
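As a rough illustration of the choice of k discussed in this abstract, the sketch below computes a word length from the genome length and the Shannon entropy of the empirical k-mer distribution. The function names are hypothetical, rounding the logarithm to the nearest integer is an assumption of this sketch, and the paper's specific informational indexes are not reproduced here.

```python
import math
from collections import Counter


def suggested_k(genome_length):
    """k = log2(n); rounding to the nearest integer is an assumption of this sketch."""
    return max(1, round(math.log2(genome_length)))


def kmer_entropy(seq, k):
    """Shannon entropy, in bits, of the empirical distribution of overlapping k-mers."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

For a sequence of length n, the entropy at word length k cannot exceed log2(n - k + 1), because at most n - k + 1 distinct k-mers can occur. Comparing the observed value with that bound is one simple way to see how far a sequence is from a random one.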
Affiliation(s)
- Vincenzo Bonnici
- Department of Computer Science, University of Verona, Verona 37134, Italy; Center for BioMedical Computing, University of Verona, Verona 37134, Italy
- Vincenzo Manca
- Department of Computer Science, University of Verona, Verona 37134, Italy; Center for BioMedical Computing, University of Verona, Verona 37134, Italy
|
26
|
Ondov BD, Treangen TJ, Melsted P, Mallonee AB, Bergman NH, Koren S, Phillippy AM. Mash: fast genome and metagenome distance estimation using MinHash. Genome Biol 2016; 17:132. [PMID: 27323842 PMCID: PMC4915045 DOI: 10.1186/s13059-016-0997-x] [Citation(s) in RCA: 1503] [Impact Index Per Article: 187.9] [Received: 12/31/2015] [Accepted: 06/03/2016] [Indexed: 02/07/2023] Open
Abstract
Mash extends the MinHash dimensionality-reduction technique to include a pairwise mutation distance and P value significance test, enabling the efficient clustering and search of massive sequence collections. Mash reduces large sequences and sequence sets to small, representative sketches, from which global mutation distances can be rapidly estimated. We demonstrate several use cases, including the clustering of all 54,118 NCBI RefSeq genomes in 33 CPU h; real-time database search using assembled or unassembled Illumina, Pacific Biosciences, and Oxford Nanopore data; and the scalable clustering of hundreds of metagenomic samples by composition. Mash is freely released under a BSD license (https://github.com/marbl/mash).
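The sketch-based distance described in this abstract can be approximated with a small bottom-s MinHash example. This is an illustrative sketch only: it hashes k-mers with `hashlib.blake2b` rather than the hash function used by Mash, it ignores canonical (strand-independent) k-mers, and all function names are hypothetical. The distance formula D = -(1/k) ln(2j / (1 + j)) is the Mash distance for a Jaccard estimate j.

```python
import hashlib
import math


def kmer_set(seq, k):
    """Set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hash_kmer(kmer):
    """64-bit hash of a k-mer (blake2b is a stand-in chosen for this sketch)."""
    return int.from_bytes(hashlib.blake2b(kmer.encode(), digest_size=8).digest(), "big")


def sketch(seq, k=21, s=1000):
    """Bottom-s sketch: the s smallest hash values of the k-mer set."""
    return sorted({hash_kmer(x) for x in kmer_set(seq, k)})[:s]


def jaccard_estimate(a, b, s=1000):
    """Fraction of the s smallest hashes of the merged sketches found in both."""
    sa, sb = set(a), set(b)
    merged = sorted(sa | sb)[:s]
    if not merged:
        return 0.0
    return sum(1 for x in merged if x in sa and x in sb) / len(merged)


def mash_distance(j, k=21):
    """Mash distance for a Jaccard estimate j; defined as 1.0 when j is 0."""
    if j == 0:
        return 1.0
    return max(0.0, -math.log(2 * j / (1 + j)) / k)
```

Two identical sequences give a Jaccard estimate of 1 and a distance of 0, while sequences that share no k-mers give a distance of 1.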
Affiliation(s)
- Brian D Ondov
- National Biodefense Analysis and Countermeasures Center, Frederick, MD, USA
- Todd J Treangen
- National Biodefense Analysis and Countermeasures Center, Frederick, MD, USA
- Páll Melsted
- Faculty of Industrial Engineering, Mechanical Engineering and Computer Science, University of Iceland, Reykjavik, Iceland
- Adam B Mallonee
- National Biodefense Analysis and Countermeasures Center, Frederick, MD, USA
- Nicholas H Bergman
- National Biodefense Analysis and Countermeasures Center, Frederick, MD, USA
- Sergey Koren
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
- Adam M Phillippy
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
|
27
|
Rare k-mer DNA: Identification of sequence motifs and prediction of CpG island and promoter. J Theor Biol 2015; 387:88-100. [PMID: 26427337 DOI: 10.1016/j.jtbi.2015.09.014] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.2] [Received: 01/01/2015] [Revised: 09/10/2015] [Accepted: 09/15/2015] [Indexed: 12/20/2022]
Abstract
Empirical analysis of DNA k-mers has proven to be an effective tool for finding unique patterns in DNA sequences, which can lead to the discovery of potential sequence motifs. In an extensive empirical k-mer study of hundreds of organisms, researchers found that unique multimodal k-mer spectra occur only in the genomes of organisms from the tetrapod clade, which includes all mammals. The multimodality is caused by the formation of the two lowest modes, and the k-mers under them are referred to as rare k-mers. The suppression of the two lowest modes (the rare k-mers) can be attributed to the CG dinucleotides they contain. In addition, rare k-mers are selectively distributed in certain genomic features: CpG islands (CGIs), promoters, 5' UTRs, and exons. We correlated the rare k-mers with hundreds of annotated features using several bioinformatic tools, performed further intrinsic rare k-mer analyses within the correlated features, and modeled the elucidated rare k-mer clustering feature into a classifier to predict the correlated CGI and promoter features. Our correlation results show that rare k-mers are highly associated with several annotated features of CGI, promoter, 5' UTR, and open chromatin regions. Our intrinsic results show that rare k-mers have several unique topological, compositional, and clustering properties in CGI and promoter features. Finally, the performances of our RWC (rare-word clustering) method in predicting the CGI and promoter features are ranked among the top three, in eight of the CGI and promoter evaluations, among eight of the benchmarked datasets.
|
28
|
Daly GM, Leggett RM, Rowe W, Stubbs S, Wilkinson M, Ramirez-Gonzalez RH, Caccamo M, Bernal W, Heeney JL. Host Subtraction, Filtering and Assembly Validations for Novel Viral Discovery Using Next Generation Sequencing Data. PLoS One 2015; 10:e0129059. [PMID: 26098299 PMCID: PMC4476701 DOI: 10.1371/journal.pone.0129059] [Citation(s) in RCA: 28] [Impact Index Per Article: 3.1] [Received: 09/10/2014] [Accepted: 05/04/2015] [Indexed: 12/18/2022] Open
Abstract
The use of next generation sequencing (NGS) to identify novel viral sequences from eukaryotic tissue samples is challenging. Issues can include the low proportion and copy number of viral reads and the high number of contigs (post-assembly), making subsequent viral analysis difficult. Comparison of assembly algorithms with pre-assembly host-mapping subtraction using a short-read mapping tool, a k-mer frequency based filter and a low complexity filter, has been validated for viral discovery with Illumina data derived from naturally infected liver tissue and simulated data. Assembled contig numbers were significantly reduced (up to 99.97%) by the application of these pre-assembly filtering methods. This approach provides a validated method for maximizing viral contig size as well as reducing the total number of assembled contigs that require down-stream analysis as putative viral nucleic acids.
Affiliation(s)
- Gordon M. Daly
- Lab of Viral Zoonotics, Department of Veterinary Medicine, University of Cambridge, Madingley Road, Cambridge, CB30ES, United Kingdom
- Richard M. Leggett
- The Genome Analysis Centre (TGAC), Norwich Research Park, Norwich, NR47UH, United Kingdom
- William Rowe
- Lab of Viral Zoonotics, Department of Veterinary Medicine, University of Cambridge, Madingley Road, Cambridge, CB30ES, United Kingdom
- Samuel Stubbs
- Lab of Viral Zoonotics, Department of Veterinary Medicine, University of Cambridge, Madingley Road, Cambridge, CB30ES, United Kingdom
- Maxim Wilkinson
- Lab of Viral Zoonotics, Department of Veterinary Medicine, University of Cambridge, Madingley Road, Cambridge, CB30ES, United Kingdom
- Mario Caccamo
- The Genome Analysis Centre (TGAC), Norwich Research Park, Norwich, NR47UH, United Kingdom
- William Bernal
- Institute of Liver Studies, King's College Hospital, Denmark Hill, London, SE59RS, United Kingdom
- Jonathan L. Heeney
- Lab of Viral Zoonotics, Department of Veterinary Medicine, University of Cambridge, Madingley Road, Cambridge, CB30ES, United Kingdom
|
29
|
Rojas M, Golovko G, Khanipov K, Albayrak L, Chumakov S, Pettitt BM, Strongin AY, Fofanov Y. Secondary Analysis of the NCI-60 Whole Exome Sequencing Data Indicates Significant Presence of Propionibacterium acnes Genomic Material in Leukemia (RPMI-8226) and Central Nervous System (SF-295, SF-539, and SNB-19) Cell Lines. PLoS One 2015; 10:e0127799. [PMID: 26039084 PMCID: PMC4454691 DOI: 10.1371/journal.pone.0127799] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 01/04/2015] [Accepted: 04/18/2015] [Indexed: 11/25/2022] Open
Abstract
The NCI-60 human tumor cell line panel has been used in a broad range of cancer research over the last two decades. A landmark 2013 whole exome sequencing study of this panel added an exceptional new resource for cancer biologists. The complementary analysis of the sequencing data produced by this study suggests the presence of Propionibacterium acnes genomic sequences in almost half of the datasets, with the highest abundance in the leukemia (RPMI-8226) and central nervous system (SF-295, SF-539, and SNB-19) cell lines. While the origin of these contaminating bacterial sequences remains to be determined, observed results suggest that computational control for the presence of microbial genomic material is a necessary step in the analysis of the high throughput sequencing (HTS) data.
Affiliation(s)
- Mark Rojas
- Department of Pharmacology and Toxicology, University of Texas Medical Branch, Galveston, Texas, United States of America
- Sealy Center for Structural Biology, University of Texas Medical Branch, Galveston, Texas, United States of America
- Georgiy Golovko
- Department of Pharmacology and Toxicology, University of Texas Medical Branch, Galveston, Texas, United States of America
- Sealy Center for Structural Biology, University of Texas Medical Branch, Galveston, Texas, United States of America
- Kamil Khanipov
- Department of Pharmacology and Toxicology, University of Texas Medical Branch, Galveston, Texas, United States of America
- Sealy Center for Structural Biology, University of Texas Medical Branch, Galveston, Texas, United States of America
- Levent Albayrak
- Department of Pharmacology and Toxicology, University of Texas Medical Branch, Galveston, Texas, United States of America
- Sealy Center for Structural Biology, University of Texas Medical Branch, Galveston, Texas, United States of America
- Sergei Chumakov
- Department of Physics, University of Guadalajara, Guadalajara, Jalisco, Mexico
- B. Montgomery Pettitt
- Department of Pharmacology and Toxicology, University of Texas Medical Branch, Galveston, Texas, United States of America
- Department of Biochemistry and Molecular Biology, University of Texas Medical Branch, Galveston, Texas, United States of America
- Sealy Center for Structural Biology, University of Texas Medical Branch, Galveston, Texas, United States of America
- Alex Y. Strongin
- Inflammatory and Infectious Disease Center/Cancer Research Center, Sanford-Burnham Medical Research Institute, La Jolla, California, United States of America
- Yuriy Fofanov
- Department of Pharmacology and Toxicology, University of Texas Medical Branch, Galveston, Texas, United States of America
- Sealy Center for Structural Biology, University of Texas Medical Branch, Galveston, Texas, United States of America
|
30
|
Tabb LP, Zhao W, Huang J, Rosen GL. Characterizing the empirical distribution of prokaryotic genome n-mers in the presence of nullomers. J Comput Biol 2014; 21:732-40. [PMID: 25075627 DOI: 10.1089/cmb.2014.0108] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/12/2022] Open
Abstract
Characterizing the empirical distribution of the frequency of n-mers is a vital step in understanding the entire genome. This will allow researchers to examine how complex the genome really is, and move beyond simple, traditional modeling frameworks that are often biased in the presence of abundant and/or extremely rare words. We hypothesize that models based on the negative binomial distribution and its zero-inflated counterpart will characterize the n-mer distributions of genomes better than the Poisson. Our study examined the empirical distribution of the frequency of n-mers (6 ≤ n ≤ 11) in 2,199 genomes. We considered four distributions: Poisson, negative binomial, zero-inflated Poisson, and zero-inflated negative binomial (ZINB). The number of genomes that have nullomers in 6-, 7-, and 8-mers was 150, 602 and 2,012, respectively, whereas all of the genomes for the 9-, 10-, and 11-mers had nullomers. For each n-mer length considered, the negative binomial model performed the best for at least 93% of the 2,199 genomes; however, a small percentage (i.e., <7%) of the genomes did prefer the ZINB. The negative binomial and zero-inflated distributions extend the traditional Poisson setting and are more flexible in handling overdispersion that can be caused by an increase in nullomers. In an effort to characterize the distribution of the frequency of n-mers, researchers should also consider other discrete distributions that are more flexible and adjust for possible overdispersion.
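The overdispersion argument in this abstract can be made concrete with a small diagnostic. The sketch below is not the model fitting used in the paper: it only tabulates the counts of all 4^n possible n-mers (zeros included), counts nullomers, and compares the mean with the variance. A Poisson model implies the two are equal, and the method-of-moments negative binomial size parameter is defined only when the variance exceeds the mean. All function names are hypothetical.

```python
from collections import Counter
from itertools import product
from statistics import mean, pvariance


def nmer_count_vector(seq, n):
    """Counts of every possible n-mer over ACGT, including n-mers that never occur."""
    observed = Counter(seq[i:i + n] for i in range(len(seq) - n + 1))
    return [observed.get("".join(p), 0) for p in product("ACGT", repeat=n)]


def dispersion_summary(values):
    """Mean, variance, nullomer count and a moment estimate of the NB size parameter."""
    m = mean(values)
    v = pvariance(values)
    # Under a negative binomial, variance = m + m^2 / r, so r = m^2 / (v - m).
    nb_size = m * m / (v - m) if v > m else None
    return {
        "mean": m,
        "variance": v,
        "nullomers": sum(1 for x in values if x == 0),
        "nb_size": nb_size,
    }
```

A variance far above the mean, together with many nullomers, is the situation in which the abstract reports that negative binomial or zero-inflated models fit better than the Poisson.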
Affiliation(s)
- Loni Philip Tabb
- Department of Epidemiology & Biostatistics, Drexel University, Philadelphia, Pennsylvania
|
31
|
Wang Y, Leung HCM, Yiu SM, Chin FYL. MetaCluster-TA: taxonomic annotation for metagenomic data based on assembly-assisted binning. BMC Genomics 2014; 15 Suppl 1:S12. [PMID: 24564377 PMCID: PMC4046714 DOI: 10.1186/1471-2164-15-s1-s12] [Citation(s) in RCA: 35] [Impact Index Per Article: 3.5] [Indexed: 12/18/2022] Open
Abstract
BACKGROUND Taxonomic annotation of reads is an important problem in metagenomic analysis. Existing annotation tools, which rely on the approach of aligning each read to the taxonomic structure, are unable to annotate many reads efficiently and accurately as reads (~100 bp) are short and most of them come from unknown genomes. Previous work has suggested assembling the reads to make longer contigs before annotation. More reads/contigs can be annotated as a longer contig (in Kbp) can be aligned to a taxon even if it is from an unknown species as long as it contains a conserved region of that taxon. Unfortunately, existing metagenomic assembly tools are not mature enough to produce long enough contigs. Binning tries to group reads/contigs of similar species together. Intuitively, reads in the same group (cluster) should be annotated to the same taxon, and these reads together should cover a significant portion of the genome, alleviating the problem of short contigs if the quality of binning is high. However, no existing work has tried to use binning results to help solve the annotation problem. This work explores this direction. RESULTS In this paper, we describe MetaCluster-TA, an assembly-assisted binning-based annotation tool which relies on an innovative idea of annotating binned reads instead of aligning each read or contig to the taxonomic structure separately. We propose the novel concept of the 'virtual contig' (which can be up to 10 Kb in length) to represent a set of reads and then represent each cluster as a set of 'virtual contigs' (which together can total up to 1 Mb in length) for annotation.
MetaCluster-TA can outperform widely-used MEGAN4 and can annotate (1) more reads since the virtual contigs are much longer; (2) more accurately since each cluster of long virtual contigs contains global information of the sampled genome which tends to be more accurate than short reads or assembled contigs which contain only local information of the genome; and (3) more efficiently since there are much fewer long virtual contigs to align than short reads. MetaCluster-TA outperforms MetaCluster 5.0 as a binning tool since binning itself can be more sensitive and precise given long virtual contigs and the binning results can be improved using the reference taxonomic database. CONCLUSIONS MetaCluster-TA can outperform widely-used MEGAN4 and can annotate more reads with higher accuracy and higher efficiency. It also outperforms MetaCluster 5.0 as a binning tool.
Affiliation(s)
- Yi Wang
- Department of Computer Science, The University of Hong Kong, Hong Kong
- Henry Chi Ming Leung
- Department of Computer Science, The University of Hong Kong, Hong Kong
- Siu Ming Yiu
- Department of Computer Science, The University of Hong Kong, Hong Kong
- Francis Yuk Lun Chin
- Department of Computer Science, The University of Hong Kong, Hong Kong
|
32
|
Solovyov A, Lipkin WI. Centroid based clustering of high throughput sequencing reads based on n-mer counts. BMC Bioinformatics 2013; 14:268. [PMID: 24011402 PMCID: PMC3848435 DOI: 10.1186/1471-2105-14-268] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.3] [Received: 05/14/2013] [Accepted: 09/05/2013] [Indexed: 11/10/2022] Open
Abstract
Background Many problems in computational biology require alignment-free sequence comparisons. One of the common tasks involving sequence comparison is sequence clustering. Here we apply methods of alignment-free comparison (in particular, comparison using sequence composition) to the challenge of sequence clustering. Results We study several centroid-based algorithms for clustering sequences based on word counts. Study of their performance shows that using the k-means algorithm with or without data whitening is computationally efficient. A higher clustering accuracy can be achieved using the soft expectation maximization method, whereby each sequence is attributed to each cluster with a specific probability. We implement an open source tool for alignment-free clustering. It is publicly available from GitHub: https://github.com/luscinius/afcluster. Conclusions We show the utility of alignment-free sequence clustering for high throughput sequencing analysis despite its limitations. In particular, it allows one to perform assembly with reduced resources and a minimal loss of quality. The major factor affecting performance of alignment-free read clustering is the length of the read.
Affiliation(s)
- Alexander Solovyov
- Center for Infection and Immunity, Columbia University, New York, NY, 10032, USA
|
33
|
Bystrykh LV. A combinatorial approach to the restriction of a mouse genome. BMC Res Notes 2013; 6:284. [PMID: 23875927 PMCID: PMC3724700 DOI: 10.1186/1756-0500-6-284] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 09/10/2012] [Accepted: 07/16/2013] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Fragmentation of genomic DNA by restriction digestion is a popular step in many applications. Usually attention is paid to the expected average size of the DNA fragments. Another important parameter, randomness of restriction, is regularly implied but rarely verified. This parameter is crucial to the expectation that either all fragments made by restriction will be suitable for the method of choice, or only a fraction of them will be effectively used by the method. If only a fraction of the fragments are used, we often need to know whether the used fragments are representative of the whole genome. With modern knowledge of the mouse, human and many other genomes, the frequencies and distributions of restriction sites and the sizes of the corresponding DNA fragments can be analyzed in silico. In this manuscript, the mouse genome was systematically scanned for frequencies of complementary 4-base-long palindromes. FINDINGS AND CONCLUSIONS The study revealed substantial heterogeneity in the distribution of those sites genome-wide. Only a few palindromes showed a close-to-random pattern of distribution. Overall, the distribution of frequencies for most palindromes is much wider than expected by random occurrence. In practical terms, accessibility of the genome upon restriction can be improved by a selective combination of restrictases using a few combinatorial rules. It is recommended to mix at least 3 restrictases whose recognition sequences (palindromes) are the least similar to each other. Principles of the optimization and optimal combinations of restrictases are provided.
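The scan for complementary 4-base palindromes described in this abstract can be sketched as follows. A site is palindromic in this sense when it equals its own reverse complement, which gives 16 such sites of length 4. The function names are hypothetical and the sketch counts overlapping occurrences on one strand only, which is an assumption rather than the paper's exact procedure.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(site):
    """Reverse complement of a DNA string over A, C, G and T."""
    return site.translate(COMPLEMENT)[::-1]


def palindromic_sites(length=4):
    """All sites of the given length that equal their own reverse complement."""
    words = ("".join(p) for p in product("ACGT", repeat=length))
    return [w for w in words if w == reverse_complement(w)]


def count_sites(seq, sites):
    """Count overlapping occurrences of each site in a sequence."""
    seq = seq.upper()
    k = len(sites[0])
    counts = dict.fromkeys(sites, 0)
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:
            counts[word] += 1
    return counts
```

Comparing these counts with the number expected from base composition alone is one way to see which sites deviate from a random distribution.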
Affiliation(s)
- Leonid V Bystrykh
- Laboratory of Ageing Biology and Stem Cells, European Research Institute for Biology of Ageing, University Medical Center Groningen, University of Groningen, Antonius Deusinglaan 1, 9700 AD, Groningen, The Netherlands
|
34
|
Wang Y, Leung HCM, Yiu SM, Chin FYL. MetaCluster 5.0: a two-round binning approach for metagenomic data for low-abundance species in a noisy sample. Bioinformatics 2013; 28:i356-i362. [PMID: 22962452 PMCID: PMC3436824 DOI: 10.1093/bioinformatics/bts397] [Citation(s) in RCA: 97] [Impact Index Per Article: 8.8] [Indexed: 11/29/2022] Open
Abstract
Motivation: Metagenomic binning remains an important topic in metagenomic analysis. Existing unsupervised binning methods for next-generation sequencing (NGS) reads do not perform well on (i) samples with low-abundance species or (ii) samples (even with high abundance) when there are many extremely low-abundance species. These two problems are common for real metagenomic datasets. Binning methods that can solve these problems are desirable. Results: We proposed a two-round binning method (MetaCluster 5.0) that aims at identifying both low-abundance and high-abundance species in the presence of a large amount of noise due to many extremely low-abundance species. In summary, MetaCluster 5.0 uses a filtering strategy to remove noise from the extremely low-abundance species. It separates reads of high-abundance species from those of low-abundance species in two different rounds. To overcome the issue of low coverage for low-abundance species, multiple w values are used to group reads with overlapping w-mers, whereas reads from high-abundance species are grouped with high confidence based on a large w and then binning expands to low-abundance species using a relaxed (shorter) w. Compared to the recent tools, TOSS and MetaCluster 4.0, MetaCluster 5.0 can find more species (especially those with low abundance of, say, 6× to 10×) and can achieve better sensitivity and specificity using less memory and running time. Availability: http://i.cs.hku.hk/~alse/MetaCluster/ Contact: chin@cs.hku.hk
Affiliation(s)
- Yi Wang
- Department of Computer Science, The University of Hong Kong, Hong Kong
|
35
|
Castellini A, Franco G, Manca V. A dictionary based informational genome analysis. BMC Genomics 2012; 13:485. [PMID: 22985068 PMCID: PMC3577435 DOI: 10.1186/1471-2164-13-485] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.3] [Received: 01/09/2012] [Accepted: 08/28/2012] [Indexed: 11/16/2022] Open
Abstract
Background In the post-genomic era, several methods of computational genomics are emerging to understand how information as a whole is structured within genomes. The literature of the last five years describes several alignment-free methods that have arisen as alternative metrics for the dissimilarity of biological sequences. Among others, recent approaches are based on empirical frequencies of DNA k-mers in whole genomes. Results Any set of words (factors) occurring in a genome provides a genomic dictionary. About sixty genomes were analyzed by means of informational indexes based on genomic dictionaries, where a systemic view replaces a local sequence analysis. A software prototype applying the methodology outlined here carried out some computations on genomic data. We computed informational indexes and built genomic dictionaries of different sizes, along with frequency distributions. The software performed three main tasks: computation of informational indexes, storage of these in a database, and index analysis and visualization. The validation was done by investigating genomes of various organisms. A systematic analysis of genomic repeats of several lengths, which is of great interest in biology (for example, to compute excessively represented functional sequences, such as promoters), was discussed, and suggested a method to define synthetic genetic networks. Conclusions We introduced a methodology based on dictionaries, and an efficient motif-finding software application for comparative genomics. This approach could be extended along many lines of investigation, for example by exporting it to other contexts of computational genomics as a basis for discriminating genomic pathologies.
Affiliation(s)
- Alberto Castellini
- Department of Computer Science, Strada Le Grazie 15, 37134 Verona, Italy
|
36
|
Abstract
Motivation: Next-generation sequencing techniques allow us to generate reads from a microbial environment in order to analyze the microbial community. However, assembling a set of mixed reads from different species to form contigs is a bottleneck of metagenomic research. Although there are many assemblers for assembling reads from a single genome, there are no assemblers for assembling reads in metagenomic data without reference genome sequences. Moreover, the performances of these assemblers on metagenomic data are far from satisfactory, because of the existence of common regions in the genomes of subspecies and species, which make the assembly problem much more complicated. Results: We introduce the Meta-IDBA algorithm for assembling reads in metagenomic data, which contain multiple genomes from different species. There are two core steps in Meta-IDBA. It first tries to partition the de Bruijn graph into isolated components of different species based on an important observation. Then, for each component, it captures the slight variants of the genomes of subspecies from the same species by multiple alignments and represents the genome of one species using a consensus sequence. Comparison of the performances of Meta-IDBA and existing assemblers, such as Velvet and ABySS, for different metagenomic datasets shows that Meta-IDBA can reconstruct longer contigs with similar accuracy. Availability: Meta-IDBA toolkit is available at our website http://www.cs.hku.hk/~alse/metaidba. Contact: chin@cs.hku.hk
Affiliation(s)
- Yu Peng
- Department of Computer Science, The University of Hong Kong, Hong Kong
|
37
|
Davenport CF, Tümmler B. Abundant oligonucleotides common to most bacteria. PLoS One 2010; 5:e9841. [PMID: 20352124 PMCID: PMC2843746 DOI: 10.1371/journal.pone.0009841] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Received: 08/05/2009] [Accepted: 03/03/2010] [Indexed: 11/25/2022] Open
Abstract
Background Bacteria show a bias in their genomic oligonucleotide composition far beyond that dictated by G+C content. Patterns of over- and underrepresented oligonucleotides carry a phylogenetic signal and are thus diagnostic for individual species. Patterns of short oligomers have been investigated by multiple groups in large numbers of bacterial genomes. However, global distributions of the most highly overrepresented mid-sized oligomers have not been assessed across all prokaryotes to date. We surveyed overrepresented mid-length oligomers across all prokaryotes and normalised for base composition and embedded oligomers using zero and second order Markov models. Principal Findings Here we report a presumably ancient set of oligomers conserved and overrepresented in nearly all branches of prokaryotic life, including Archaea. These oligomers are either adenine-rich homopurines with one to three guanine nucleosides, or homopyrimidines with one to four cytosine nucleosides. They do not show a consistent preference for coding or non-coding regions or aggregate in any coding frame, implying a role in DNA structure and as polypeptide binding sites. Structural parameters indicate these oligonucleotides to be an extreme and rigid form of B-DNA prone to forming triple stranded helices under common physiological conditions. Moreover, the narrow minor grooves of these structures are recognised by DNA binding and nucleoid associated proteins such as HU. Conclusion Homopurine and homopyrimidine oligomers exhibit distinct and unusual structural features and are present at high copy number in nearly all prokaryotic lineages. This fact suggests a non-neutral role of these oligonucleotides for bacterial genome organization that has been maintained throughout evolution.
Affiliation(s)
- Colin F Davenport
- Pediatric Pneumology and Neonatology, Hanover Medical School, Hanover, Lower Saxony, Germany.
|
38
|
Chor B, Horn D, Goldman N, Levy Y, Massingham T. Genomic DNA k-mer spectra: models and modalities. Genome Biol 2009; 10:R108. [PMID: 19814784 PMCID: PMC2784323 DOI: 10.1186/gb-2009-10-10-r108] [Citation(s) in RCA: 120] [Impact Index Per Article: 8.0] [Received: 03/17/2009] [Revised: 08/14/2009] [Accepted: 10/08/2009] [Indexed: 11/12/2022] Open
Abstract
Tetrapods, unlike other organisms, have multimodal spectra of k-mers in their genomes. Background: The empirical frequencies of DNA k-mers in whole genome sequences provide an interesting perspective on genomic complexity, and the availability of large segments of genomic sequence from many organisms means that analysis of k-mers with non-trivial lengths is now possible. Results: We have studied the k-mer spectra of more than 100 species from Archaea, Bacteria, and Eukaryota, particularly looking at the modalities of the distributions. As expected, most species have a unimodal k-mer spectrum. However, a few species, including all mammals, have multimodal spectra. These species coincide with the tetrapods. Genomic sequences are clearly very complex, and cannot be fully explained by any simple probabilistic model. Yet we sought such an explanation for the observed modalities, and discovered that low-order Markov models capture this property (and some others) fairly well. Conclusions: Multimodal spectra are characterized by specific ranges of values of C+G content and of CpG dinucleotide suppression, a range that encompasses all tetrapods analyzed. Other genomes, like that of the protozoan Entamoeba histolytica, which also exhibits CpG suppression, do not have multimodal k-mer spectra. Groupings of functional elements of the human genome also have a clear modality, and exhibit either unimodal or multimodal behaviour, depending on the two above-mentioned values.
Affiliation(s)
- Benny Chor
- School of Computer Science, Tel Aviv University, Klausner St, Ramat-Aviv, Tel-Aviv 39040, Israel.
|
39
|
Acquisti C, Poste G, Curtiss D, Kumar S. Nullomers: really a matter of natural selection? PLoS One 2007; 2:e1022. [PMID: 17925870 PMCID: PMC1995752 DOI: 10.1371/journal.pone.0001022] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.5] [Received: 08/27/2007] [Accepted: 09/19/2007] [Indexed: 11/23/2022] Open
Abstract
Background: Nullomers are short DNA sequences that are absent from the genomes of humans and other species. Assuming that nullomers are the signatures of natural selection against deleterious sequences in humans, the use of nullomers in drug target identification, pesticide development, environmental monitoring, and forensic applications has been envisioned. Results: Here, we show that the hypermutability of CpG dinucleotides, rather than natural selection against the nullomer sequences, is likely the reason for the phenomenal event of short sequence motifs becoming nullomers. Furthermore, many reported human nullomers differ by only one nucleotide, which reinforces the role of mutation in the evolution of the constellation of nullomers in populations and species. The known nullomers in chimpanzee, cow, dog, and mouse genomes show patterns that are consistent with those seen in humans. Conclusions: The role of mutations, instead of selection, in generating nullomers casts doubt on the utility of nullomers in many envisioned applications, because of their dependence on the role of lethal selection in the origin of nullomers.
Affiliation(s)
- Claudia Acquisti
- Center for Evolutionary Functional Genomics, Arizona State University, Tempe, Arizona, United States of America
- The Biodesign Institute and the School of Life Sciences, Arizona State University, Tempe, Arizona, United States of America
| | - George Poste
- The Biodesign Institute and the School of Life Sciences, Arizona State University, Tempe, Arizona, United States of America
| | - David Curtiss
- Center for Evolutionary Functional Genomics, Arizona State University, Tempe, Arizona, United States of America
| | - Sudhir Kumar
- Center for Evolutionary Functional Genomics, Arizona State University, Tempe, Arizona, United States of America
- The Biodesign Institute and the School of Life Sciences, Arizona State University, Tempe, Arizona, United States of America
- * To whom correspondence should be addressed. E-mail:
|
40
|
King BR, Guda C. ngLOC: an n-gram-based Bayesian method for estimating the subcellular proteomes of eukaryotes. Genome Biol 2007; 8:R68. [PMID: 17472741 PMCID: PMC1929137 DOI: 10.1186/gb-2007-8-5-r68] [Citation(s) in RCA: 86] [Impact Index Per Article: 5.1] [Received: 11/07/2006] [Revised: 02/19/2007] [Accepted: 05/01/2007] [Indexed: 11/23/2022] Open
Abstract
We present a method called ngLOC, an n-gram-based Bayesian classifier that predicts the localization of a protein sequence over ten distinct subcellular organelles. A tenfold cross-validation result shows an accuracy of 89% for sequences localized to a single organelle, and 82% for those localized to multiple organelles. An enhanced version of ngLOC was developed to estimate the subcellular proteomes of eight eukaryotic organisms: yeast, nematode, fruitfly, mosquito, zebrafish, chicken, mouse, and human.
Affiliation(s)
- Brian R King
- Department of Computer Science, State University of New York at Albany, Washington Ave, Albany, New York 12222, USA
- Gen*NY*sis Center for Excellence in Cancer Genomics, State University of New York at Albany, Discovery Drive, Rensselaer, New York 12144-3456, USA
| | - Chittibabu Guda
- Gen*NY*sis Center for Excellence in Cancer Genomics, State University of New York at Albany, Discovery Drive, Rensselaer, New York 12144-3456, USA
- Department of Epidemiology and Biostatistics, State University of New York at Albany, Discovery Drive, Rensselaer, New York 12144-3456, USA
|
41
|
Reed C, Fofanov V, Putonti C, Chumakov S, Slezak T, Fofanov Y. Effect of the mutation rate and background size on the quality of pathogen identification. Bioinformatics 2007; 23:2665-71. [PMID: 17881407 DOI: 10.1093/bioinformatics/btm420] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.2] [Indexed: 11/13/2022]
Abstract
Motivation: Genomic-based methods have significant potential for fast and accurate identification of organisms or even genes of interest in complex environmental samples (air, water, soil, food, etc.), especially when isolation of the target organism cannot be performed for a variety of reasons. Despite this potential, the presence of unknown, variable and usually large quantities of background DNA can cause interference resulting in false positive outcomes. Results: In order to estimate how the genomic diversity of the background (total length of all of the different genomes present in the background), target length and target mutation rate affect the probability of misidentifications, we introduce a mathematical definition for the quality of an individual signature in the presence of a background based on its length and the number of mismatches needed to transform the signature into the closest subsequence present in the background. This definition, in conjunction with a probabilistic framework, allows one to predict the minimal signature length required to identify the target in the presence of different sizes of backgrounds and the effect of the target's mutation rate on the quality of its identification. The model assumptions and predictions were validated using both Monte Carlo simulations and real genomic data examples. The proposed model can be used to determine appropriate signature lengths for various combinations of target and background genome sizes. It also predicted that any genomic signature will be unable to identify the target if its mutation rate is >5%. Supplementary Information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Chris Reed
- Department of Computer Science, University of Houston, 501 Philip G. Hoffman Hall, Houston, TX 77204, USA
|
42
|
Salerno W, Havlak P, Miller J. Scale-invariant structure of strongly conserved sequence in genomic intersections and alignments. Proc Natl Acad Sci U S A 2006; 103:13121-5. [PMID: 16924100 PMCID: PMC1559763 DOI: 10.1073/pnas.0605735103] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.0] [Indexed: 12/17/2022] Open
Abstract
A power-law distribution of the length of perfectly conserved sequence from mouse/human whole-genome intersection and alignment is exhibited. Spatial correlations of these elements within the mouse genome are studied. It is argued that these power-law distributions and correlations are comprised in part by functional noncoding sequence and ought to be accounted for in estimating the statistical significance of apparent sequence conservation. These inter-genomic correlations of conservation are placed in the context of previously observed intra-genomic correlations, and their possible origins and consequences are discussed.
Affiliation(s)
| | - Paul Havlak
- Human Genome Sequencing Center, Baylor College of Medicine, One Baylor Plaza, Houston, TX 77030
| | - Jonathan Miller
- *Department of Biochemistry and Molecular Biology and
- Human Genome Sequencing Center, Baylor College of Medicine, One Baylor Plaza, Houston, TX 77030
- To whom correspondence should be addressed. E-mail:
|
43
|
Abstract
MicroRNAs are short (∼22 nt) regulatory RNA molecules that play key roles in metazoan development and have been implicated in human disease. First discovered in Caenorhabditis elegans, over 2500 microRNAs have been isolated in metazoans and plants; it has been estimated that there may be more than a thousand microRNA genes in the human genome alone. Motivated by the experimental observation of strong conservation of the microRNA let-7 among nearly all metazoans, we developed a novel methodology to characterize the class of such strongly conserved sequences: we identified a non-redundant set of all sequences 20 to 29 bases in length that are shared among three insects: fly, bee and mosquito. Among the few hundred sequences greater than 20 bases in length are close to 40% of the 78 confirmed fly microRNAs, along with other non-coding RNAs and coding sequence.
Affiliation(s)
- T. Tran
- Department of Biochemistry, Baylor College of Medicine TX, USA
| | - P. Havlak
- Department of Human Genome Sequencing Center, Baylor College of Medicine TX, USA
| | - J. Miller
- Department of Biochemistry, Baylor College of Medicine TX, USA
- To whom correspondence should be addressed. Tel: +1 713 798 3542; Fax: +1 713 796 9438;
|
44
|
Putonti C, Chumakov S, Mitra R, Fox GE, Willson RC, Fofanov Y. Human-blind probes and primers for dengue virus identification. FEBS J 2006; 273:398-408. [PMID: 16403026 DOI: 10.1111/j.1742-4658.2005.05074.x] [Citation(s) in RCA: 15] [Impact Index Per Article: 0.8] [Indexed: 01/29/2023]
Abstract
Reliable detection and identification of pathogens in complex biological samples, in the presence of contaminating DNA from a variety of sources, is an important and challenging diagnostic problem for the development of field tests. The problem is compounded by the difficulty of finding a single, unique genomic sequence that is present simultaneously in all genomes of a species of closely related pathogens and absent in the genomes of the host or the organisms that contribute to the sample background. Here we describe 'host-blind probe design', a novel strategy of designing probes based on highly frequent genomic signatures found in the pathogen genomes of interest but absent from the host genome. Upon hybridization, an array of such informative probes will produce a unique pattern that is a genetic fingerprint for each pathogen strain. This multiprobe approach was applied to 83 dengue virus genome sequences, available in public databases, to design and perform in silico microarray experiments. The resulting patterns allow one to unequivocally distinguish the four major serotypes, and within each serotype to identify the most similar strain among those that have been completely sequenced. In an environment where dengue is indigenous, this would allow investigators to determine if a particular isolate belongs to an ongoing outbreak or is a previously circulating version. Using our probe set, the probability that misdiagnosis at the serotype level would occur is approximately 1 : 10^150.
Affiliation(s)
- Catherine Putonti
- Department of Computer Science, University of Houston, Houston, TX, USA.
|
45
|
Gangal R, Sharma P. Human pol II promoter prediction: time series descriptors and machine learning. Nucleic Acids Res 2005; 33:1332-6. [PMID: 15741185 PMCID: PMC552959 DOI: 10.1093/nar/gki271] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.2] [Received: 09/09/2004] [Revised: 02/08/2005] [Accepted: 02/08/2005] [Indexed: 11/14/2022] Open
Abstract
Although several in silico promoter prediction methods have been developed to date, they are still limited in predictive performance. The limitations are due to the challenge of selecting appropriate features of promoters that distinguish them from non-promoters and the generalization or predictive ability of the machine-learning algorithms. In this paper we attempt to define a novel approach by using unique descriptors and machine-learning methods for the recognition of eukaryotic polymerase II promoters. In this study, non-linear time series descriptors along with non-linear machine-learning algorithms, such as support vector machine (SVM), are used to discriminate between promoter and non-promoter regions. The basic idea here is to use descriptors that do not depend on the primary DNA sequence and provide a clear distinction between promoter and non-promoter regions. The classification model built on a set of 1000 promoter and 1500 non-promoter sequences showed a 10-fold cross-validation accuracy of 87%, and an independent test set had an accuracy >85% in both promoter and non-promoter identification. This approach correctly identified all 20 experimentally verified promoters of human chromosome 22. The high sensitivity and selectivity indicate that n-mer frequencies along with non-linear time series descriptors, such as Lyapunov component stability and Tsallis entropy, and supervised machine-learning methods, such as SVMs, can be useful in the identification of pol II promoters.
Affiliation(s)
- Rajeev Gangal
- SciNova Technologies Pvt. Ltd, 528/43 Vishwashobha, Adjacent to Modi Ganpati, Narayan Peth, Pune 411030, Maharashtra, India
| | - Pankaj Sharma
- SciNova Technologies Pvt. Ltd, 528/43 Vishwashobha, Adjacent to Modi Ganpati, Narayan Peth, Pune 411030, Maharashtra, India
|
46
|
Current Awareness on Comparative and Functional Genomics. Comp Funct Genomics 2005. [PMCID: PMC2447482 DOI: 10.1002/cfg.421] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/16/2022] Open
|