Reference Citation Analysis

For: Fofanov Y, Luo Y, Katili C, Wang J, Belosludtsev Y, Powdrill T, Belapurkar C, Fofanov V, Li TB, Chumakov S, Pettitt BM. How independent are the appearances of n-mers in different genomes? Bioinformatics 2004;20:2421-8. [PMID: 15087315 DOI: 10.1093/bioinformatics/bth266] [Citation(s) in RCA: 44] [Impact Index Per Article: 2.2] [Indexed: 11/15/2022]


Cited by Other Article(s)

Chantzi N, Mareboina M, Konnaris MA, Montgomery A, Patsakis M, Mouratidis I, Georgakopoulos-Soares I. The determinants of the rarity of nucleic and peptide short sequences in nature. NAR Genom Bioinform 2024;6:lqae029. [PMID: 38584871 PMCID: PMC10993293 DOI: 10.1093/nargab/lqae029] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/12/2023] [Revised: 02/21/2024] [Accepted: 03/18/2024] [Indexed: 04/09/2024]

Roberts M, Josephs EB. Previously unmeasured genetic diversity explains part of Lewontin's paradox in a k-mer-based meta-analysis of 112 plant species. bioRxiv 2024:2024.05.17.594778. [PMID: 38798362 PMCID: PMC11118579 DOI: 10.1101/2024.05.17.594778] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/29/2024]

Ponsero AJ, Miller M, Hurwitz BL. Comparison of k-mer-based de novo comparative metagenomic tools and approaches. Microbiome Research Reports 2023;2:27. [PMID: 38058765 PMCID: PMC10696585 DOI: 10.20517/mrr.2023.26] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/04/2023] [Revised: 06/28/2023] [Accepted: 07/12/2023] [Indexed: 12/08/2023]

Abstract

Aim: Comparative metagenomic analysis requires measuring a pairwise similarity between metagenomes in the dataset. Reference-based methods that compute a beta-diversity distance between two metagenomes are highly dependent on the quality and completeness of the reference database, and their application on less studied microbiota can be challenging. On the other hand, de-novo comparative metagenomic methods only rely on the sequence composition of metagenomes to compare datasets. While each one of these approaches has its strengths and limitations, their comparison is currently limited. Methods: We developed sets of simulated short-reads metagenomes to (1) compare k-mer-based and taxonomy-based distances and evaluate the impact of technical and biological variables on these metrics and (2) evaluate the effect of k-mer sketching and filtering. We used a real-world metagenomic dataset to provide an overview of the currently available tools for de novo metagenomic comparative analysis. Results: Using simulated metagenomes of known composition and controlled error rate, we showed that k-mer-based distance metrics were well correlated to the taxonomic distance metric for quantitative Beta-diversity metrics, but the correlation was low for presence/absence distances. The community complexity in terms of taxa richness and the sequencing depth significantly affected the quality of the k-mer-based distances, while the impact of low amounts of sequence contamination and sequencing error was limited. Finally, we benchmarked currently available de-novo comparative metagenomic tools and compared their output on two datasets of fecal metagenomes and showed that most k-mer-based tools were able to recapitulate the data structure observed using taxonomic approaches. 
Conclusion: This study expands our understanding of the strength and limitations of k-mer-based de novo comparative metagenomic approaches and aims to provide concrete guidelines for researchers interested in applying these approaches to their metagenomic datasets.
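
The distinction drawn in this abstract between presence/absence and quantitative k-mer distances can be illustrated with a minimal sketch. It assumes Jaccard as the presence/absence metric and Bray-Curtis as the quantitative one; the abstract does not name the specific metrics, and the function names and toy sequences below are hypothetical rather than taken from the paper or from any of the benchmarked tools.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping k-mer in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def jaccard_distance(a, b):
    """Presence/absence distance: 1 - |shared k-mers| / |all k-mers|."""
    sa, sb = set(a), set(b)
    union = sa | sb
    if not union:
        return 0.0
    return 1.0 - len(sa & sb) / len(union)


def bray_curtis_distance(a, b):
    """Quantitative distance: 1 - 2 * sum(min counts) / total counts."""
    total = sum(a.values()) + sum(b.values())
    if total == 0:
        return 0.0
    shared = sum(min(a[kmer], b[kmer]) for kmer in a.keys() & b.keys())
    return 1.0 - 2.0 * shared / total


# Toy "samples" (illustrative only): two short sequences, k = 3.
x = kmer_counts("ACGTACGTAC", 3)
y = kmer_counts("ACGTTTTTTT", 3)

jaccard = jaccard_distance(x, y)
bray_curtis = bray_curtis_distance(x, y)
print(f"Jaccard: {jaccard:.3f}  Bray-Curtis: {bray_curtis:.3f}")
```

The two metrics can disagree on the same pair of samples because Jaccard ignores how often each k-mer occurs, whereas Bray-Curtis weights shared k-mers by their counts.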

Xu X, Yin Z, Yan L, Zhang H, Xu B, Wei Y, Niu B, Schmidt B, Liu W. RabbitTClust: enabling fast clustering analysis of millions of bacteria genomes with MinHash sketches. Genome Biol 2023;24:121. [PMID: 37198663 PMCID: PMC10190105 DOI: 10.1186/s13059-023-02961-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/14/2022] [Accepted: 05/05/2023] [Indexed: 05/19/2023]

Zheng Y, Shi J, Chen Q, Deng C, Yang F, Wang Y. Identifying individual-specific microbial DNA fingerprints from skin microbiomes. Front Microbiol 2022;13:960043. [PMID: 36274714 PMCID: PMC9583911 DOI: 10.3389/fmicb.2022.960043] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/02/2022] [Accepted: 09/16/2022] [Indexed: 11/22/2022]

Mc Cartney AM, Shafin K, Alonge M, Bzikadze AV, Formenti G, Fungtammasan A, Howe K, Jain C, Koren S, Logsdon GA, Miga KH, Mikheenko A, Paten B, Shumate A, Soto DC, Sović I, Wood JMD, Zook JM, Phillippy AM, Rhie A. Chasing perfection: validation and polishing strategies for telomere-to-telomere genome assemblies. Nat Methods 2022;19:687-695. [PMID: 35361931 PMCID: PMC9812399 DOI: 10.1038/s41592-022-01440-3] [Citation(s) in RCA: 40] [Impact Index Per Article: 20.0] [Received: 07/13/2021] [Accepted: 03/04/2022] [Indexed: 01/07/2023]

Affiliation(s)

Ann M. Mc Cartney Genome Informatics Section, Computational and Statistical Genomics Branch, NHGRI, NIH
Kishwar Shafin UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, CA, USA
Michael Alonge Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
Andrey V. Bzikadze Graduate Program in Bioinformatics and Systems Biology, University of California San Diego, La Jolla, CA, USA
Giulio Formenti Laboratory of Neurogenetics of Language and The Vertebrate Genome Lab, The Rockefeller University, New York, NY, USA
Arkarachai Fungtammasan DNAnexus, Mountain View, CA, USA
Kerstin Howe Wellcome Sanger Institute, Cambridge, UK
Chirag Jain Genome Informatics Section, Computational and Statistical Genomics Branch, NHGRI, NIH; Department of Computational and Data Sciences, Indian Institute of Science, Bangalore KA, India
Sergey Koren Genome Informatics Section, Computational and Statistical Genomics Branch, NHGRI, NIH
Glennis A. Logsdon Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
Karen H. Miga UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, CA, USA
Alla Mikheenko Center for Algorithmic Biotechnology, Institute of Translational Biomedicine, Saint Petersburg State University, Saint Petersburg, Russia
Benedict Paten UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, CA, USA
Alaina Shumate Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
Daniela C. Soto Genome Center, MIND Institute, Department of Biochemistry and Molecular Medicine, University of California, Davis, CA, USA
Ivan Sović Pacific Biosciences, Menlo Park, CA, USA; Digital BioLogic d.o.o., Ivanić-Grad, Croatia
Jonathan MD Wood Wellcome Sanger Institute, Cambridge, UK
Justin M. Zook Biosystems and Biomaterials Division, National Institute of Standards and Technology, Gaithersburg, MD, USA
Adam M. Phillippy Genome Informatics Section, Computational and Statistical Genomics Branch, NHGRI, NIH
Arang Rhie Genome Informatics Section, Computational and Statistical Genomics Branch, NHGRI, NIH

Merfin: improved variant filtering, assembly evaluation and polishing via k-mer validation. Nat Methods 2022;19:696-704. [PMID: 35361932 PMCID: PMC9745813 DOI: 10.1038/s41592-022-01445-y] [Citation(s) in RCA: 24] [Impact Index Per Article: 12.0] [Received: 07/13/2021] [Accepted: 03/07/2022] [Indexed: 12/15/2022]

Hoarfrost A, Aptekmann A, Farfañuk G, Bromberg Y. Deep learning of a bacterial and archaeal universal language of life enables transfer learning and illuminates microbial dark matter. Nat Commun 2022;13:2606. [PMID: 35545619 PMCID: PMC9095714 DOI: 10.1038/s41467-022-30070-8] [Citation(s) in RCA: 12] [Impact Index Per Article: 6.0] [Received: 01/20/2021] [Accepted: 03/30/2022] [Indexed: 12/22/2022]

Hoyt SJ, Storer JM, Hartley GA, Grady PGS, Gershman A, de Lima LG, Limouse C, Halabian R, Wojenski L, Rodriguez M, Altemose N, Rhie A, Core LJ, Gerton JL, Makalowski W, Olson D, Rosen J, Smit AFA, Straight AF, Vollger MR, Wheeler TJ, Schatz MC, Eichler EE, Phillippy AM, Timp W, Miga KH, O’Neill RJ. From telomere to telomere: The transcriptional and epigenetic state of human repeat elements. Science 2022;376:eabk3112. [PMID: 35357925 PMCID: PMC9301658 DOI: 10.1126/science.abk3112] [Citation(s) in RCA: 114] [Impact Index Per Article: 57.0] [Indexed: 12/12/2022]

Affiliation(s)

Savannah J. Hoyt Department of Molecular and Cell Biology, University of Connecticut, Storrs, CT, USA
Jessica M. Storer Institute for Systems Biology, Seattle, WA, USA
Gabrielle A. Hartley Department of Molecular and Cell Biology, University of Connecticut, Storrs, CT, USA
Patrick G. S. Grady Department of Molecular and Cell Biology, University of Connecticut, Storrs, CT, USA
Ariel Gershman Department of Molecular Biology and Genetics, Johns Hopkins University, Baltimore, MD, USA
Leonardo G. de Lima Stowers Institute for Medical Research, Kansas City, MO, USA
Charles Limouse Department of Biochemistry, Stanford University, Stanford, CA, USA
Reza Halabian Institute of Bioinformatics, Faculty of Medicine, University of Münster, Münster, Germany
Luke Wojenski Department of Molecular and Cell Biology, University of Connecticut, Storrs, CT, USA
Matias Rodriguez Institute of Bioinformatics, Faculty of Medicine, University of Münster, Münster, Germany
Nicolas Altemose Department of Bioengineering, University of California, Berkeley, Berkeley, CA, USA
Arang Rhie Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
Leighton J. Core Department of Molecular and Cell Biology, University of Connecticut, Storrs, CT, USA Institute for Systems Genomics, University of Connecticut, Storrs, CT, USA
Jennifer L. Gerton Stowers Institute for Medical Research, Kansas City, MO, USA
Wojciech Makalowski Institute of Bioinformatics, Faculty of Medicine, University of Münster, Münster, Germany
Daniel Olson Department of Computer Science, University of Montana, Missoula, MT, USA
Jeb Rosen Institute for Systems Biology, Seattle, WA, USA
Arian F. A. Smit Institute for Systems Biology, Seattle, WA, USA
Aaron F. Straight Department of Biochemistry, Stanford University, Stanford, CA, USA
Mitchell R. Vollger Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
Travis J. Wheeler Department of Computer Science, University of Montana, Missoula, MT, USA
Michael C. Schatz Department of Computer Science and Department of Biology, Johns Hopkins University, Baltimore, MD, USA
Evan E. Eichler Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA Howard Hughes Medical Institute, University of Washington, Seattle, WA, USA
Adam M. Phillippy Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
Winston Timp Department of Molecular Biology and Genetics, Johns Hopkins University, Baltimore, MD, USA Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
Karen H. Miga UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA, USA
Rachel J. O’Neill Department of Molecular and Cell Biology, University of Connecticut, Storrs, CT, USA Institute for Systems Genomics, University of Connecticut, Storrs, CT, USA Department of Genetics and Genome Sciences, UConn Health, Farmington, CT, USA

Katz KS, Shutov O, Lapoint R, Kimelman M, Brister JR, O’Sullivan C. STAT: a fast, scalable, MinHash-based k-mer tool to assess Sequence Read Archive next-generation sequence submissions. Genome Biol 2021;22:270. [PMID: 34544477 PMCID: PMC8450716 DOI: 10.1186/s13059-021-02490-0] [Citation(s) in RCA: 28] [Impact Index Per Article: 9.3] [Received: 04/26/2021] [Accepted: 09/08/2021] [Indexed: 02/03/2024]

Xu M, Guo L, Du X, Li L, Peters BA, Deng L, Wang O, Chen F, Wang J, Jiang Z, Han J, Ni M, Yang H, Xu X, Liu X, Huang J, Fan G. Accurate Haplotype-Resolved Assembly Reveals The Origin Of Structural Variants For Human Trios. Bioinformatics 2021;37:2095-2102. [PMID: 33538292 PMCID: PMC8613828 DOI: 10.1093/bioinformatics/btab068] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 08/27/2020] [Revised: 12/07/2020] [Accepted: 01/28/2021] [Indexed: 11/13/2022]

Criscuolo A. On the transformation of MinHash-based uncorrected distances into proper evolutionary distances for phylogenetic inference. F1000Res 2020;9:1309. [PMID: 33335719 PMCID: PMC7713896 DOI: 10.12688/f1000research.26930.1] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Accepted: 10/12/2020] [Indexed: 12/29/2022]

Rhie A, Walenz BP, Koren S, Phillippy AM. Merqury: reference-free quality, completeness, and phasing assessment for genome assemblies. Genome Biol 2020;21:245. [PMID: 32928274 PMCID: PMC7488777 DOI: 10.1186/s13059-020-02134-9] [Citation(s) in RCA: 617] [Impact Index Per Article: 154.3] [Received: 03/31/2020] [Accepted: 08/06/2020] [Indexed: 01/26/2023]

Nurk S, Walenz BP, Rhie A, Vollger MR, Logsdon GA, Grothe R, Miga KH, Eichler EE, Phillippy AM, Koren S. HiCanu: accurate assembly of segmental duplications, satellites, and allelic variants from high-fidelity long reads. Genome Res 2020;30:1291-1305. [PMID: 32801147 PMCID: PMC7545148 DOI: 10.1101/gr.263566.120] [Citation(s) in RCA: 315] [Impact Index Per Article: 78.8] [Received: 03/14/2020] [Accepted: 08/04/2020] [Indexed: 12/14/2022]

Bermúdez-Barrientos JR, Ramírez-Sánchez O, Chow FWN, Buck AH, Abreu-Goodger C. Disentangling sRNA-Seq data to study RNA communication between species. Nucleic Acids Res 2020;48:e21. [PMID: 31879784 PMCID: PMC7038986 DOI: 10.1093/nar/gkz1198] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 04/03/2019] [Revised: 11/23/2019] [Accepted: 12/18/2019] [Indexed: 12/28/2022]

Borrayo E, May-Canche I, Paredes O, Morales JA, Romo-Vázquez R, Vélez-Pérez H. Whole-Genome k-mer Topic Modeling Associates Bacterial Families. Genes (Basel) 2020;11:genes11020197. [PMID: 32075081 PMCID: PMC7074292 DOI: 10.3390/genes11020197] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 01/07/2020] [Revised: 02/07/2020] [Accepted: 02/09/2020] [Indexed: 11/16/2022]

Criscuolo A. A fast alignment-free bioinformatics procedure to infer accurate distance-based phylogenetic trees from genome assemblies. Research Ideas and Outcomes 2019. [DOI: 10.3897/rio.5.e36178] [Citation(s) in RCA: 36] [Impact Index Per Article: 7.2] [Indexed: 12/26/2022]

Sarmashghi S, Bohmann K, Gilbert MTP, Bafna V, Mirarab S. Skmer: assembly-free and alignment-free sample identification using genome skims. Genome Biol 2019;20:34. [PMID: 30760303 PMCID: PMC6374904 DOI: 10.1186/s13059-019-1632-4] [Citation(s) in RCA: 54] [Impact Index Per Article: 10.8] [Received: 03/04/2018] [Accepted: 01/16/2019] [Indexed: 01/10/2023]

Choi I, Ponsero AJ, Bomhoff M, Youens-Clark K, Hartman JH, Hurwitz BL. Libra: scalable k-mer-based tool for massive all-vs-all metagenome comparisons. Gigascience 2019;8:5266304. [PMID: 30597002 PMCID: PMC6354030 DOI: 10.1093/gigascience/giy165] [Citation(s) in RCA: 24] [Impact Index Per Article: 4.8] [Received: 08/24/2018] [Accepted: 12/17/2018] [Indexed: 11/23/2022]

Koren S, Rhie A, Walenz BP, Dilthey AT, Bickhart DM, Kingan SB, Hiendleder S, Williams JL, Smith TPL, Phillippy AM. De novo assembly of haplotype-resolved genomes with trio binning. Nat Biotechnol 2018;36:nbt.4277. [PMID: 30346939 PMCID: PMC6476705 DOI: 10.1038/nbt.4277] [Citation(s) in RCA: 248] [Impact Index Per Article: 41.3] [Received: 02/25/2018] [Accepted: 09/10/2018] [Indexed: 12/20/2022]

Wang Z, Lou H, Wang Y, Shamir R, Jiang R, Chen T. GePMI: A statistical model for personal intestinal microbiome identification. NPJ Biofilms Microbiomes 2018;4:20. [PMID: 30210803 PMCID: PMC6123480 DOI: 10.1038/s41522-018-0065-2] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Received: 01/26/2018] [Revised: 07/19/2018] [Accepted: 08/02/2018] [Indexed: 02/07/2023]

Wang Y, Fu L, Ren J, Yu Z, Chen T, Sun F. Identifying Group-Specific Sequences for Microbial Communities Using Long k-mer Sequence Signatures. Front Microbiol 2018;9:872. [PMID: 29774017 PMCID: PMC5943621 DOI: 10.3389/fmicb.2018.00872] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.3] [Received: 11/15/2017] [Accepted: 04/16/2018] [Indexed: 12/19/2022]

Abstract

Comparing metagenomic samples is crucial for understanding microbial communities. For different groups of microbial communities, such as human gut metagenomic samples from patients with a certain disease and healthy controls, identifying group-specific sequences offers essential information for potential biomarker discovery. A sequence that is present, or rich, in one group, but absent, or scarce, in another group is considered "group-specific" in our study. Our main purpose is to discover group-specific sequence regions between control and case groups as disease-associated markers. We developed a long k-mer (k ≥ 30 bps)-based computational pipeline to detect group-specific sequences at strain resolution free from reference sequences, sequence alignments, and metagenome-wide de novo assembly. We called our method MetaGO: Group-specific oligonucleotide analysis for metagenomic samples. An open-source pipeline on Apache Spark was developed with parallel computing. We applied MetaGO to one simulated and three real metagenomic datasets to evaluate the discriminative capability of identified group-specific markers. In the simulated dataset, 99.11% of group-specific logical 40-mers covered 98.89% disease-specific regions from the disease-associated strain. In addition, 97.90% of group-specific numerical 40-mers covered 99.61 and 96.39% of differentially abundant genome and regions between two groups, respectively. For a large-scale metagenomic liver cirrhosis (LC)-associated dataset, we identified 37,647 group-specific 40-mer features. Any one of the features can predict disease status of the training samples with the average of sensitivity and specificity higher than 0.8. The random forests classification using the top 10 group-specific features yielded a higher AUC (from ∼0.8 to ∼0.9) than that of previous studies. All group-specific 40-mers were present in LC patients, but not healthy controls. 
All the assembled 11 LC-specific sequences can be mapped to two strains of Veillonella parvula: UTDB1-3 and DSM2008. The experiments on the other two real datasets related to Inflammatory Bowel Disease and Type 2 Diabetes in Women consistently demonstrated that MetaGO achieved better prediction accuracy with fewer features compared to previous studies. The experiments showed that MetaGO is a powerful tool for identifying group-specific k-mers, which would be clinically applicable for disease prediction. MetaGO is available at https://github.com/VVsmileyx/MetaGO.

Jia Y, Li H, Wang J, Meng H, Yang Z. Spectrum structures and biological functions of 8-mers in the human genome. Genomics 2018. [PMID: 29522801 DOI: 10.1016/j.ygeno.2018.03.006] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Indexed: 11/30/2022]

Zheng Y, Li H, Wang Y, Meng H, Zhang Q, Zhao X. Evolutionary mechanism and biological functions of 8-mers containing CG dinucleotide in yeast. Chromosome Res 2017;25:173-189. [PMID: 28181048 DOI: 10.1007/s10577-017-9554-z] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 10/31/2016] [Revised: 12/27/2016] [Accepted: 01/27/2017] [Indexed: 01/01/2023]

Bonnici V, Manca V. Informational laws of genome structures. Sci Rep 2016;6:28840. [PMID: 27354155 PMCID: PMC4937431 DOI: 10.1038/srep28840] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.1] [Received: 02/01/2016] [Accepted: 06/09/2016] [Indexed: 01/06/2023]

Ondov BD, Treangen TJ, Melsted P, Mallonee AB, Bergman NH, Koren S, Phillippy AM. Mash: fast genome and metagenome distance estimation using MinHash. Genome Biol 2016;17:132. [PMID: 27323842 PMCID: PMC4915045 DOI: 10.1186/s13059-016-0997-x] [Citation(s) in RCA: 1503] [Impact Index Per Article: 187.9] [Received: 12/31/2015] [Accepted: 06/03/2016] [Indexed: 02/07/2023]

Rare k-mer DNA: Identification of sequence motifs and prediction of CpG island and promoter. J Theor Biol 2015;387:88-100. [PMID: 26427337 DOI: 10.1016/j.jtbi.2015.09.014] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.2] [Received: 01/01/2015] [Revised: 09/10/2015] [Accepted: 09/15/2015] [Indexed: 12/20/2022]

Daly GM, Leggett RM, Rowe W, Stubbs S, Wilkinson M, Ramirez-Gonzalez RH, Caccamo M, Bernal W, Heeney JL. Host Subtraction, Filtering and Assembly Validations for Novel Viral Discovery Using Next Generation Sequencing Data. PLoS One 2015;10:e0129059. [PMID: 26098299 PMCID: PMC4476701 DOI: 10.1371/journal.pone.0129059] [Citation(s) in RCA: 28] [Impact Index Per Article: 3.1] [Received: 09/10/2014] [Accepted: 05/04/2015] [Indexed: 12/18/2022]

Rojas M, Golovko G, Khanipov K, Albayrak L, Chumakov S, Pettitt BM, Strongin AY, Fofanov Y. Secondary Analysis of the NCI-60 Whole Exome Sequencing Data Indicates Significant Presence of Propionibacterium acnes Genomic Material in Leukemia (RPMI-8226) and Central Nervous System (SF-295, SF-539, and SNB-19) Cell Lines. PLoS One 2015;10:e0127799. [PMID: 26039084 PMCID: PMC4454691 DOI: 10.1371/journal.pone.0127799] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 01/04/2015] [Accepted: 04/18/2015] [Indexed: 11/25/2022]

Affiliation(s)

Mark Rojas Department of Pharmacology and Toxicology, University of Texas Medical Branch, Galveston, Texas, United States of America Sealy Center for Structural Biology, University of Texas Medical Branch, Galveston, Texas, United States of America
Georgiy Golovko Department of Pharmacology and Toxicology, University of Texas Medical Branch, Galveston, Texas, United States of America Sealy Center for Structural Biology, University of Texas Medical Branch, Galveston, Texas, United States of America
Kamil Khanipov Department of Pharmacology and Toxicology, University of Texas Medical Branch, Galveston, Texas, United States of America Sealy Center for Structural Biology, University of Texas Medical Branch, Galveston, Texas, United States of America
Levent Albayrak Department of Pharmacology and Toxicology, University of Texas Medical Branch, Galveston, Texas, United States of America Sealy Center for Structural Biology, University of Texas Medical Branch, Galveston, Texas, United States of America
Sergei Chumakov Department of Physics, University of Guadalajara, Guadalajara, Jalisco, Mexico
B. Montgomery Pettitt Department of Pharmacology and Toxicology, University of Texas Medical Branch, Galveston, Texas, United States of America Department of Biochemistry and Molecular Biology, University of Texas Medical Branch, Galveston, Texas, United States of America Sealy Center for Structural Biology, University of Texas Medical Branch, Galveston, Texas, United States of America
Alex Y. Strongin Inflammatory and Infectious Disease Center/Cancer Research Center, Sanford-Burnham Medical Research Institute, La Jolla, California, United States of America
Yuriy Fofanov Department of Pharmacology and Toxicology, University of Texas Medical Branch, Galveston, Texas, United States of America Sealy Center for Structural Biology, University of Texas Medical Branch, Galveston, Texas, United States of America

Tabb LP, Zhao W, Huang J, Rosen GL. Characterizing the empirical distribution of prokaryotic genome n-mers in the presence of nullomers. J Comput Biol 2014;21:732-40. [PMID: 25075627 DOI: 10.1089/cmb.2014.0108] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/12/2022]

Wang Y, Leung HCM, Yiu SM, Chin FYL. MetaCluster-TA: taxonomic annotation for metagenomic data based on assembly-assisted binning. BMC Genomics 2014;15 Suppl 1:S12. [PMID: 24564377 PMCID: PMC4046714 DOI: 10.1186/1471-2164-15-s1-s12] [Citation(s) in RCA: 35] [Impact Index Per Article: 3.5] [Indexed: 12/18/2022]

Abstract

BACKGROUND

Taxonomic annotation of reads is an important problem in metagenomic analysis. Existing annotation tools, which rely on the approach of aligning each read to the taxonomic structure, are unable to annotate many reads efficiently and accurately as reads (~100 bp) are short and most of them come from unknown genomes. Previous work has suggested assembling the reads to make longer contigs before annotation. More reads/contigs can be annotated as a longer contig (in Kbp) can be aligned to a taxon even if it is from an unknown species as long as it contains a conserved region of that taxon. Unfortunately existing metagenomic assembly tools are not mature enough to produce long enough contigs. Binning tries to group reads/contigs of similar species together. Intuitively, reads in the same group (cluster) should be annotated to the same taxon and these reads altogether should cover a significant portion of the genome alleviating the problem of short contigs if the quality of binning is high. However, no existing work has tried to use binning results to help solve the annotation problem. This work explores this direction.

RESULTS

In this paper, we describe MetaCluster-TA, an assembly-assisted binning-based annotation tool which relies on an innovative idea of annotating binned reads instead of aligning each read or contig to the taxonomic structure separately. We propose the novel concept of the 'virtual contig' (which can be up to 10 Kb in length) to represent a set of reads and then represent each cluster as a set of 'virtual contigs' (which together can be total up to 1 Mb in length) for annotation. MetaCluster-TA can outperform widely-used MEGAN4 and can annotate (1) more reads since the virtual contigs are much longer; (2) more accurately since each cluster of long virtual contigs contains global information of the sampled genome which tends to be more accurate than short reads or assembled contigs which contain only local information of the genome; and (3) more efficiently since there are much fewer long virtual contigs to align than short reads. MetaCluster-TA outperforms MetaCluster 5.0 as a binning tool since binning itself can be more sensitive and precise given long virtual contigs and the binning results can be improved using the reference taxonomic database.

CONCLUSIONS

MetaCluster-TA can outperform widely-used MEGAN4 and can annotate more reads with higher accuracy and higher efficiency. It also outperforms MetaCluster 5.0 as a binning tool.

Solovyov A, Lipkin WI. Centroid based clustering of high throughput sequencing reads based on n-mer counts. BMC Bioinformatics 2013;14:268. [PMID: 24011402 PMCID: PMC3848435 DOI: 10.1186/1471-2105-14-268] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.3] [Received: 05/14/2013] [Accepted: 09/05/2013] [Indexed: 11/10/2022]

Bystrykh LV. A combinatorial approach to the restriction of a mouse genome. BMC Res Notes 2013;6:284. [PMID: 23875927 PMCID: PMC3724700 DOI: 10.1186/1756-0500-6-284] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 09/10/2012] [Accepted: 07/16/2013] [Indexed: 11/10/2022] Open

Wang Y, Leung HCM, Yiu SM, Chin FYL. MetaCluster 5.0: a two-round binning approach for metagenomic data for low-abundance species in a noisy sample. Bioinformatics 2012;28:i356-i362. [PMID: 22962452 PMCID: PMC3436824 DOI: 10.1093/bioinformatics/bts397] [Citation(s) in RCA: 97] [Impact Index Per Article: 8.8] [Indexed: 11/29/2022] Open

Castellini A, Franco G, Manca V. A dictionary based informational genome analysis. BMC Genomics 2012;13:485. [PMID: 22985068 PMCID: PMC3577435 DOI: 10.1186/1471-2164-13-485] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.3] [Received: 01/09/2012] [Accepted: 08/28/2012] [Indexed: 11/16/2022] Open

Peng Y, Leung HCM, Yiu SM, Chin FYL. Meta-IDBA: a de Novo assembler for metagenomic data. Bioinformatics 2011;27:i94-101. [PMID: 21685107 PMCID: PMC3117360 DOI: 10.1093/bioinformatics/btr216] [Citation(s) in RCA: 237] [Impact Index Per Article: 18.2] [Indexed: 11/14/2022] Open

Davenport CF, Tümmler B. Abundant oligonucleotides common to most bacteria. PLoS One 2010;5:e9841. [PMID: 20352124 PMCID: PMC2843746 DOI: 10.1371/journal.pone.0009841] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Received: 08/05/2009] [Accepted: 03/03/2010] [Indexed: 11/25/2022] Open

Chor B, Horn D, Goldman N, Levy Y, Massingham T. Genomic DNA k-mer spectra: models and modalities. Genome Biol 2009;10:R108. [PMID: 19814784 PMCID: PMC2784323 DOI: 10.1186/gb-2009-10-10-r108] [Citation(s) in RCA: 120] [Impact Index Per Article: 8.0] [Received: 03/17/2009] [Revised: 08/14/2009] [Accepted: 10/08/2009] [Indexed: 11/12/2022] Open

Acquisti C, Poste G, Curtiss D, Kumar S. Nullomers: really a matter of natural selection? PLoS One 2007;2:e1022. [PMID: 17925870 PMCID: PMC1995752 DOI: 10.1371/journal.pone.0001022] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.5] [Received: 08/27/2007] [Accepted: 09/19/2007] [Indexed: 11/23/2022] Open

King BR, Guda C. ngLOC: an n-gram-based Bayesian method for estimating the subcellular proteomes of eukaryotes. Genome Biol 2007;8:R68. [PMID: 17472741 PMCID: PMC1929137 DOI: 10.1186/gb-2007-8-5-r68] [Citation(s) in RCA: 86] [Impact Index Per Article: 5.1] [Received: 11/07/2006] [Revised: 02/19/2007] [Accepted: 05/01/2007] [Indexed: 11/23/2022] Open

Reed C, Fofanov V, Putonti C, Chumakov S, Slezak T, Fofanov Y. Effect of the mutation rate and background size on the quality of pathogen identification. Bioinformatics 2007;23:2665-71. [PMID: 17881407 DOI: 10.1093/bioinformatics/btm420] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.2] [Indexed: 11/13/2022]

Salerno W, Havlak P, Miller J. Scale-invariant structure of strongly conserved sequence in genomic intersections and alignments. Proc Natl Acad Sci U S A 2006;103:13121-5. [PMID: 16924100 PMCID: PMC1559763 DOI: 10.1073/pnas.0605735103] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.0] [Indexed: 12/17/2022] Open

Tran T, Havlak P, Miller J. MicroRNA enrichment among short 'ultraconserved' sequences in insects. Nucleic Acids Res 2006;34:e65. [PMID: 16698958 PMCID: PMC3303174 DOI: 10.1093/nar/gkl173] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.8] [Indexed: 12/20/2022] Open

Putonti C, Chumakov S, Mitra R, Fox GE, Willson RC, Fofanov Y. Human-blind probes and primers for dengue virus identification. FEBS J 2006;273:398-408. [PMID: 16403026 DOI: 10.1111/j.1742-4658.2005.05074.x] [Citation(s) in RCA: 15] [Impact Index Per Article: 0.8] [Indexed: 01/29/2023]

Gangal R, Sharma P. Human pol II promoter prediction: time series descriptors and machine learning. Nucleic Acids Res 2005;33:1332-6. [PMID: 15741185 PMCID: PMC552959 DOI: 10.1093/nar/gki271] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.2] [Received: 09/09/2004] [Revised: 02/08/2005] [Accepted: 02/08/2005] [Indexed: 11/14/2022] Open

Current Awareness on Comparative and Functional Genomics. Comp Funct Genomics 2005. [PMCID: PMC2447482 DOI: 10.1002/cfg.421] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/16/2022] Open