1
Şapcı AOB, Mirarab S. Memory-bound k-mer selection for large and evolutionarily diverse reference libraries. Genome Res 2024; 34:1455-1467. [PMID: 39209553 PMCID: PMC11529837 DOI: 10.1101/gr.279339.124] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/15/2024] [Accepted: 08/06/2024] [Indexed: 09/04/2024]
Abstract
K-mers are increasingly used to find sequence matches in many bioinformatic applications, including metagenomic sequence classification. The accuracy of these downstream applications relies on the density of the reference databases, which are rapidly growing. Although the increased density provides hope for improvements in accuracy, scalability is a concern. Reference k-mers are kept in memory at query time, and saving all k-mers of these ever-expanding databases is fast becoming impractical. Several strategies for subsampling have been proposed, including minimizers and finding taxon-specific k-mers. However, we contend that these strategies are inadequate, especially when reference sets are taxonomically imbalanced, as are most microbial libraries. In this paper, we explore approaches for selecting a fixed-size subset of k-mers present in an ultra-large data set to include in a library such that the classification of reads suffers the least. Our experiments demonstrate the limitations of existing approaches, especially for novel and poorly sampled groups. We propose a library construction algorithm called k-mer RANKer (KRANK) that combines several components, including a hierarchical selection strategy with adaptive size restrictions and an equitable coverage strategy. We implement KRANK in highly optimized code and combine it with the locality-sensitive hashing classifier CONSULT-II to build a taxonomic classification and profiling method. On several benchmarks, KRANK k-mer selection significantly reduces memory consumption with minimal loss in classification accuracy. We show in extensive analyses based on CAMI benchmarks that KRANK outperforms k-mer-based alternatives in terms of taxonomic profiling and comes close to the best marker-based methods in terms of accuracy.
Affiliation(s)
- Ali Osman Berk Şapcı
  - Bioinformatics and Systems Biology Graduate Program, University of California, San Diego, California 92093, USA
- Siavash Mirarab
  - Bioinformatics and Systems Biology Graduate Program, University of California, San Diego, California 92093, USA
  - Department of Electrical and Computer Engineering, University of California, San Diego, California 92093, USA
2
Edwards SV, Cloutier A, Cockburn G, Driver R, Grayson P, Katoh K, Baldwin MW, Sackton TB, Baker AJ. A nuclear genome assembly of an extinct flightless bird, the little bush moa. Sci Adv 2024; 10:eadj6823. [PMID: 38781323 PMCID: PMC11809649 DOI: 10.1126/sciadv.adj6823] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/10/2023] [Accepted: 04/17/2024] [Indexed: 05/25/2024]
Abstract
We present a draft genome of the little bush moa (Anomalopteryx didiformis), one of approximately nine species of extinct flightless birds from Aotearoa, New Zealand, using ancient DNA recovered from a fossil bone from the South Island. We recover a complete mitochondrial genome at 249.9× depth of coverage and almost 900 megabases of a male moa nuclear genome at ~4 to 5× coverage, with sequence contiguity sufficient to identify more than 85% of avian universal single-copy orthologs. We describe a diverse landscape of transposable elements and satellite repeats, estimate a long-term effective population size of ~240,000, identify a diverse suite of olfactory receptor genes and an opsin repertoire with sensitivity in the ultraviolet range, show that the wingless moa phenotype is likely not attributable to gene loss or pseudogenization, and identify potential function-altering coding sequence variants in moa that could be synthesized for future functional assays. This genomic resource should support further studies of avian evolution and morphological divergence.
Affiliation(s)
- Scott V. Edwards
  - Department of Organismic and Evolutionary Biology, Harvard University, 26 Oxford Street, Cambridge, MA 02138, USA
  - Museum of Comparative Zoology, Harvard University, 26 Oxford Street, Cambridge, MA 02138, USA
- Alison Cloutier
  - Department of Organismic and Evolutionary Biology, Harvard University, 26 Oxford Street, Cambridge, MA 02138, USA
- Glenn Cockburn
  - Evolution of Sensory Systems Research Group, Max Planck Institute for Biological Intelligence, 82319 Seewiesen, Germany
- Robert Driver
  - Department of Biology, East Carolina University, E 5th Street, Greenville, NC 27605, USA
- Phil Grayson
  - Department of Organismic and Evolutionary Biology, Harvard University, 26 Oxford Street, Cambridge, MA 02138, USA
  - Museum of Comparative Zoology, Harvard University, 26 Oxford Street, Cambridge, MA 02138, USA
- Kazutaka Katoh
  - Department of Genome Informatics, Research Institute for Microbial Diseases, Osaka University, 3-1 Yamadaoka, Suita 565-0871, Japan
- Maude W. Baldwin
  - Evolution of Sensory Systems Research Group, Max Planck Institute for Biological Intelligence, 82319 Seewiesen, Germany
- Timothy B. Sackton
  - Informatics Group, Harvard University, 38 Oxford Street, Cambridge, MA 02138, USA
- Allan J. Baker
  - Department of Ecology and Evolutionary Biology, University of Toronto, 25 Willcox Street, Toronto, ON M5S 3B2, Canada
  - Department of Natural History, Royal Ontario Museum, 100 Queen’s Park, Toronto, ON M5S 2C6, Canada
3
Şapcı AOB, Rachtman E, Mirarab S. CONSULT-II: accurate taxonomic identification and profiling using locality-sensitive hashing. Bioinformatics 2024; 40:btae150. [PMID: 38492564 PMCID: PMC10985673 DOI: 10.1093/bioinformatics/btae150] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/18/2023] [Revised: 02/17/2024] [Accepted: 03/14/2024] [Indexed: 03/18/2024] Open
Abstract
MOTIVATION: Taxonomic classification of short reads and taxonomic profiling of metagenomic samples are well-studied yet challenging problems. The presence of species belonging to groups without close representation in a reference dataset is particularly challenging. While k-mer-based methods have performed well in terms of running time and accuracy, they tend to have reduced accuracy for such novel species. Thus, there is a growing need for methods that combine the scalability of k-mers with increased sensitivity.
RESULTS: Here, we show that using locality-sensitive hashing (LSH) can increase the sensitivity of the k-mer-based search. Our method, which combines LSH with several heuristic techniques including soft lowest common ancestor labeling and voting, is more accurate than alternatives in both taxonomic classification of individual reads and abundance profiling.
AVAILABILITY AND IMPLEMENTATION: CONSULT-II is implemented in C++, and the software, together with reference libraries, is publicly available on GitHub at https://github.com/bo1929/CONSULT-II.
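The idea of using LSH to make k-mer search tolerant of mismatches can be illustrated with a small sketch. The abstract does not say which hash family CONSULT-II uses, so the bit-sampling scheme below (hashing a k-mer by a subset of its positions, suitable for Hamming distance) is an assumption chosen for illustration, and all function names are hypothetical rather than part of the CONSULT-II software.

```python
import random
from collections import defaultdict

def sample_positions(k: int, h: int, seed: int) -> list[int]:
    """Pick h of the k positions at random (hypothetical helper)."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(k), h))

def lsh_key(kmer: str, positions: list[int]) -> str:
    """Hash a k-mer by keeping only the sampled positions."""
    return "".join(kmer[p] for p in positions)

def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def build_index(kmers, tables):
    """One bucket table per position sample; similar k-mers tend to collide."""
    index = [defaultdict(set) for _ in tables]
    for kmer in kmers:
        for buckets, positions in zip(index, tables):
            buckets[lsh_key(kmer, positions)].add(kmer)
    return index

def query(kmer, index, tables, max_dist):
    """Collect bucket collisions, then keep those within max_dist mismatches."""
    candidates = set()
    for buckets, positions in zip(index, tables):
        candidates |= buckets.get(lsh_key(kmer, positions), set())
    return {c for c in candidates if hamming(c, kmer) <= max_dist}
```

A reference k-mer is found whenever the query agrees with it on every sampled position of at least one table, so a mismatch outside the sampled positions does not prevent a hit. Using more tables raises the chance of finding a near match at the cost of memory.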
Affiliation(s)
- Ali Osman Berk Şapcı
  - Bioinformatics and Systems Biology Graduate Program, University of California, San Diego, CA 92093, United States
- Eleonora Rachtman
  - Bioinformatics and Systems Biology Graduate Program, University of California, San Diego, CA 92093, United States
- Siavash Mirarab
  - Bioinformatics and Systems Biology Graduate Program, University of California, San Diego, CA 92093, United States
  - Department of Electrical and Computer Engineering, University of California, San Diego, CA 92093, United States
4
Chorlton SD. Ten common issues with reference sequence databases and how to mitigate them. Front Bioinform 2024; 4:1278228. [PMID: 38560517 PMCID: PMC10978663 DOI: 10.3389/fbinf.2024.1278228] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/15/2023] [Accepted: 03/05/2024] [Indexed: 04/04/2024] Open
Abstract
Metagenomic sequencing has revolutionized our understanding of microbiology. While metagenomic tools and approaches have been extensively evaluated and benchmarked, far less attention has been given to the reference sequence database used in metagenomic classification. Issues with reference sequence databases are pervasive. Database contamination is the most recognized issue in the literature; however, it remains relatively unmitigated in most analyses. Other common issues with reference sequence databases include taxonomic errors, inappropriate inclusion and exclusion criteria, and sequence content errors. This review covers ten common issues with reference sequence databases and the potential downstream consequences of these issues. Mitigation measures are discussed for each issue, including bioinformatic tools and database curation strategies. Together, these strategies present a path towards more accurate, reproducible and translatable metagenomic sequencing.
5
Bálint B, Merényi Z, Hegedüs B, Grigoriev IV, Hou Z, Földi C, Nagy LG. ContScout: sensitive detection and removal of contamination from annotated genomes. Nat Commun 2024; 15:936. [PMID: 38296951 PMCID: PMC10831095 DOI: 10.1038/s41467-024-45024-5] [Citation(s) in RCA: 6] [Impact Index Per Article: 6.0] [Received: 01/06/2023] [Accepted: 01/08/2024] [Indexed: 02/02/2024] Open
Abstract
Contamination of genomes is an increasingly recognized problem affecting several downstream applications, from comparative evolutionary genomics to metagenomics. Here we introduce ContScout, a precise tool for eliminating foreign sequences from annotated genomes. It achieves high specificity and sensitivity on synthetic benchmark data even when the contaminant is a closely related species, outperforms competing tools, and can distinguish horizontal gene transfer from contamination. A screen of 844 eukaryotic genomes for contamination identified bacteria as the most common source, followed by fungi and plants. Furthermore, we show that contaminants in ancestral genome reconstructions lead to erroneous early origins of genes and inflate gene loss rates, leading to a false notion of complex ancestral genomes. Taken together, we offer here a tool for sensitive removal of foreign proteins, identify and remove contaminants from diverse eukaryotic genomes and evaluate their impact on phylogenomic analyses.
Affiliation(s)
- Balázs Bálint
  - Synthetic and Systems Biology Unit, HUN-REN Biological Research Centre, Szeged, Szeged, 6726, Hungary
- Zsolt Merényi
  - Synthetic and Systems Biology Unit, HUN-REN Biological Research Centre, Szeged, Szeged, 6726, Hungary
- Botond Hegedüs
  - Synthetic and Systems Biology Unit, HUN-REN Biological Research Centre, Szeged, Szeged, 6726, Hungary
- Igor V Grigoriev
  - U.S. Department of Energy Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA
  - Department of Plant and Microbial Biology, University of California Berkeley, Berkeley, CA, 94720, USA
- Zhihao Hou
  - Synthetic and Systems Biology Unit, HUN-REN Biological Research Centre, Szeged, Szeged, 6726, Hungary
  - Doctoral School of Biology, Faculty of Science and Informatics, University of Szeged, Szeged, 6720, Hungary
- Csenge Földi
  - Synthetic and Systems Biology Unit, HUN-REN Biological Research Centre, Szeged, Szeged, 6726, Hungary
  - Doctoral School of Biology, Faculty of Science and Informatics, University of Szeged, Szeged, 6720, Hungary
- László G Nagy
  - Synthetic and Systems Biology Unit, HUN-REN Biological Research Centre, Szeged, Szeged, 6726, Hungary
6
Alvarez RV, Landsman D. GTax: improving de novo transcriptome assembly by removing foreign RNA contamination. Genome Biol 2024; 25:12. [PMID: 38191464 PMCID: PMC10773103 DOI: 10.1186/s13059-023-03141-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/08/2022] [Accepted: 12/08/2023] [Indexed: 01/10/2024] Open
Abstract
The cost and complexity of generating a complete reference genome means that many organisms lack an annotated reference. An alternative is to use a de novo reference transcriptome. This technology is cost-effective but is susceptible to off-target RNA contamination. In this manuscript, we present GTax, a taxonomy-structured database of genomic sequences that can be used with BLAST to detect and remove foreign contamination in RNA sequencing samples before assembly. In addition, we use a de novo transcriptome assembly of Solanum lycopersicum (tomato) to demonstrate that removing foreign contamination in sequencing samples reduces the number of assembled chimeric transcripts.
Affiliation(s)
- Roberto Vera Alvarez
  - Computational Biology Branch, National Center for Biotechnology Information, Intramural Research Program, National Library of Medicine, NIH, Bethesda, MD, USA
- David Landsman
  - Computational Biology Branch, National Center for Biotechnology Information, Intramural Research Program, National Library of Medicine, NIH, Bethesda, MD, USA
7
Mirarab S, Bafna V. Analyses of Nuclear Reads Obtained Using Genome Skimming. Methods Mol Biol 2024; 2744:247-265. [PMID: 38683324 DOI: 10.1007/978-1-0716-3581-0_16] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/01/2024]
Abstract
In this protocol paper, we review a set of methods developed in recent years for analyzing nuclear reads obtained from genome skimming. As the cost of sequencing drops, genome skimming (low-coverage shotgun sequencing of a sample) becomes an increasingly cost-effective method of measuring biodiversity at high resolution. While most practitioners only use assembled over-represented organelle reads from a genome skim, the vast majority of the reads are nuclear. Using assembly-free and alignment-free methods described in this protocol, we can compare samples to each other and to reference genomes to compute distances, characterize underlying genomes, and infer evolutionary relationships.
Affiliation(s)
- Siavash Mirarab
  - Electrical and Computer Engineering, University of California-San Diego, La Jolla, CA, USA
- Vineet Bafna
  - Computer Science and Engineering, University of California-San Diego, La Jolla, CA, USA
8
Zheng H, Marçais G, Kingsford C. Creating and Using Minimizer Sketches in Computational Genomics. J Comput Biol 2023; 30:1251-1276. [PMID: 37646787 PMCID: PMC11082048 DOI: 10.1089/cmb.2023.0094] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/01/2023] Open
Abstract
Processing large data sets has become an essential part of computational genomics. Greatly increased availability of sequence data from multiple sources has fueled breakthroughs in genomics and related fields but has led to computational challenges processing large sequencing experiments. The minimizer sketch is a popular method for sequence sketching that underlies core steps in computational genomics such as read mapping, sequence assembling, k-mer counting, and more. In most applications, minimizer sketches are constructed using one of few classical approaches. More recently, efforts have been put into building minimizer sketches with desirable properties compared with the classical constructions. In this survey, we review the history of the minimizer sketch, the theories developed around the concept, and the plethora of applications taking advantage of such sketches. We aim to provide the readers a comprehensive picture of the research landscape involving minimizer sketches, in anticipation of better fusion of theory and application in the future.
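For readers unfamiliar with the minimizer sketch, a minimal version of the classical window-based construction can be written in a few lines. The sketch below assumes a plain lexicographic ordering of k-mers and leftmost tie-breaking, which is only one of the classical choices; the function name and parameters `k` (k-mer length) and `w` (number of consecutive k-mers per window) are illustrative and not taken from the survey.

```python
def minimizers(seq: str, k: int, w: int) -> list[tuple[int, str]]:
    """Return (position, k-mer) pairs chosen as window minimizers.

    Every window of w consecutive k-mers contributes its smallest k-mer
    (lexicographic order, leftmost on ties); repeated picks are merged.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    selected: list[tuple[int, str]] = []
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        # min() over offsets returns the leftmost smallest k-mer in the window
        offset = min(range(w), key=lambda j: window[j])
        pick = (start + offset, window[offset])
        # consecutive windows often share a minimizer; record it once
        if not selected or selected[-1] != pick:
            selected.append(pick)
    return selected
```

For example, `minimizers("ACGTACGT", k=3, w=2)` keeps four of the six k-mers. Because adjacent windows overlap, the same k-mer is often picked repeatedly, which is what makes the sketch smaller than the full k-mer set while still guaranteeing one sampled k-mer per window.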
Affiliation(s)
- Hongyu Zheng
  - Computer Science Department, Princeton University, Princeton, New Jersey, USA
- Guillaume Marçais
  - Computational Biology Department, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA
- Carl Kingsford
  - Computational Biology Department, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA
9
Rumbavicius I, Rounge TB, Rognes T. HoCoRT: host contamination removal tool. BMC Bioinformatics 2023; 24:371. [PMID: 37784008 PMCID: PMC10544359 DOI: 10.1186/s12859-023-05492-w] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 05/25/2023] [Accepted: 09/21/2023] [Indexed: 10/04/2023] Open
Abstract
BACKGROUND: Shotgun metagenome sequencing data obtained from a host environment will usually be contaminated with sequences from the host organism. Host sequences should be removed before further analysis to avoid biases, reduce downstream computational load, or ensure privacy in the case of a human host. The tools that we identified as designed specifically for host contamination sequence removal were either outdated, not maintained, or complicated to use. Consequently, we have developed HoCoRT, a fast and user-friendly tool that implements several methods for optimised host sequence removal. We have evaluated the speed and accuracy of these methods.
RESULTS: HoCoRT is an open-source command-line tool for host contamination removal. It is designed to be easy to install and use, offering a one-step option for genome indexing. HoCoRT employs a variety of well-known mapping, classification, and alignment methods to classify reads. The user can select the underlying classification method and its parameters, allowing adaptation to different scenarios. Based on our investigation of various methods and parameters using synthetic human gut and oral microbiomes, and on assessment of publicly available data, we provide recommendations for typical datasets with short and long reads.
CONCLUSIONS: To decontaminate a human gut microbiome with short reads using HoCoRT, we found the optimal combination of speed and accuracy with BioBloom, Bowtie2 in end-to-end mode, and HISAT2. Kraken2 consistently demonstrated the highest speed, albeit with a trade-off in accuracy. The same applies to an oral microbiome, but here Bowtie2 was notably slower than the other tools. For long reads, the detection of human host reads is more difficult. In this case, a combination of Kraken2 and Minimap2 achieved the highest accuracy and detected 59% of human reads. In comparison to the dedicated DeconSeq tool, HoCoRT using Bowtie2 in end-to-end mode proved considerably faster and slightly more accurate. HoCoRT is available as a Bioconda package, and the source code can be accessed at https://github.com/ignasrum/hocort along with the documentation. It is released under the MIT licence and is compatible with Linux and macOS (except for the BioBloom module).
Affiliation(s)
- Ignas Rumbavicius
  - Centre for Bioinformatics, Department of Informatics, University of Oslo, PO Box 1080 Blindern, 0316, Oslo, Norway
- Trine B Rounge
  - Centre for Bioinformatics, Department of Pharmacy, University of Oslo, PO Box 1068 Blindern, 0316, Oslo, Norway
  - Cancer Registry of Norway, PO Box 5313 Majorstuen, 0304, Oslo, Norway
- Torbjørn Rognes
  - Centre for Bioinformatics, Department of Informatics, University of Oslo, PO Box 1080 Blindern, 0316, Oslo, Norway
  - Department of Microbiology, Oslo University Hospital, PO Box 4950 Nydalen, 0424, Oslo, Norway
10
Rachtman E, Sarmashghi S, Bafna V, Mirarab S. Quantifying the uncertainty of assembly-free genome-wide distance estimates and phylogenetic relationships using subsampling. Cell Syst 2022; 13:817-829.e3. [PMID: 36265468 PMCID: PMC9589918 DOI: 10.1016/j.cels.2022.06.007] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/24/2021] [Revised: 03/14/2022] [Accepted: 06/28/2022] [Indexed: 01/26/2023]
Abstract
Computing the distance between two genomes without alignments or even access to assemblies supports many downstream analyses. However, alignment-free methods, including in the fast-growing field of genome skimming, are hampered by a significant methodological gap. While accurate methods (many k-mer-based) for assembly-free distance calculation exist, measuring the uncertainty of estimated distances has not been sufficiently studied. In this paper, we show that bootstrapping, the standard non-parametric method of measuring estimator uncertainty, is not accurate for k-mer-based methods that rely on k-mer frequency profiles. Instead, we propose using subsampling (with no replacement) in combination with a correction step to reduce the variance of the inferred distribution. We show that the distribution of distances using our procedure matches the true uncertainty of the estimator. The resulting phylogenetic support values effectively differentiate between correct and incorrect branches and identify controversial branches that change across alignment-free and alignment-based phylogenies reported in the literature.
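The difference between the two resampling schemes named in the abstract can be shown on a toy list of reads. This is only a sketch of the resampling step: the correction step is not described in the abstract and is deliberately omitted, and the function names and the `fraction` parameter are hypothetical.

```python
import random

def bootstrap(reads: list[str], rng: random.Random) -> list[str]:
    """Resample WITH replacement: same size as the input, so some reads
    appear several times and others are dropped."""
    return rng.choices(reads, k=len(reads))

def subsample(reads: list[str], fraction: float, rng: random.Random) -> list[str]:
    """Resample WITHOUT replacement: a smaller set in which every read
    appears at most once."""
    size = int(len(reads) * fraction)
    return rng.sample(reads, size)
```

Duplicated reads in a bootstrap replicate inflate the counts of the k-mers they contain, which distorts a k-mer frequency profile. A subsample keeps every count at or below its original value, at the price of a smaller sample whose variance then needs to be corrected.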
Affiliation(s)
- Eleonora Rachtman
  - Bioinformatics and Systems Biology Graduate Program, UC San Diego, San Diego, CA 92093, USA
- Shahab Sarmashghi
  - Department of Electrical and Computer Engineering, UC San Diego, San Diego, CA 92093, USA
- Vineet Bafna
  - Department of Computer Science and Engineering, UC San Diego, San Diego, CA 92093, USA
- Siavash Mirarab
  - Department of Electrical and Computer Engineering, UC San Diego, San Diego, CA 92093, USA
11
Jin P, Dai J, Guo Y, Wang X, Lu J, Zhu Y, Yu F. Genomic Analysis of Mycobacterium abscessus Complex Isolates from Patients with Pulmonary Infection in China. Microbiol Spectr 2022; 10:e0011822. [PMID: 35863029 PMCID: PMC9430165 DOI: 10.1128/spectrum.00118-22] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/11/2022] [Accepted: 06/19/2022] [Indexed: 11/20/2022] Open
Abstract
Members of the Mycobacterium abscessus complex (MABC) are multidrug-resistant nontuberculous mycobacteria and increasingly cause opportunistic pulmonary infections. However, the genetic typing of MABC isolates remains largely unclear in China. Genomic analyses were conducted for 69 MABC clinical isolates obtained from patients with lower respiratory tract infections in Shanghai Pulmonary Hospital between 2014 and 2016. The draft genomes of the 69 clinical strains were assembled, with a total length of 4.5 to 5.6 Mb, a percent GC content (GC%) ranging from 63.9 to 68.1%, and 4,492 to 5,404 genes per genome. Susceptibility testing shows that most isolates are resistant to many antimicrobials, including clarithromycin, but susceptible to tigecycline. Analyses revealed the presence of genes conferring resistance to antibiotics, including macrolides, aminoglycosides, rifampicin, and tetracyclines. Furthermore, 80 to 114 virulence genes were identified per genome, including those related to the invasion of macrophages, iron incorporation, and avoidance of immune clearance. Mobile genetic elements, including insertion sequences, transposons, and genomic islands, were discovered in the genomes. Phylogenetic analyses of all MABC isolates with another 41 complete MABC genomes identified three clades; 46 isolates were clustered in clade I, corresponding to M. abscessus subsp. abscessus, and 25 strains belonged to existing clonal complexes. Overall, this is the first comparative genomic analysis of MABC clinical isolates in China. These results show significant intraspecies variations in genetic determinants encoding antimicrobial resistance, virulence, and mobile elements and controversial subspecies classification using current marker gene combinations. This information will be useful in understanding the evolution, antimicrobial resistance, and pathogenesis of MABC strains and facilitating future vaccine development and drug design.
IMPORTANCE: Over the past decade, infections by Mycobacterium abscessus complex (MABC) isolates have been increasingly reported worldwide. MABC strains often show a high incidence in cystic fibrosis (CF) patients, whereas in Asia, these strains are frequently recovered from non-CF patients with significant genomic diversity. The present work involves analyses of the antimicrobial resistance, virulence, and phylogeny of 69 selected MABC isolates from non-CF pulmonary patients in Shanghai Pulmonary Hospital by whole-genome sequencing; it represents the first comprehensive investigation of MABC strains in China at the genomic level. These findings highlight the diversity of this group of nontuberculous mycobacteria and provide a mechanistic understanding of evolution and pathogenesis, which is valuable for the development of novel and effective antimicrobial therapies for deadly MABC infections in China.
Affiliation(s)
- Peipei Jin
  - Department of Laboratory Medicine, Ruijin Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, China
- Jing Dai
  - Department of Laboratory Medicine, Ruijin Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, China
- Yinjuan Guo
  - Department of Laboratory Medicine, Shanghai Pulmonary Hospital, Tongji University School of Medicine, Shanghai, China
- Xuefeng Wang
  - Department of Laboratory Medicine, Ruijin Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, China
- Jing Lu
  - Department of Biochemistry and Pharmacology, University of Melbourne, Melbourne, Victoria, Australia
- Yan Zhu
  - Immunity and Infection Program, Department of Microbiology, Biomedicine Discovery Institute, Monash University, Melbourne, Victoria, Australia
- Fangyou Yu
  - Department of Laboratory Medicine, Shanghai Pulmonary Hospital, Tongji University School of Medicine, Shanghai, China
12
Cornet L, Baurain D. Contamination detection in genomic data: more is not enough. Genome Biol 2022; 23:60. [PMID: 35189924 PMCID: PMC8862208 DOI: 10.1186/s13059-022-02619-9] [Citation(s) in RCA: 38] [Impact Index Per Article: 12.7] [Received: 10/14/2021] [Accepted: 01/18/2022] [Indexed: 12/20/2022] Open
Abstract
The decreasing cost of sequencing and concomitant augmentation of publicly available genomes have created an acute need for automated software to assess genomic contamination. During the last 6 years, 18 programs have been published, each with its own strengths and weaknesses. Deciding which tools to use becomes more and more difficult without an understanding of the underlying algorithms. We review these programs, benchmarking six of them, and present their main operating principles. This article is intended to guide researchers in the selection of appropriate tools for specific applications. Finally, we present future challenges in the developing field of contamination detection.
Affiliation(s)
- Luc Cornet
  - BCCM/IHEM, Mycology and Aerobiology, Sciensano, Bruxelles, Belgium
- Denis Baurain
  - InBioS-PhytoSYSTEMS, Eukaryotic Phylogenomics, University of Liège, Liège, Belgium
13
Sarmashghi S, Balaban M, Rachtman E, Touri B, Mirarab S, Bafna V. Estimating repeat spectra and genome length from low-coverage genome skims with RESPECT. PLoS Comput Biol 2021; 17:e1009449. [PMID: 34780468 PMCID: PMC8629397 DOI: 10.1371/journal.pcbi.1009449] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.8] [Received: 03/09/2021] [Revised: 11/29/2021] [Accepted: 09/13/2021] [Indexed: 01/26/2023] Open
Abstract
The cost of sequencing the genome is dropping at a much faster rate compared to assembling and finishing the genome. The use of lightly sampled genomes (genome-skims) could be transformative for genomic ecology, and results using k-mers have shown the advantage of this approach in identification and phylogenetic placement of eukaryotic species. Here, we revisit the basic question of estimating genomic parameters such as genome length, coverage, and repeat structure, focusing specifically on estimating the k-mer repeat spectrum. We show using a mix of theoretical and empirical analysis that there are fundamental limitations to estimating the k-mer spectra due to ill-conditioned systems, and that has implications for other genomic parameters. We get around this problem using a novel constrained optimization approach (Spline Linear Programming), where the constraints are learned empirically. On reads simulated at 1X coverage from 66 genomes, our method, REPeat SPECTra Estimation (RESPECT), had 2.2% error in length estimation compared to 27% error previously achieved. In shotgun sequenced read samples with contaminants, RESPECT length estimates had median error 4%, in contrast to other methods that had median error 80%. Together, the results suggest that low-pass genomic sequencing can yield reliable estimates of the length and repeat content of the genome. The RESPECT software will be publicly available at https://github.com/shahab-sarmashghi/RESPECT.git.
AUTHOR SUMMARY: The cost of sequencing the genome is dropping at a much faster rate compared to assembling and finishing the genome. The use of lightly sampled genomes (genome skims) could be transformative for genomic ecology. Analyzing genome skims, mostly based on statistics of small oligomers, remains challenging, but recent results have shown the advantage of this approach for the identification and phylogenetic placement of eukaryotic species. In this paper, we present a method, RESPECT, to estimate genomic properties such as genome length and repetitiveness from low-coverage genome skims. We trained RESPECT using assembled genomes and tested it on low-coverage simulated and real reads. Benchmarking results reveal that RESPECT has excellent accuracy in estimating the genome length compared to other methods, and can provide critical information regarding the repeat structure of the genome.
Affiliation(s)
- Shahab Sarmashghi
  - Department of Electrical & Computer Engineering, University of California, San Diego, La Jolla, California, United States of America
- Metin Balaban
  - Bioinformatics & Systems Biology Graduate Program, University of California, San Diego, La Jolla, California, United States of America
- Eleonora Rachtman
  - Bioinformatics & Systems Biology Graduate Program, University of California, San Diego, La Jolla, California, United States of America
- Behrouz Touri
  - Department of Electrical & Computer Engineering, University of California, San Diego, La Jolla, California, United States of America
- Siavash Mirarab
  - Department of Electrical & Computer Engineering, University of California, San Diego, La Jolla, California, United States of America
- Vineet Bafna
  - Department of Computer Science & Engineering, University of California, San Diego, La Jolla, California, United States of America