1
Pinto Y, Bhatt AS. Sequencing-based analysis of microbiomes. Nat Rev Genet 2024; 25:829-845. [PMID: 38918544 DOI: 10.1038/s41576-024-00746-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 05/15/2024] [Indexed: 06/27/2024]
Abstract
Microbiomes occupy a range of niches and, in addition to having diverse compositions, they have varied functional roles that have an impact on agriculture, environmental sciences, and human health and disease. The study of microbiomes has been facilitated by recent technological and analytical advances, such as cheaper and higher-throughput DNA and RNA sequencing, improved long-read sequencing and innovative computational analysis methods. These advances are providing a deeper understanding of microbiomes at the genomic, transcriptional and translational level, generating insights into their function and composition at resolutions beyond the species level.
Affiliation(s)
- Yishay Pinto
  - Department of Genetics, Stanford University, Stanford, CA, USA
  - Department of Medicine, Divisions of Hematology and Blood & Marrow Transplantation, Stanford University, Stanford, CA, USA
- Ami S Bhatt
  - Department of Genetics, Stanford University, Stanford, CA, USA
  - Department of Medicine, Divisions of Hematology and Blood & Marrow Transplantation, Stanford University, Stanford, CA, USA
2
Shaw J, Gounot JS, Chen H, Nagarajan N, Yu YW. Floria: fast and accurate strain haplotyping in metagenomes. Bioinformatics 2024; 40:i30-i38. [PMID: 38940183 PMCID: PMC11211831 DOI: 10.1093/bioinformatics/btae252] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/29/2024] Open
Abstract
SUMMARY Shotgun metagenomics allows for direct analysis of microbial community genetics, but scalable computational methods for the recovery of bacterial strain genomes from microbiomes remains a key challenge. We introduce Floria, a novel method designed for rapid and accurate recovery of strain haplotypes from short and long-read metagenome sequencing data, based on minimum error correction (MEC) read clustering and a strain-preserving network flow model. Floria can function as a standalone haplotyping method, outputting alleles and reads that co-occur on the same strain, as well as an end-to-end read-to-assembly pipeline (Floria-PL) for strain-level assembly. Benchmarking evaluations on synthetic metagenomes show that Floria is > 3× faster and recovers 21% more strain content than base-level assembly methods (Strainberry) while being over an order of magnitude faster when only phasing is required. Applying Floria to a set of 109 deeply sequenced nanopore metagenomes took <20 min on average per sample and identified several species that have consistent strain heterogeneity. Applying Floria's short-read haplotyping to a longitudinal gut metagenomics dataset revealed a dynamic multi-strain Anaerostipes hadrus community with frequent strain loss and emergence events over 636 days. With Floria, accurate haplotyping of metagenomic datasets takes mere minutes on standard workstations, paving the way for extensive strain-level metagenomic analyses. AVAILABILITY AND IMPLEMENTATION Floria is available at https://github.com/bluenote-1577/floria, and the Floria-PL pipeline is available at https://github.com/jsgounot/Floria_analysis_workflow along with code for reproducing the benchmarks.
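The minimum error correction (MEC) objective named in the abstract above can be illustrated with a small sketch. This is not Floria's implementation, which combines MEC read clustering with a network flow model; it only shows how an MEC score is commonly computed. The read representation (a mapping from variant position to observed allele) and the function names `mismatches` and `mec_score` are assumptions made for illustration.

```python
from typing import Dict, List

# Assumed representation: a read maps variant position -> observed allele (0 or 1).
Read = Dict[int, int]


def mismatches(read: Read, haplotype: List[int]) -> int:
    """Count covered positions where the read disagrees with the haplotype."""
    return sum(1 for pos, allele in read.items() if haplotype[pos] != allele)


def mec_score(reads: List[Read], haplotypes: List[List[int]]) -> int:
    """Sum, over all reads, of the mismatches to each read's closest haplotype."""
    return sum(min(mismatches(r, h) for h in haplotypes) for r in reads)


reads = [
    {0: 0, 1: 0, 2: 0},
    {1: 0, 2: 0, 3: 1},  # one mismatch against its closest haplotype
    {0: 1, 1: 1},
    {2: 1, 3: 1},
]
haplotypes = [[0, 0, 0, 0], [1, 1, 1, 1]]
score = mec_score(reads, haplotypes)
print(score)
```

A lower score means the candidate haplotypes explain the reads with fewer allele corrections; a phasing method searches for haplotypes and a read partition that keep this score small.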
Affiliation(s)
- Jim Shaw
  - Department of Mathematics, University of Toronto, Toronto, Ontario, M5S 2E4, Canada
- Jean-Sebastien Gounot
  - Genome Institute of Singapore (GIS), Agency for Science, Technology and Research (A*STAR), 60 Biopolis Street, Singapore, 138672, Republic of Singapore
- Hanrong Chen
  - Genome Institute of Singapore (GIS), Agency for Science, Technology and Research (A*STAR), 60 Biopolis Street, Singapore, 138672, Republic of Singapore
- Niranjan Nagarajan
  - Genome Institute of Singapore (GIS), Agency for Science, Technology and Research (A*STAR), 60 Biopolis Street, Singapore, 138672, Republic of Singapore
  - Yong Loo Lin School of Medicine, National University of Singapore, Singapore, 117597, Republic of Singapore
- Yun William Yu
  - Department of Mathematics, University of Toronto, Toronto, Ontario, M5S 2E4, Canada
  - Ray and Stephanie Lane Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA, 15213, United States
3
Zhu X, Zhao L, Huang L, Yang W, Wang L, Yu R. cgMSI: pathogen detection within species from nanopore metagenomic sequencing data. BMC Bioinformatics 2023; 24:387. [PMID: 37821827 PMCID: PMC10568937 DOI: 10.1186/s12859-023-05512-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/11/2022] [Accepted: 10/02/2023] [Indexed: 10/13/2023] Open
Abstract
BACKGROUND Metagenomic sequencing is an unbiased approach that can potentially detect all the known and unidentified strains in pathogen detection. Recently, nanopore sequencing has been emerging as a highly potential tool for rapid pathogen detection due to its fast turnaround time. However, identifying pathogen within species is nontrivial for nanopore sequencing data due to the high sequencing error rate. RESULTS We developed the core gene alleles metagenome strain identification (cgMSI) tool, which uses a two-stage maximum a posteriori probability estimation method to detect pathogens at strain level from nanopore metagenomic sequencing data at low computational cost. The cgMSI tool can accurately identify strains and estimate relative abundance at 1× coverage. CONCLUSIONS We developed cgMSI for nanopore metagenomic pathogen detection within species. cgMSI is available at https://github.com/ZHU-XU-xmu/cgMSI .
Affiliation(s)
- Xu Zhu
  - School of Informatics, Xiamen University, Xiamen, Fujian, China
- Lili Zhao
  - Women and Children's Hospital, School of Medicine, Xiamen University, Xiamen, Fujian, China
- Lihong Huang
  - Computer Management Center, The First Affiliated Hospital of Xiamen University, Xiamen, Fujian, China
- Liansheng Wang
  - School of Informatics, Xiamen University, Xiamen, Fujian, China
  - National Institute for Data Science in Health and Medicine, Informatics, Xiamen University, Xiamen, Fujian, China
- Rongshan Yu
  - School of Informatics, Xiamen University, Xiamen, Fujian, China
  - National Institute for Data Science in Health and Medicine, Informatics, Xiamen University, Xiamen, Fujian, China
4
Zhao C, Shi ZJ, Pollard KS. Pitfalls of genotyping microbial communities with rapidly growing genome collections. Cell Syst 2023; 14:160-176.e3. [PMID: 36657438 PMCID: PMC9957970 DOI: 10.1016/j.cels.2022.12.007] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/30/2022] [Revised: 10/15/2022] [Accepted: 12/19/2022] [Indexed: 01/20/2023]
Abstract
Detecting genetic variants in metagenomic data is a priority for understanding the evolution, ecology, and functional characteristics of microbial communities. Many tools that perform this metagenotyping rely on aligning reads of unknown origin to a database of sequences from many species before calling variants. In this synthesis, we investigate how databases of increasingly diverse and closely related species have pushed the limits of current alignment algorithms, thereby degrading the performance of metagenotyping tools. We identify multi-mapping reads as a prevalent source of errors and illustrate a trade-off between retaining correct alignments versus limiting incorrect alignments, many of which map reads to the wrong species. Then we evaluate several actionable mitigation strategies and review emerging methods showing promise to further improve metagenotyping in response to the rapid growth in genome collections. Our results have implications beyond metagenotyping to the many tools in microbial genomics that depend upon accurate read mapping.
Affiliation(s)
- Chunyu Zhao
  - Chan Zuckerberg Biohub, San Francisco, CA, USA
  - Gladstone Institute of Data Science and Biotechnology, San Francisco, CA, USA
- Zhou Jason Shi
  - Chan Zuckerberg Biohub, San Francisco, CA, USA
  - Gladstone Institute of Data Science and Biotechnology, San Francisco, CA, USA
- Katherine S Pollard
  - Chan Zuckerberg Biohub, San Francisco, CA, USA
  - Gladstone Institute of Data Science and Biotechnology, San Francisco, CA, USA
  - Department of Epidemiology & Biostatistics, University of California, San Francisco, San Francisco, CA, USA
5
Ma S, Li H. Statistical and Computational Methods for Microbial Strain Analysis. Methods Mol Biol 2023; 2629:231-245. [PMID: 36929080 DOI: 10.1007/978-1-0716-2986-4_11] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/17/2023]
Abstract
Microbial strains are interpreted as a lineage derived from a recent ancestor that have not experienced "too many" recombination events and can be successfully retrieved with culture-independent techniques using metagenomic sequencing. Such a strain variability has been increasingly shown to display additional phenotypic heterogeneities that affect host health, such as virulence, transmissibility, and antibiotics resistance. New statistical and computational methods have recently been developed to track the strains in samples based on shotgun metagenomics data either based on reference genome sequences or Metagenome-assembled genomes (MAGs). In this paper, we review some recent statistical methods for strain identifications based on frequency counts at a set of single nucleotide variants (SNVs) within a set of single-copy marker genes. These methods differ in terms of whether reference genome sequences are needed, how SNVs are called, what methods of deconvolution are used and whether the methods can be applied to multiple samples. We conclude our review with areas that require further research.
Affiliation(s)
- Siyuan Ma
  - Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania, Philadelphia, PA, USA
- Hongzhe Li
  - Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania, Philadelphia, PA, USA
6
A revisit to universal single-copy genes in bacterial genomes. Sci Rep 2022; 12:14550. [PMID: 36008577 PMCID: PMC9411617 DOI: 10.1038/s41598-022-18762-z] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 06/28/2022] [Accepted: 08/18/2022] [Indexed: 11/08/2022] Open
Abstract
Universal single-copy genes (USCGs) are widely used for species classification and taxonomic profiling. Despite many studies on USCGs, our understanding of USCGs in bacterial genomes might be out of date, especially how different the USCGs are in different studies, how well a set of USCGs can distinguish two bacterial species, whether USCGs can separate different strains of a bacterial species, to name a few. To fill the void, we studied USCGs in the most updated complete bacterial genomes. We showed that different USCG sets are quite different while coming from highly similar functional categories. We also found that although USCGs occur once in almost all bacterial genomes, each USCG does occur multiple times in certain genomes. We demonstrated that USCGs are reliable markers to distinguish different species while they cannot distinguish different strains of most bacterial species. Our study sheds new light on the usage and limitations of USCGs, which will facilitate their applications in evolutionary, phylogenomic, and metagenomic studies.
7
Smith BJ, Li X, Shi ZJ, Abate A, Pollard KS. Scalable Microbial Strain Inference in Metagenomic Data Using StrainFacts. Front Bioinform 2022; 2:867386. [PMID: 36304283 PMCID: PMC9580935 DOI: 10.3389/fbinf.2022.867386] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/01/2022] [Accepted: 04/14/2022] [Indexed: 11/25/2022] Open
Abstract
While genome databases are nearing a complete catalog of species commonly inhabiting the human gut, their representation of intraspecific diversity is lacking for all but the most abundant and frequently studied taxa. Statistical deconvolution of allele frequencies from shotgun metagenomic data into strain genotypes and relative abundances is a promising approach, but existing methods are limited by computational scalability. Here we introduce StrainFacts, a method for strain deconvolution that enables inference across tens of thousands of metagenomes. We harness a “fuzzy” genotype approximation that makes the underlying graphical model fully differentiable, unlike existing methods. This allows parameter estimates to be optimized with gradient-based methods, speeding up model fitting by two orders of magnitude. A GPU implementation provides additional scalability. Extensive simulations show that StrainFacts can perform strain inference on thousands of metagenomes and has comparable accuracy to more computationally intensive tools. We further validate our strain inferences using single-cell genomic sequencing from a human stool sample. Applying StrainFacts to a collection of more than 10,000 publicly available human stool metagenomes, we quantify patterns of strain diversity, biogeography, and linkage-disequilibrium that agree with and expand on what is known based on existing reference genomes. StrainFacts paves the way for large-scale biogeography and population genetic studies of microbiomes using metagenomic data.
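The deconvolution idea in the abstract above (allele frequencies explained by strain genotypes and relative abundances) can be sketched as a forward mixing model. This is an illustrative assumption about the general form of such models, not StrainFacts' code: the function name `expected_allele_freqs` and the toy numbers are hypothetical, and genotype values are allowed to be any number in [0, 1] to mirror the "fuzzy" genotype relaxation the abstract mentions.

```python
from typing import List


def expected_allele_freqs(
    abundances: List[List[float]],  # samples x strains, each row sums to 1
    genotypes: List[List[float]],   # strains x sites, values in [0, 1]
) -> List[List[float]]:
    """Expected alternate-allele frequency per sample and site.

    Each frequency is the abundance-weighted average of the strain genotypes.
    """
    n_sites = len(genotypes[0])
    return [
        [sum(a * g[j] for a, g in zip(sample, genotypes)) for j in range(n_sites)]
        for sample in abundances
    ]


# Toy example: two samples, two strains, three variant sites.
abundances = [[0.75, 0.25], [0.5, 0.5]]
genotypes = [[1, 0, 1], [0, 0, 1]]
freqs = expected_allele_freqs(abundances, genotypes)
print(freqs)
```

Strain inference runs this model in reverse: given observed allele counts, it searches for abundances and genotypes whose expected frequencies fit the data. Keeping genotypes continuous is what would make such a fit amenable to gradient-based optimization.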
Affiliation(s)
- Byron J. Smith
  - The Gladstone Institute of Data Science and Biotechnology, San Francisco, CA, United States
  - Department of Epidemiology and Biostatistics, University of California, San Francisco, San Francisco, CA, United States
- Xiangpeng Li
  - Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, San Francisco, CA, United States
- Zhou Jason Shi
  - The Gladstone Institute of Data Science and Biotechnology, San Francisco, CA, United States
  - Chan-Zuckerberg Biohub, San Francisco, CA, United States
- Adam Abate
  - Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, San Francisco, CA, United States
  - Chan-Zuckerberg Biohub, San Francisco, CA, United States
- Katherine S. Pollard
  - The Gladstone Institute of Data Science and Biotechnology, San Francisco, CA, United States
  - Department of Epidemiology and Biostatistics, University of California, San Francisco, San Francisco, CA, United States
  - Chan-Zuckerberg Biohub, San Francisco, CA, United States
  - *Correspondence: Katherine S. Pollard,
8
Strain identification and quantitative analysis in microbial communities. J Mol Biol 2022; 434:167582. [DOI: 10.1016/j.jmb.2022.167582] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 02/01/2022] [Revised: 03/31/2022] [Accepted: 04/03/2022] [Indexed: 12/14/2022]
9
Ventolero MF, Wang S, Hu H, Li X. Computational analyses of bacterial strains from shotgun reads. Brief Bioinform 2022; 23:6524011. [PMID: 35136954 DOI: 10.1093/bib/bbac013] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 11/04/2021] [Revised: 01/10/2022] [Accepted: 01/11/2022] [Indexed: 12/21/2022] Open
Abstract
Shotgun sequencing is routinely employed to study bacteria in microbial communities. With the vast amount of shotgun sequencing reads generated in a metagenomic project, it is crucial to determine the microbial composition at the strain level. This study investigated 20 computational tools that attempt to infer bacterial strain genomes from shotgun reads. For the first time, we discussed the methodology behind these tools. We also systematically evaluated six novel-strain-targeting tools on the same datasets and found that BHap, mixtureS and StrainFinder performed better than other tools. Because the performance of the best tools is still suboptimal, we discussed future directions that may address the limitations.
Affiliation(s)
- Saidi Wang
  - Department of Computer Science, University of Central Florida, Orlando, FL 32816, USA
- Haiyan Hu
  - Department of Computer Science, University of Central Florida, Orlando, FL 32816, USA
  - Genomics and Bioinformatics Cluster, University of Central Florida, Orlando, FL 32816, USA
- Xiaoman Li
  - Burnett School of Biomedical Science, University of Central Florida, Orlando, FL 32816, USA
10
Reconstruction of evolving gene variants and fitness from short sequencing reads. Nat Chem Biol 2021; 17:1188-1198. [PMID: 34635842 PMCID: PMC8551035 DOI: 10.1038/s41589-021-00876-6] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 11/25/2020] [Accepted: 08/09/2021] [Indexed: 12/23/2022]
Abstract
Directed evolution can generate proteins with tailor-made activities. However, full-length genotypes, their frequencies and fitnesses are difficult to measure for evolving gene-length biomolecules using most high-throughput DNA sequencing methods, as short read lengths can lose mutation linkages in haplotypes. Here we present Evoracle, a machine learning method that accurately reconstructs full-length genotypes (R² = 0.94) and fitness using short-read data from directed evolution experiments, with substantial improvements over related methods. We validate Evoracle on phage-assisted continuous evolution (PACE) and phage-assisted non-continuous evolution (PANCE) of adenine base editors and OrthoRep evolution of drug-resistant enzymes. Evoracle retains strong performance (R² = 0.86) on data with complete linkage loss between neighboring nucleotides and large measurement noise, such as pooled Sanger sequencing data (~US$10 per timepoint), and broadens the accessibility of training machine learning models on gene variant fitnesses. Evoracle can also identify high-fitness variants, including low-frequency 'rising stars', well before they are identifiable from consensus mutations.
11
Multilocus sequence analysis reveals genetic diversity in Staphylococcus aureus isolate of goat with mastitis persistent after treatment with enrofloxacin. Sci Rep 2021; 11:17252. [PMID: 34446803 PMCID: PMC8390490 DOI: 10.1038/s41598-021-96764-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/25/2021] [Accepted: 07/28/2021] [Indexed: 02/07/2023] Open
Abstract
Staphylococcus aureus is one of the main bacterial agents responsible for cases of mastitis in ruminants, playing an important role in the persistence and chronicity of diseases treated with antimicrobials. Using the multilocus sequence typing technique, network approaches and study of the population diversity of microorganisms, we performed analyses of S. aureus (ES-GPM) isolated from goats with persistent mastitis (GPM). Most ES-GPM strains were phylogenetically distinct from the others, and the strains could be divided into two lineages: one with a majority belonging to ES-GPM and the other to varied strains. These two lineages were separated by 27 nuclear polymorphisms. The 43 strains comprised 22 clonal complexes (CCs), of which the ES-GPM strains were present in CC133, CC5 and a new complex formed by the sequence type 4966. Some alleles showed greater diversity and polymorphism than others; for example, the aroE and yqiL genes showed less than the glpF gene. In addition, the ES-GPM sequences of the arc gene and glpF alleles showed the greatest number of mutations relative to non-ES-GPM. Therefore, this study identified genetic polymorphisms characteristic of S. aureus isolated from milk of goats diagnosed with persistent mastitis after failed treatment with the antibiotic enrofloxacin. This study may help in the future to identify and discriminate this agent in cases of mastitis, so that the most appropriate antibiotic treatment can be given before persistent mastitis caused by the agent appears, reducing the chances of premature culling and animal suffering.
12
Li X, Hu H, Li X. mixtureS: a novel tool for bacterial strain genome reconstruction from reads. Bioinformatics 2021; 37:575-577. [PMID: 32805048 PMCID: PMC8599889 DOI: 10.1093/bioinformatics/btaa728] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 05/18/2020] [Revised: 07/14/2020] [Accepted: 08/10/2020] [Indexed: 01/08/2023] Open
Abstract
MOTIVATION It is essential to study bacterial strains in environmental samples. Existing methods and tools often depend on known strains or known variations, cannot work on individual samples, not reliable, or not easy to use, etc. It is thus important to develop more user-friendly tools that can identify bacterial strains more accurately. RESULTS We developed a new tool called mixtureS that can de novo identify bacterial strains from shotgun reads of a clonal or metagenomic sample, without prior knowledge about the strains and their variations. Tested on 243 simulated datasets and 195 experimental datasets, mixtureS reliably identified the strains, their numbers and their abundance. Compared with three tools, mixtureS showed better performance in almost all simulated datasets and the vast majority of experimental datasets. AVAILABILITY AND IMPLEMENTATION The source code and tool mixtureS is available at http://www.cs.ucf.edu/~xiaoman/mixtureS/. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Xin Li
  - Department of Computer Science
- Xiaoman Li
  - Burnett School of Biomedical Science, College of Medicine, University of Central Florida, Orlando, FL 32816, USA
13
Abstract
Cystic fibrosis patients frequently suffer from recurring respiratory infections caused by colonizing pathogenic and commensal bacteria. Although modern therapies can sometimes alleviate respiratory symptoms by ameliorating residual function of the protein responsible for the disorder, management of chronic respiratory infections remains an issue. In cystic fibrosis, dynamic and complex communities of microbial pathogens and commensals can colonize the lung. Cultured isolates from lung sputum reveal high inter- and intraindividual variability in pathogen strains, sequence variants, and phenotypes; disease progression likely depends on the precise combination of infecting lineages. Routine clinical protocols, however, provide a limited overview of the colonizer populations. Therefore, a more comprehensive and precise identification and characterization of infecting lineages could assist in making corresponding decisions on treatment. Here, we describe longitudinal tracking for four cystic fibrosis patients who exhibited extreme clinical phenotypes and, thus, were selected from a pilot cohort of 11 patients with repeated sampling for more than a year. Following metagenomics sequencing of lung sputum, we find that the taxonomic identity of individual colonizer lineages can be easily established. Crucially, even superficially clonal pathogens can be subdivided into multiple sublineages at the sequence level. By tracking individual allelic differences over time, an assembly-free clustering approach allows us to reconstruct multiple lineage-specific genomes with clear structural differences. Our study showcases a culture-independent shotgun metagenomics approach for longitudinal tracking of sublineage pathogen dynamics, opening up the possibility of using such methods to assist in monitoring disease progression through providing high-resolution routine characterization of the cystic fibrosis lung microbiome.
14
Cao C, He J, Mak L, Perera D, Kwok D, Wang J, Li M, Mourier T, Gavriliuc S, Greenberg M, Morrissy AS, Sycuro LK, Yang G, Jeffares DC, Long Q. Reconstruction of Microbial Haplotypes by Integration of Statistical and Physical Linkage in Scaffolding. Mol Biol Evol 2021; 38:2660-2672. [PMID: 33547786 PMCID: PMC8136496 DOI: 10.1093/molbev/msab037] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Indexed: 12/30/2022] Open
Abstract
DNA sequencing technologies provide unprecedented opportunities to analyze within-host evolution of microorganism populations. Often, within-host populations are analyzed via pooled sequencing of the population, which contains multiple individuals or "haplotypes." However, current next-generation sequencing instruments, in conjunction with single-molecule barcoded linked-reads, cannot distinguish long haplotypes directly. Computational reconstruction of haplotypes from pooled sequencing has been attempted in virology, bacterial genomics, metagenomics, and human genetics, using algorithms based on either cross-host genetic sharing or within-host genomic reads. Here, we describe PoolHapX, a flexible computational approach that integrates information from both genetic sharing and genomic sequencing. We demonstrated that PoolHapX outperforms state-of-the-art tools tailored to specific organismal systems, and is robust to within-host evolution. Importantly, together with barcoded linked-reads, PoolHapX can infer whole-chromosome-scale haplotypes from 50 pools each containing 12 different haplotypes. By analyzing real data, we uncovered dynamic variations in the evolutionary processes of within-patient HIV populations previously unobserved in single position-based analysis.
Affiliation(s)
- Chen Cao
  - Department of Biochemistry & Molecular Biology, Alberta Children’s Hospital Research Institute, University of Calgary, Calgary, AB, Canada
- Jingni He
  - Department of Biochemistry & Molecular Biology, Alberta Children’s Hospital Research Institute, University of Calgary, Calgary, AB, Canada
  - Department of Cardiology, Xiangya Hospital, Central South University, Changsha, China
- Lauren Mak
  - Department of Biochemistry & Molecular Biology, Alberta Children’s Hospital Research Institute, University of Calgary, Calgary, AB, Canada
  - Present address: Tri-Institutional Computational Biology & Medicine Program, Weill Cornell Medicine of Cornell University, New York, NY, USA
- Deshan Perera
  - Department of Biochemistry & Molecular Biology, Alberta Children’s Hospital Research Institute, University of Calgary, Calgary, AB, Canada
- Devin Kwok
  - Department of Mathematics & Statistics, University of Calgary, Calgary, AB, Canada
- Jia Wang
  - Electrical and Computer Engineering, Illinois Institute of Technology, Chicago, IL, USA
- Minghao Li
  - Department of Biochemistry & Molecular Biology, Alberta Children’s Hospital Research Institute, University of Calgary, Calgary, AB, Canada
- Tobias Mourier
  - Pathogen Genomics Laboratory, Biological and Environmental Sciences and Engineering (BESE) Division, King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia
- Stefan Gavriliuc
  - Department of Biochemistry & Molecular Biology, Alberta Children’s Hospital Research Institute, University of Calgary, Calgary, AB, Canada
- Matthew Greenberg
  - Department of Mathematics & Statistics, University of Calgary, Calgary, AB, Canada
- A Sorana Morrissy
  - Department of Biochemistry & Molecular Biology, Alberta Children’s Hospital Research Institute, University of Calgary, Calgary, AB, Canada
- Laura K Sycuro
  - Department of Biochemistry & Molecular Biology, Alberta Children’s Hospital Research Institute, University of Calgary, Calgary, AB, Canada
  - Department of Microbiology, Immunology, and Infectious Diseases, Snyder Institute for Chronic Diseases, University of Calgary, Calgary, AB, Canada
- Guang Yang
  - Department of Biochemistry & Molecular Biology, Alberta Children’s Hospital Research Institute, University of Calgary, Calgary, AB, Canada
  - Department of Medical Genetics, University of Calgary, Calgary, AB, Canada
- Daniel C Jeffares
  - Department of Biology, York Biomedical Research Institute, University of York, York, United Kingdom
- Quan Long
  - Department of Biochemistry & Molecular Biology, Alberta Children’s Hospital Research Institute, University of Calgary, Calgary, AB, Canada
  - Department of Mathematics & Statistics, University of Calgary, Calgary, AB, Canada
  - Department of Medical Genetics, University of Calgary, Calgary, AB, Canada
  - Hotchkiss Brain Institute, O’Brien Institute for Public Health, University of Calgary, Calgary, AB, Canada
  - Corresponding author: E-mail: