1. Marin MG, Quinones-Olvera N, Wippel C, Behruznia M, Jeffrey BM, Harris M, Mann BC, Rosenthal A, Jacobson KR, Warren RM, Li H, Meehan CJ, Farhat MR. Pitfalls of bacterial pan-genome analysis approaches: a case study of Mycobacterium tuberculosis and two less clonal bacterial species. Bioinformatics 2025; 41:btaf219. [PMID: 40341387 PMCID: PMC12119186 DOI: 10.1093/bioinformatics/btaf219] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/22/2024] [Revised: 12/31/2024] [Accepted: 05/07/2025] [Indexed: 05/10/2025]
Abstract
SUMMARY: Pan-genome analysis is a fundamental tool for studying bacterial genome evolution; however, the variety in methods used to define and measure the pan-genome poses challenges to the interpretation and reliability of results. Using Mycobacterium tuberculosis, a clonally evolving bacterium with a small accessory genome, as a model system, we systematically evaluated sources of variability in pan-genome estimates. Our analysis revealed that differences in assembly type (short-read versus hybrid), annotation pipeline, and pan-genome software significantly impact predictions of core and accessory genome size. Extending our analysis to two additional bacterial species, Escherichia coli and Staphylococcus aureus, we observed consistent tool-dependent biases but species-specific patterns in pan-genome variability. Our findings highlight the importance of integrating nucleotide- and protein-level analyses to improve the reliability and reproducibility of pan-genome studies across diverse bacterial populations.
AVAILABILITY AND IMPLEMENTATION: Panqc is freely available under an MIT license at https://github.com/maxgmarin/panqc.
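To make the sensitivity of core and accessory genome size to definitions concrete, the following is a minimal sketch of how gene families might be partitioned from a presence/absence table. It is not taken from Panqc or from any tool evaluated in the paper: the function name `partition_pan_genome`, the `core_fraction` parameter, and the toy data are all hypothetical, and the abstract does not state which core threshold the authors used.

```python
from typing import Dict, Set, Tuple


def partition_pan_genome(
    presence: Dict[str, Set[str]],
    n_genomes: int,
    core_fraction: float = 1.0,
) -> Tuple[Set[str], Set[str]]:
    """Split gene families into core and accessory sets.

    presence maps a gene family name to the set of genome IDs carrying it.
    A family is called core when it occurs in at least core_fraction of the
    genomes (an assumed, user-chosen cutoff); everything else is accessory.
    """
    core: Set[str] = set()
    accessory: Set[str] = set()
    for gene, genomes in presence.items():
        if len(genomes) / n_genomes >= core_fraction:
            core.add(gene)
        else:
            accessory.add(gene)
    return core, accessory


# Hypothetical toy data: four genomes (g1..g4) and four gene families.
presence = {
    "geneA": {"g1", "g2", "g3", "g4"},
    "geneB": {"g1", "g2", "g3"},
    "geneC": {"g1"},
    "geneD": {"g1", "g2", "g3", "g4"},
}

# Strict definition: core means present in every genome.
strict_core, strict_accessory = partition_pan_genome(presence, 4, 1.0)

# Relaxed definition: core means present in at least 75% of genomes.
relaxed_core, relaxed_accessory = partition_pan_genome(presence, 4, 0.75)
```

With the same input, relaxing the cutoff moves `geneB` from the accessory set to the core set. This illustrates only the threshold choice; the assembly, annotation, and software effects described in the abstract act earlier, on the presence/absence table itself.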
Affiliation(s)
- Maximillian G Marin
  - Department of Biomedical Informatics, Harvard Medical School, Boston, MA 02115, United States
- Natalia Quinones-Olvera
  - Department of Biomedical Informatics, Harvard Medical School, Boston, MA 02115, United States
- Christoph Wippel
  - Department of Biomedical Informatics, Harvard Medical School, Boston, MA 02115, United States
- Mahboobeh Behruznia
  - Department of Biosciences, Nottingham Trent University, Nottingham, NG1 4FQ, United Kingdom
- Brendan M Jeffrey
  - Office of Cyber Infrastructure and Computational Biology, National Institute of Allergy and Infectious Diseases, National Institutes of Health, Bethesda, 20892, United States
- Michael Harris
  - Office of Cyber Infrastructure and Computational Biology, National Institute of Allergy and Infectious Diseases, National Institutes of Health, Bethesda, 20892, United States
- Brendon C Mann
  - Centre of Excellence for Biomedical Tuberculosis Research, South African Medical Research Council Centre for Tuberculosis Research, Stellenbosch University, Stellenbosch, Western Cape, 7602, South Africa
- Alex Rosenthal
  - Office of Cyber Infrastructure and Computational Biology, National Institute of Allergy and Infectious Diseases, National Institutes of Health, Bethesda, 20892, United States
- Karen R Jacobson
  - Division of Infectious Diseases, Chobanian & Avedisian School of Medicine, Boston University, Boston, MA 02118, United States
- Robin M Warren
  - Centre of Excellence for Biomedical Tuberculosis Research, South African Medical Research Council Centre for Tuberculosis Research, Stellenbosch University, Stellenbosch, Western Cape, 7602, South Africa
- Heng Li
  - Department of Biomedical Informatics, Harvard Medical School, Boston, MA 02115, United States
  - Department of Data Science, Dana-Farber Cancer Institute, Boston, MA 02215, United States
  - Broad Institute of Harvard and MIT, Cambridge, MA 02142, United States
- Conor J Meehan
  - Department of Biosciences, Nottingham Trent University, Nottingham, NG1 4FQ, United Kingdom
  - Unit of Mycobacteriology, Institute of Tropical Medicine, Antwerp, 2000, Belgium
- Maha R Farhat
  - Department of Biomedical Informatics, Harvard Medical School, Boston, MA 02115, United States
  - Pulmonary and Critical Care Medicine, Massachusetts General Hospital, Boston, MA 02114, United States
2. Espinoza ME, Swing AM, Elghraoui A, Modlin SJ, Valafar F. Interred mechanisms of resistance and host immune evasion revealed through network-connectivity analysis of M. tuberculosis complex graph pangenome. mSystems 2025; 10:e0049924. [PMID: 40261029 PMCID: PMC12013269 DOI: 10.1128/msystems.00499-24] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/05/2024] [Accepted: 12/16/2024] [Indexed: 04/24/2025]
Abstract
Mycobacterium tuberculosis complex successfully adapts to environmental pressures through mechanisms of rapid adaptation which remain poorly understood despite knowledge gained through decades of research. In this study, we used 110 reference-quality, complete de novo assembled, long-read sequenced clinical genomes to study patterns of structural adaptation through a graph-based pangenome analysis, elucidating rarely studied mechanisms that enable enhanced clinical phenotypes offering a novel perspective to the species' adaptation. Across isolates, we identified a pangenome of 4,325 genes (3,767 core and 558 accessory), revealing 290 novel genes, and a substantially more complete account of difficult-to-sequence esx/pe/pgrs/ppe genes. Seventy-four percent of core genes were deemed non-essential in vitro, 38% of which support the pathogen's survival in vivo, suggesting a need to broaden current perspectives on essentiality. Through information-theoretic analysis, we reveal the ppe genes that contribute most to the species' diversity, several with known consequences for antigenic variation and immune evasion. Construction of a graph pangenome revealed topological variations that implicate genes known to modulate host immunity (Rv0071-73, Rv2817c, cas2), defense against phages/viruses (cas2, csm6, and Rv2817c-2821c), and others associated with host tissue colonization. Here, the prominent trehalose transport pathway stands out for its involvement in caseous granuloma catabolism and the development of post-primary disease. We show paralogous duplications of genes implicated in bedaquiline (mmpL5 in all L1 isolates) and ethambutol (embC-A) resistance, with a paralogous duplication of its regulator (embR) in 96 isolates. We provide hypotheses for novel mechanisms of immune evasion and antibiotic resistance through gene dosing that can escape detection by molecular diagnostics.
IMPORTANCE: M. tuberculosis complex (MTBC) has killed over a billion people in the past 200 years alone and continues to kill nearly 1.5 million annually. The pathogen has a versatile ability to diversify under immune and drug pressure and survive, even becoming antibiotic persistent or resistant in the face of harsh chemotherapy. For proper diagnosis and design of an appropriate treatment regimen, a full understanding of this diversification and its clinical consequences is desperately needed. A mechanism of diversification that is rarely studied systematically is MTBC's ability to structurally change its genome. In this article, we have de novo assembled 110 clinical genomes (the largest de novo assembled set to date) and performed a pangenomic analysis. Our pangenome provides structural variation-based hypotheses for novel mechanisms of immune evasion and antibiotic resistance through gene dosing that can compromise molecular diagnostics and lead to further emergence of antibiotic resistance.
Affiliation(s)
- Monica E. Espinoza
  - Laboratory for Pathogenesis of Clinical Drug Resistance and Persistence, San Diego State University, San Diego, California, USA
- Ashley M. Swing
  - Laboratory for Pathogenesis of Clinical Drug Resistance and Persistence, San Diego State University, San Diego, California, USA
  - San Diego State University/University of California, San Diego | Joint Doctoral Program in Public Health (Global Health), San Diego, California, USA
- Afif Elghraoui
  - Laboratory for Pathogenesis of Clinical Drug Resistance and Persistence, San Diego State University, San Diego, California, USA
  - Department of Electrical and Computer Engineering, San Diego State University, San Diego, California, USA
  - Department of Electrical and Computer Engineering, University of California San Diego, San Diego, California, USA
- Samuel J. Modlin
  - Laboratory for Pathogenesis of Clinical Drug Resistance and Persistence, San Diego State University, San Diego, California, USA
- Faramarz Valafar
  - Laboratory for Pathogenesis of Clinical Drug Resistance and Persistence, San Diego State University, San Diego, California, USA
3. Silva-Pereira TT, Soler-Camargo NC, Guimarães AMS. Diversification of gene content in the Mycobacterium tuberculosis complex is determined by phylogenetic and ecological signatures. Microbiol Spectr 2024; 12:e0228923. [PMID: 38230932 PMCID: PMC10871547 DOI: 10.1128/spectrum.02289-23] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Received: 06/12/2023] [Accepted: 12/19/2023] [Indexed: 01/18/2024]
Abstract
We analyzed the pan-genome and gene content modulation of the most diverse genome data set of the Mycobacterium tuberculosis complex (MTBC) gathered to date. The closed pan-genome of the MTBC was characterized by reduced accessory and strain-specific genomes, compatible with its clonal nature. However, significantly fewer gene families were shared between MTBC genomes as their phylogenetic distance increased. This effect was only observed in inter-species comparisons, not within-species, which suggests that species-specific ecological characteristics are associated with changes in gene content. Gene loss, resulting from genomic deletions and pseudogenization, was found to drive the variation in gene content. This gene erosion differed among MTBC species and lineages, even within M. tuberculosis, where L2 showed more gene loss than L4. We also show that phylogenetic proximity is not always a good proxy for gene content relatedness in the MTBC, as the gene repertoire of Mycobacterium africanum L6 deviated from its expected phylogenetic niche conservatism. Gene disruptions of virulence factors, represented by pseudogene annotations, are mostly not conserved, being poor predictors of MTBC ecotypes. Each MTBC ecotype carries its own accessory genome, likely influenced by distinct selective pressures such as host and geography. It is important to investigate how gene loss confers new adaptive traits to MTBC strains; the detected heterogeneous gene loss poses a significant challenge in elucidating genetic factors responsible for the diverse phenotypes observed in the MTBC. By detailing specific gene losses, our study serves as a resource for researchers studying the MTBC phenotypes and their immune evasion strategies.
IMPORTANCE: In this study, we analyzed the gene content of different ecotypes of the Mycobacterium tuberculosis complex (MTBC), the pathogens of tuberculosis. We found that changes in their gene content are associated with their ecological features, such as host preference. Gene loss was identified as the primary driver of these changes, which can vary even among different strains of the same ecotype. Our study also revealed that the gene content relatedness of these bacteria does not always mirror their evolutionary relationships. In addition, some genes of virulence can be variably lost among strains of the same MTBC ecotype, likely helping them to evade the immune system. Overall, our study highlights the importance of understanding how gene loss can lead to new adaptations in these bacteria and how different selective pressures may influence their genetic makeup.
Affiliation(s)
- Taiana Tainá Silva-Pereira
  - Laboratory of Applied Research in Mycobacteria, Department of Microbiology, Institute of Biomedical Sciences, University of São Paulo, São Paulo, Brazil
- Naila Cristina Soler-Camargo
  - Laboratory of Applied Research in Mycobacteria, Department of Microbiology, Institute of Biomedical Sciences, University of São Paulo, São Paulo, Brazil
  - Department of Preventive Veterinary Medicine and Animal Health, School of Veterinary Medicine and Animal Sciences, University of São Paulo, São Paulo, Brazil
- Ana Marcia Sá Guimarães
  - Laboratory of Applied Research in Mycobacteria, Department of Microbiology, Institute of Biomedical Sciences, University of São Paulo, São Paulo, Brazil
4. Ardern Z. Alternative Reading Frames are an Underappreciated Source of Protein Sequence Novelty. J Mol Evol 2023; 91:570-580. [PMID: 37326679 DOI: 10.1007/s00239-023-10122-3] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Received: 12/19/2022] [Accepted: 05/31/2023] [Indexed: 06/17/2023]
Abstract
Protein-coding DNA sequences can be translated into completely different amino acid sequences if the nucleotide triplets used are shifted by a non-triplet amount on the same DNA strand or by translating codons from the opposite strand. Such "alternative reading frames" of protein-coding genes are a major contributor to the evolution of novel protein products. Recent studies demonstrating this include examples across the three domains of cellular life and in viruses. These sequences increase the number of trials potentially available for the evolutionary invention of new genes and also have unusual properties which may facilitate gene origin. There is evidence that the structure of the standard genetic code contributes to the features and gene-likeness of some alternative frame sequences. These findings have important implications across diverse areas of molecular biology, including for genome annotation, structural biology, and evolutionary genomics.
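The frame-shifting idea in the first sentence of this abstract can be sketched in a few lines of Python. This is an illustrative example only, not code from the article: the function names, the `+1` to `-3` frame labels, and the six-base input are assumptions chosen for the demonstration, and the codon table is the standard genetic code written in the usual TCAG ordering.

```python
# Standard genetic code, bases ordered T, C, A, G at each codon position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the opposite strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna: str, offset: int) -> str:
    """Translate complete codons starting at the given offset (0, 1, or 2)."""
    codons = (dna[i:i + 3] for i in range(offset, len(dna) - 2, 3))
    return "".join(CODON_TABLE[codon] for codon in codons)


def six_frame_translations(dna: str) -> dict:
    """Translate all three frames on each strand of an uppercase ACGT string."""
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna, offset)
        frames[f"-{offset + 1}"] = translate(reverse_complement(dna), offset)
    return frames


# Hypothetical input: the same six bases give a different peptide in each frame.
frames = six_frame_translations("ATGGCC")
```

For this toy input the annotated frame (`+1`) reads `MA`, while a one-base shift on the same strand reads `W` and the opposite strand reads `GH`, which is the sense in which one stretch of DNA can encode entirely different amino acid sequences.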