1
Choi Y, De Ridder D, Greub G. Genomic and spatial epidemiology: lessons learned from SARS-CoV-2 pandemic. Curr Opin HIV AIDS 2025; 20:287-293. [PMID: 40172549 PMCID: PMC11970598 DOI: 10.1097/coh.0000000000000936] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/04/2025]
Abstract
PURPOSE OF REVIEW The SARS-CoV-2 pandemic presented unprecedented challenges, particularly in understanding its complex spatial transmission patterns. The high transmissibility of the virus led to frequent super-spreading events. These events demonstrated clear spatial clustering patterns, often tied to specific events that facilitated transmission. The uneven geographic distribution of medical resources and varying access to care amplified the impact of SARS-CoV-2. Asymptomatic cases further complicated the situation, as infected individuals could silently spread the virus before being identified. Thus, this review examines how genomic and spatial epidemiology approaches can be integrated to answer some of the above-mentioned challenges. We first describe the methodological foundations of genomics and spatial epidemiology, detailing opportunities of their applications during the SARS-CoV-2 pandemic. We then present a novel interdisciplinary framework that combines these approaches to better guide public health interventions.
RECENT FINDINGS During the pandemic, the genomic and spatial approaches were used to address key questions, including "how does the pathogen evolve and diversify?" and "how does the pathogen spread geographically?". Genomic epidemiology allows researchers to identify viral lineages and new variants. Conversely, spatial epidemiology focused on geographic distribution of infections, analyzing how the virus spread. However, despite their complementary nature, these approaches were largely applied independently during the pandemic. This separation limited our collective ability to fully understand the complex relationships between viral evolution and geographic spread.
SUMMARY While phylogeography has traditionally combined phylogenetic and geographic data to understand long-term evolutionary patterns across large areas, events such as the recent SARS-CoV-2 pandemic demand frameworks that can inform public health interventions through joint analysis of genomic and local-scale spatial data.
Affiliation(s)
- Yangji Choi
  - Institute of Microbiology, Lausanne University Hospital and University of Lausanne
- David De Ridder
  - Group of Geospatial Molecular Epidemiology (GEOME), Laboratory for Biological Geochemistry (LGB), School of Architecture, Civil and Environmental Engineering (ENAC), École Polytechnique Fédérale de Lausanne (EPFL), Lausanne
  - Group of Geographic Information Research and Analysis in Population Health (GIRAPH)
  - Faculty of Medicine, University of Geneva (UNIGE)
  - Division and Department of Primary Care Medicine, Geneva University Hospitals, Geneva
- Gilbert Greub
  - Institute of Microbiology, Lausanne University Hospital and University of Lausanne
  - Service of Infectious Diseases, Lausanne University Hospital and University of Lausanne, Lausanne, Switzerland
2
Wirth T, Kumar KR, Zech M. Long-Read Sequencing: The Third Generation of Diagnostic Testing for Dystonia. Mov Disord 2025. [PMID: 40265723 DOI: 10.1002/mds.30208] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/23/2024] [Revised: 03/14/2025] [Accepted: 04/02/2025] [Indexed: 04/24/2025] Open
Abstract
Long-read sequencing methodologies provide powerful capacity to identify all types of genomic variations in a single test. Long-read platforms such as Oxford Nanopore and PacBio have the potential to revolutionize molecular diagnostics by reaching unparalleled accuracies in genetic discovery and long-range phasing. In the field of dystonia, promising results have come from recent pilot studies showing improved detection of disease-causing structural variants and repeat expansions. Increases in throughput and ongoing reductions in cost will facilitate the incorporation of long-read approaches into mainstream diagnostic practice. Although these developments are likely to transform clinical care, there is currently a discrepancy between the potential benefits of long-read sequencing and the application of this technique to dystonia. In this review we highlight current opportunities and limitations of adopting long-read sequencing methods for the investigation of patients with dystonia. We provide examples of long-read sequencing integration into diagnostic evaluation and the study of pathomechanisms in individuals with dystonic disorders. The goal of this article is to stimulate research into the application and optimization of long-read analysis strategies in dystonia, thus enabling more precise understanding of the underlying etiology in the future. © 2025 The Author(s). Movement Disorders published by Wiley Periodicals LLC on behalf of International Parkinson and Movement Disorder Society.
Affiliation(s)
- Thomas Wirth
  - Neurology Department, Strasbourg University Hospital, Strasbourg, France
  - Institute of Genetics and of Molecular and Cellular Biology (IGBMC), INSERM-U964/CNRS-UMR7104/Strasbourg University, Illkirch-Graffenstaden, France
  - Strasbourg Translational Medicine Federation (FMTS), Strasbourg University, Strasbourg, France
- Kishore R Kumar
  - Translational Neurogenomics Group, Genomics and Inherited Disease Program, Garvan Institute of Medical Research, Darlinghurst, New South Wales, Australia
  - Faculty of Medicine and Health, University of Sydney, Sydney, New South Wales, Australia
  - Department of Neurology and Molecular Medicine Laboratory, Concord Repatriation General Hospital, Concord, New South Wales, Australia
  - School of Clinical Medicine, UNSW Medicine & Health, UNSW Sydney, Sydney, New South Wales, Australia
- Michael Zech
  - Institute of Human Genetics, Technical University of Munich, School of Medicine and Health, Munich, Germany
  - Institute of Neurogenomics, Helmholtz Munich, Neuherberg, Germany
  - Institute for Advanced Study, Technical University of Munich, Garching, Germany
3
Espinoza ME, Swing AM, Elghraoui A, Modlin SJ, Valafar F. Interred mechanisms of resistance and host immune evasion revealed through network-connectivity analysis of M. tuberculosis complex graph pangenome. mSystems 2025; 10:e0049924. [PMID: 40261029 PMCID: PMC12013269 DOI: 10.1128/msystems.00499-24] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/05/2024] [Accepted: 12/16/2024] [Indexed: 04/24/2025] Open
Abstract
Mycobacterium tuberculosis complex successfully adapts to environmental pressures through mechanisms of rapid adaptation which remain poorly understood despite knowledge gained through decades of research. In this study, we used 110 reference-quality, complete de novo assembled, long-read sequenced clinical genomes to study patterns of structural adaptation through a graph-based pangenome analysis, elucidating rarely studied mechanisms that enable enhanced clinical phenotypes offering a novel perspective to the species' adaptation. Across isolates, we identified a pangenome of 4,325 genes (3,767 core and 558 accessory), revealing 290 novel genes, and a substantially more complete account of difficult-to-sequence esx/pe/pgrs/ppe genes. Seventy-four percent of core genes were deemed non-essential in vitro, 38% of which support the pathogen's survival in vivo, suggesting a need to broaden current perspectives on essentiality. Through information-theoretic analysis, we reveal the ppe genes that contribute most to the species' diversity, several with known consequences for antigenic variation and immune evasion. Construction of a graph pangenome revealed topological variations that implicate genes known to modulate host immunity (Rv0071-73, Rv2817c, cas2), defense against phages/viruses (cas2, csm6, and Rv2817c-2821c), and others associated with host tissue colonization. Here, the prominent trehalose transport pathway stands out for its involvement in caseous granuloma catabolism and the development of post-primary disease. We show paralogous duplications of genes implicated in bedaquiline (mmpL5 in all L1 isolates) and ethambutol (embC-A) resistance, with a paralogous duplication of its regulator (embR) in 96 isolates. We provide hypotheses for novel mechanisms of immune evasion and antibiotic resistance through gene dosing that can escape detection by molecular diagnostics.
IMPORTANCE M. tuberculosis complex (MTBC) has killed over a billion people in the past 200 years alone and continues to kill nearly 1.5 million annually. The pathogen has a versatile ability to diversify under immune and drug pressure and survive, even becoming antibiotic persistent or resistant in the face of harsh chemotherapy. For proper diagnosis and design of an appropriate treatment regimen, a full understanding of this diversification and its clinical consequences is desperately needed. A mechanism of diversification that is rarely studied systematically is MTBC's ability to structurally change its genome. In this article, we have de novo assembled 110 clinical genomes (the largest de novo assembled set to date) and performed a pangenomic analysis. Our pangenome provides structural variation-based hypotheses for novel mechanisms of immune evasion and antibiotic resistance through gene dosing that can compromise molecular diagnostics and lead to further emergence of antibiotic resistance.
Affiliation(s)
- Monica E. Espinoza
  - Laboratory for Pathogenesis of Clinical Drug Resistance and Persistence, San Diego State University, San Diego, California, USA
- Ashley M. Swing
  - Laboratory for Pathogenesis of Clinical Drug Resistance and Persistence, San Diego State University, San Diego, California, USA
  - San Diego State University/University of California, San Diego | Joint Doctoral Program in Public Health (Global Health), San Diego, California, USA
- Afif Elghraoui
  - Laboratory for Pathogenesis of Clinical Drug Resistance and Persistence, San Diego State University, San Diego, California, USA
  - Department of Electrical and Computer Engineering, San Diego State University, San Diego, California, USA
  - Department of Electrical and Computer Engineering, University of California San Diego, San Diego, California, USA
- Samuel J. Modlin
  - Laboratory for Pathogenesis of Clinical Drug Resistance and Persistence, San Diego State University, San Diego, California, USA
- Faramarz Valafar
  - Laboratory for Pathogenesis of Clinical Drug Resistance and Persistence, San Diego State University, San Diego, California, USA
4
Schell T, Greve C, Podsiadlowski L. Establishing genome sequencing and assembly for non-model and emerging model organisms: a brief guide. Front Zool 2025; 22:7. [PMID: 40247279 PMCID: PMC12004614 DOI: 10.1186/s12983-025-00561-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/10/2024] [Accepted: 03/23/2025] [Indexed: 04/19/2025] Open
Abstract
Reference genome assemblies are the basis for comprehensive genomic analyses and comparisons. Due to declining sequencing costs and growing computational power, genome projects are now feasible in smaller labs. De novo genome sequencing for non-model or emerging model organisms requires knowledge about genome size and techniques for extracting high molecular weight DNA. Next to quality, the amount of DNA obtained from single individuals is crucial, especially when dealing with small organisms. While long-read sequencing technologies are the methods of choice for creating high quality genome assemblies, pure short-read assemblies might bear most of the coding parts of a genome but are usually much more fragmented and do not resolve repeat elements or structural variants well. Several genome initiatives produce more and more non-model organism genomes and provide rules for standards in genome sequencing and assembly. However, sometimes the organism of choice is not part of such an initiative or does not meet its standards. Therefore, if the scientific question can be answered with a genome of low contiguity in intergenic parts, missing the high standards of chromosome scale assembly should not prevent publication. This review describes how to set up an animal genome sequencing project in the lab, how to estimate costs and resources, and how to deal with suboptimal conditions. Thus, we aim to suggest optimal strategies for genome sequencing that fulfil the needs according to specific research questions, e.g. "How are species related to each other based on whole genomes?" (phylogenomics), "How do genomes of populations within a species differ?" (population genomics), "Are differences between populations relevant for conservation?" (conservation genomics), "Which selection pressure is acting on certain genes?" (identification of genes under selection), "Did repeats expand or contract recently?" (repeat dynamics).
Affiliation(s)
- Tilman Schell
  - LOEWE Centre for Translational Biodiversity Genomics, Senckenberganlage 25, 60325, Frankfurt, Germany
  - Senckenberg Research Institute, Senckenberganlage 25, 60325, Frankfurt, Germany
- Carola Greve
  - LOEWE Centre for Translational Biodiversity Genomics, Senckenberganlage 25, 60325, Frankfurt, Germany
  - Senckenberg Research Institute, Senckenberganlage 25, 60325, Frankfurt, Germany
- Lars Podsiadlowski
  - LIB, Museum Koenig Bonn, Centre for Molecular Biodiversity Research (zmb), Adenauerallee 127, 53113, Bonn, Germany
5
Zhang E, Coombe L, Wong J, Warren RL, Birol I. GoldPolish-target: targeted long-read genome assembly polishing. BMC Bioinformatics 2025; 26:78. [PMID: 40055584 PMCID: PMC11887200 DOI: 10.1186/s12859-025-06091-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/17/2024] [Accepted: 02/19/2025] [Indexed: 03/12/2025] Open
Abstract
BACKGROUND Advanced long-read sequencing technologies, such as those from Oxford Nanopore Technologies and Pacific Biosciences, are finding a wide use in de novo genome sequencing projects. However, long reads typically have higher error rates relative to short reads. If left unaddressed, subsequent genome assemblies may exhibit high base error rates that compromise the reliability of downstream analysis. Several specialized error correction tools for genome assemblies have since emerged, employing a range of algorithms and strategies to improve base quality. However, despite these efforts, many genome assembly workflows still produce regions with elevated error rates, such as gaps filled with unpolished or ambiguous bases. To address this, we introduce GoldPolish-Target, a modular targeted sequence polishing pipeline. Coupled with GoldPolish, a linear-time genome assembly algorithm, GoldPolish-Target isolates and polishes user-specified assembly loci, offering a resource-efficient means for polishing targeted regions of draft genomes. RESULTS Experiments using Drosophila melanogaster and Homo sapiens datasets demonstrate that GoldPolish-Target can reduce insertion/deletion (indel) and mismatch errors by up to 49.2% and 55.4% respectively, achieving base accuracy values upwards of 99.9% (Phred score Q > 30). This polishing accuracy is comparable to the current state-of-the-art, Medaka, while exhibiting up to 27-fold shorter run times and consuming 95% less memory, on average. CONCLUSION GoldPolish-Target, in contrast to most other polishing tools, offers the ability to target specific regions of a genome assembly for polishing, providing a computationally light-weight and highly scalable solution for base error correction.
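The abstract above equates a base accuracy of 99.9% with a Phred score of Q30. As a minimal sketch of that relationship, assuming the standard Phred definition Q = -10 · log10(p_error), the conversion can be written as below. The function names are illustrative only and are not taken from GoldPolish-Target or any other tool named in the abstract.

```python
import math


def phred_from_accuracy(accuracy: float) -> float:
    """Convert a per-base accuracy (strictly between 0 and 1) to a Phred score.

    Uses the standard definition Q = -10 * log10(error probability).
    """
    if not 0.0 < accuracy < 1.0:
        raise ValueError("accuracy must be strictly between 0 and 1")
    error_probability = 1.0 - accuracy
    return -10.0 * math.log10(error_probability)


def accuracy_from_phred(q: float) -> float:
    """Convert a Phred score back to a per-base accuracy."""
    return 1.0 - 10.0 ** (-q / 10.0)


if __name__ == "__main__":
    # 99.9% accuracy is one expected error per 1,000 bases, i.e. about Q30.
    print(round(phred_from_accuracy(0.999), 2))
    print(round(accuracy_from_phred(30), 4))
```

Under this definition, each additional 10 Phred units corresponds to a tenfold reduction in the expected per-base error rate, so "Q > 30" means fewer than one expected error per 1,000 bases.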
Affiliation(s)
- Emily Zhang
  - Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC, V5Z 4S6, Canada
- Lauren Coombe
  - Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC, V5Z 4S6, Canada
- Johnathan Wong
  - Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC, V5Z 4S6, Canada
- René L Warren
  - Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC, V5Z 4S6, Canada
- Inanç Birol
  - Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC, V5Z 4S6, Canada
6
Jeanjean S, Shen Y, Hardy L, Daunay A, Delépine M, Gerber Z, Alberdi A, Tubacher E, Deleuze JF, How-Kit A. A detailed analysis of second and third-generation sequencing approaches for accurate length determination of short tandem repeats and homopolymers. Nucleic Acids Res 2025; 53:gkaf131. [PMID: 40036507 PMCID: PMC11878640 DOI: 10.1093/nar/gkaf131] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/12/2024] [Revised: 01/13/2025] [Accepted: 02/11/2025] [Indexed: 03/06/2025] Open
Abstract
Microsatellites are short tandem repeats (STRs) of a motif of 1-6 nucleotides that are ubiquitous in almost all genomes and widely used in many biomedical applications. However, despite the development of next-generation sequencing (NGS) over the past two decades with new technologies coming to the market, accurately sequencing and genotyping STRs, particularly homopolymers, remain very challenging today due to several technical limitations. This leads in many cases to erroneous allele calls and difficulty in correctly identifying the genuine allele distribution in a sample. Here, we assessed several second and third-generation sequencing approaches in their capability to correctly determine the length of microsatellites using plasmids containing A/T homopolymers, AC/TG or AT/TA dinucleotide STRs of variable length. Standard polymerase chain reaction (PCR)-free and PCR-containing, single Unique Molecular Identifier (UMI) and dual UMI 'duplex sequencing' protocols were evaluated using Illumina short-read sequencing, and two PCR-free protocols using PacBio and Oxford Nanopore Technologies long-read sequencing. Several bioinformatics algorithms were developed to correctly identify microsatellite alleles from sequencing data, including four and two modes for generating standard and combined consensus alleles, respectively. We provided a detailed analysis and comparison of these approaches and made several recommendations for the accurate determination of microsatellite allele length.
Affiliation(s)
- Sophie I Jeanjean
  - Laboratory for Genomics, Foundation Jean Dausset – CEPH, 75010 Paris, France
- Yimin Shen
  - Laboratory for Bioinformatics, Foundation Jean Dausset – CEPH, 75010 Paris, France
- Lise M Hardy
  - Laboratory for Genomics, Foundation Jean Dausset – CEPH, 75010 Paris, France
- Antoine Daunay
  - Laboratory for Genomics, Foundation Jean Dausset – CEPH, 75010 Paris, France
- Marc Delépine
  - Centre National de Recherche en Génomique Humaine (CNRGH), CEA, Institut François Jacob, 91000 Evry, France
- Zuzana Gerber
  - Centre National de Recherche en Génomique Humaine (CNRGH), CEA, Institut François Jacob, 91000 Evry, France
- Antonio Alberdi
  - Technological Platform of Saint-Louis Research Institute (IRSL), Saint-Louis Hospital, University of Paris, 75010 Paris, France
- Emmanuel Tubacher
  - Laboratory for Bioinformatics, Foundation Jean Dausset – CEPH, 75010 Paris, France
- Jean-François Deleuze
  - Laboratory for Genomics, Foundation Jean Dausset – CEPH, 75010 Paris, France
  - Laboratory for Bioinformatics, Foundation Jean Dausset – CEPH, 75010 Paris, France
  - Centre National de Recherche en Génomique Humaine (CNRGH), CEA, Institut François Jacob, 91000 Evry, France
- Alexandre How-Kit
  - Laboratory for Genomics, Foundation Jean Dausset – CEPH, 75010 Paris, France
7
Ortigas-Vasquez A, Bowen CD, Renner DW, Baigent SJ, Zhang Y, Yao Y, Nair V, Kennedy DA, Szpara ML. High-Fidelity Long-Read Sequencing of an Avian Herpesvirus Reveals Extensive Intrapopulation Diversity in Tandem Repeat Regions. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2025:2025.02.10.637388. [PMID: 39990410 PMCID: PMC11844383 DOI: 10.1101/2025.02.10.637388] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/25/2025]
Abstract
Comparative genomic studies of Marek's disease virus (MDV) have suggested that attenuated and virulent strains share >98% sequence identity. However, these estimates fail to account for variation in regions of the MDV genome harboring tandem repeats. To resolve these loci and enable assessments of intrapopulation diversity, we used a PacBio Sequel II platform to sequence MDV strains CVI988/Rispens (attenuated), HPRS-B14 (virulent), Md5 (very virulent) and 675A (very virulent plus). This approach enabled us to identify patterns of variation in tandem repeat regions that are consistent with known phenotypic differences across these strains. We also found CVI988/Rispens variants showing a 4.3-kb deletion in the Unique Short (US) region, resulting in the loss of five genes. These findings support a potential link between MDV tandem repeats and phenotypic traits like virulence and attenuation, and demonstrate that DNA viruses can harbor high levels of intrapopulation diversity in tandem repeat regions.
Affiliation(s)
- Alejandro Ortigas-Vasquez
  - Departments of Biology, Center for Infectious Disease Dynamics and the Huck Institutes of the Life Sciences, Pennsylvania State University, University Park, Pennsylvania 16802, USA
- Christopher D. Bowen
  - Departments of Biology, Center for Infectious Disease Dynamics and the Huck Institutes of the Life Sciences, Pennsylvania State University, University Park, Pennsylvania 16802, USA
- Daniel W. Renner
  - Departments of Biology, Center for Infectious Disease Dynamics and the Huck Institutes of the Life Sciences, Pennsylvania State University, University Park, Pennsylvania 16802, USA
- Susan J. Baigent
  - Viral Oncogenesis Group, The Pirbright Institute, Woking, UK, GU24 0NF
- Yaoyao Zhang
  - Viral Oncogenesis Group, The Pirbright Institute, Woking, UK, GU24 0NF
- Yongxiu Yao
  - Viral Oncogenesis Group, The Pirbright Institute, Woking, UK, GU24 0NF
- Venugopal Nair
  - Viral Oncogenesis Group, The Pirbright Institute, Woking, UK, GU24 0NF
- David A. Kennedy
  - Departments of Biology, Center for Infectious Disease Dynamics and the Huck Institutes of the Life Sciences, Pennsylvania State University, University Park, Pennsylvania 16802, USA
- Moriah L. Szpara
  - Departments of Biology, Center for Infectious Disease Dynamics and the Huck Institutes of the Life Sciences, Pennsylvania State University, University Park, Pennsylvania 16802, USA
  - Biochemistry and Molecular Biology, Center for Infectious Disease Dynamics and the Huck Institutes of the Life Sciences, Pennsylvania State University, University Park, Pennsylvania 16802, USA
8
Bai D, Chen T, Xun J, Ma C, Luo H, Yang H, Cao C, Cao X, Cui J, Deng Y, Deng Z, Dong W, Dong W, Du J, Fang Q, Fang W, Fang Y, Fu F, Fu M, Fu Y, Gao H, Ge J, Gong Q, Gu L, Guo P, Guo Y, Hai T, Liu H, He J, He Z, Hou H, Huang C, Ji S, Jiang C, Jiang G, Jiang L, Jin LN, Kan Y, Kang D, Kou J, Lam K, Li C, Li C, Li F, Li L, Li M, Li X, Li Y, Li Z, Liang J, Lin Y, Liu C, Liu D, Liu F, Liu J, Liu T, Liu T, Liu X, Liu Y, Liu B, Liu M, Lou W, Luan Y, Luo Y, Lv H, Ma T, Mai Z, Mo J, Niu D, Pan Z, Qi H, Shi Z, Song C, Sun F, Sun Y, Tian S, Wan X, Wang G, Wang H, Wang H, Wang H, Wang J, Wang J, Wang K, Wang L, Wang S, Wang X, Wang Y, Xiao Z, Xing H, Xu Y, Yan S, Yang L, Yang S, Yang Y, Yao X, Yousuf S, Yu H, Lei Y, Yuan Z, Zeng M, Zhang C, Zhang C, Zhang H, Zhang J, Zhang N, Zhang T, Zhang Y, Zhang Y, Zhang Z, Zhou M, Zhou Y, Zhu C, Zhu L, Zhu Y, Zhu Z, Zou H, Zuo A, Dong W, Wen T, Chen S, Li G, Gao Y, Liu Y. EasyMetagenome: A user-friendly and flexible pipeline for shotgun metagenomic analysis in microbiome research. IMETA 2025; 4:e70001. [PMID: 40027489 PMCID: PMC11865343 DOI: 10.1002/imt2.70001] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 11/29/2024] [Accepted: 01/22/2025] [Indexed: 03/05/2025]
Abstract
Shotgun metagenomics has become a pivotal technology in microbiome research, enabling in-depth analysis of microbial communities at both the high-resolution taxonomic and functional levels. This approach provides valuable insights into microbial diversity, interactions, and their roles in health and disease. However, the complexity of data processing and the need for reproducibility pose significant challenges to researchers. To address these challenges, we developed EasyMetagenome, a user-friendly pipeline that supports multiple analysis methods, including quality control and host removal, read-based, assembly-based, and binning, along with advanced genome analysis. The pipeline also features customizable settings, comprehensive data visualizations, and detailed parameter explanations, ensuring its adaptability across a wide range of data scenarios. Looking forward, we aim to refine the pipeline by addressing host contamination issues, optimizing workflows for third-generation sequencing data, and integrating emerging technologies like deep learning and network analysis, to further enhance microbiome insights and data accuracy. EasyMetagenome is freely available at https://github.com/YongxinLiu/EasyMetagenome.
Collapse
Affiliation(s)
- Defeng Bai
- Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural SciencesShenzhenGuangdongChina
| | - Tong Chen
- State Key Laboratory for Quality Ensurance and Sustainable Use of Dao‐di Herbs, National Resource Center for Chinese Materia Medica, China Academy of Chinese Medical SciencesBeijingChina
| | - Jiani Xun
- Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural SciencesShenzhenGuangdongChina
| | - Chuang Ma
- Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural SciencesShenzhenGuangdongChina
- School of HorticultureAnhui Agricultural UniversityHefeiChina
| | - Hao Luo
- Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural SciencesShenzhenGuangdongChina
| | - Haifei Yang
- Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural SciencesShenzhenGuangdongChina
- College of Life SciencesQingdao Agricultural UniversityQingdaoChina
| | - Chen Cao
- Key Laboratory for Bio‐Electromagnetic Environment and Advanced Medical Theranostics, School of Biomedical Engineering and InformaticsNanjing Medical UniversityNanjingJiangsuChina
| | - Xiaofeng Cao
- Center for Water and Ecology, State Key Joint Laboratory of Environment Simulation and Pollution Control, School of EnvironmentTsinghua UniversityBeijingChina
| | - Jianzhou Cui
- Immunology Translational Research Programme, Yong Loo Lin School of MedicineNational University of SingaporeSingaporeSingapore
| | - Yuan‐Ping Deng
- Research Center for Parasites and Vectors, College of Veterinary MedicineHunan Agricultural UniversityChangshaHunanChina
| | - Zhaochao Deng
- Institute of Marine Biology and Pharmacology, Ocean CollegeZhejiang UniversityZhoushanZhejiangChina
| | - Wenxin Dong
- Agro‐Environmental Protection InstituteMinistry of Agriculture and Rural AffairsTianjinChina
| | - Wenxue Dong
- Key Laboratory for Molecular Genetic Mechanisms and Intervention Research on High Altitude Disease of Tibet Autonomous Region, School of MedicineXizang Minzu UniversityXianyangChina
| | - Juan Du
- Karolinska Institutet, Department of Microbiology, Tumor and Cell BiologyStockholmSweden
| | - Qunkai Fang
- College of EnvironmentZhejiang University of TechnologyHangzhouChina
| | - Wei Fang
- College of Environmental and Resource SciencesZhejiang Agriculture and Forestry UniversityHangzhouChina
| | - Yue Fang
- The College of ForestryBeijing Forestry UniversityBeijingChina
| | - Fangtian Fu
- Department of Bioinformatics, Hangzhou VicrobX Biotech Co., LtdHangzhouZhejiangChina
| | - Min Fu
- Anhui Province Key Laboratory of Integrated Pest Management on Crops, College of Plant ProtectionAnhui Agricultural UniversityHefeiChina
| | - Yi‐Tian Fu
- Xiangya School of Basic MedicineCentral South UniversityChangshaHunanChina
| | - He Gao
- Institute of Microbiology,Guangdong Academy of SciencesGuangzhouGuangdongChina
| | - Jingping Ge
- Engineering Research Center of Agricultural Microbiology Technology, Ministry of Education, School of Life SciencesHeilongjiang UniversityHarbinChina
| | - Qinglong Gong
- College of Animal Science and TechnologyJilin Agricultural UniversityChangchunJilinChina
| | - Lunda Gu
- Sansure Biotech IncorporationChangshaHunanChina
| | - Peng Guo
- School of Food Science and BiologyHebei University of Science and TechnologyShijiazhuangHebeiChina
| | - Yuhao Guo
- Engineering Research Center of Agricultural Microbiology Technology, Ministry of Education, School of Life SciencesHeilongjiang UniversityHarbinChina
| | - Tang Hai
- School of Life SciencesShanxi Datong UniversityDatongChina
| | - Hao Liu
- Department of Health & Environmental SciencesXi'an Jiaotong‐Liverpool UniversitySuzhouJiangsuChina
| | - Jieqiang He
- College of HorticultureNorthwest A&F UniversityYanglingShaanxiChina
| | - Zi‐Yang He
- School of Agriculture, Food and Ecosystem Sciences, Faculty of ScienceThe University of MelbourneVICAustralia
| | - Huiyu Hou
- Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural SciencesShenzhenGuangdongChina
| | - Can Huang
- Graduate School of Frontier SciencesThe University of TokyoKashiwa‐shi, ChibaJapan
| | - Shuai Ji
- Institute of Biotechnology, Helsinki Institute of Life ScienceUniversity of HelsinkiHelsinkiFinland
| | | | - Gui‐Lai Jiang
- Suzhou Medical College, Soochow University, Suzhou, Jiangsu, China
| | - Lingjuan Jiang
- Biomarker Discovery and Validation Facility, Institute of Clinical Medicine, Peking Union Medical College Hospital, Beijing, China
| | - Ling N. Jin
- Department of Civil and Environmental Engineering, The Hong Kong Polytechnic University, Hong Kong, China
| | - Yuhe Kan
- College of Biology and Oceanography, Weifang University, Weifang, Shandong, China
| | - Da Kang
- College of Environmental Science and Engineering, Beijing University of Technology, Beijing, China
| | - Jin Kou
- College of Environmental and Municipal Engineering, Lanzhou Jiaotong University, Lanzhou, China
| | - Ka‐Lung Lam
- School of Life Sciences, The Chinese University of Hong Kong, Shatin, Hong Kong, China
| | - Changchao Li
- Department of Civil and Environmental Engineering, The Hong Kong Polytechnic University, Hong Kong, China
| | - Chong Li
- Department of Renewable Resources, University of Alberta, Edmonton, Alberta, Canada
| | - Fuyi Li
- School of Geographical Sciences, Northeast Normal University, Changchun, Jilin, China
| | - Liwei Li
- Department of Gastroenterology, The Second Affiliated Hospital of Guangxi Medical University, Nanning, Guangxi, China
| | - Miao Li
- Synaura Biotechnology (Shanghai) Co., Ltd., Shanghai, China
| | - Xin Li
- School of Public Health, University of Michigan, Ann Arbor, Michigan, USA
| | - Ye Li
- Institute of Soil Science, Chinese Academy of Sciences, Nanjing, Jiangsu, China
| | - Zheng‐Tao Li
- School of Art and Archaeology of Zhejiang University, Zhejiang, China
| | - Jing Liang
- College of Animal Science and Technology, Guangxi University, Nanning, China
| | - Yongxin Lin
- Fujian Provincial Key Laboratory for Subtropical Resources and Environment, Fujian Normal University, Fuzhou, China
| | - Changzhen Liu
- College of Energy and Environmental Engineering, Hebei University of Engineering, Handan, Hebei, China
| | | | - Fengqin Liu
- College of Life Sciences, Henan Agricultural University, Zhengzhou, China
| | - Jia Liu
- College of Life Science, Nankai University, Tianjin, China
| | - Tianrui Liu
- Jiangxi Province Key Laboratory of Sustainable Utilization of Traditional Chinese Medicine Resources, Institute of Traditional Chinese Medicine Health Industry, China Academy of Chinese Medical Sciences, Jiangxi, China
| | - Tingting Liu
- Beijing Key Laboratory of Emerging Infectious Diseases, Institute of Infectious Diseases, Beijing Ditan Hospital, Capital Medical University, Beijing, China
| | - Xinyuan Liu
- State Key Laboratory of Tea Plant Biology and Utilization, Anhui Agricultural University, Hefei, Anhui, China
| | - Yaqun Liu
- School of Life Sciences and Food Technology, Hanshan Normal University, Chaozhou, China
| | | | - Minghao Liu
- State Key Laboratory of Microbial Resources, Institute of Microbiology, Chinese Academy of Sciences, Beijing, China
| | - Wenbo Lou
- College of Animal Science and Technology, Jilin Agricultural University, Changchun, Jilin, China
| | - Yaning Luan
- The College of Forestry, Beijing Forestry University, Beijing, China
| | - Yuanyuan Luo
- State Key Laboratory of Tea Plant Biology and Utilization, Anhui Agricultural University, Hefei, Anhui, China
| | - Hujie Lv
- Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, Guangdong, China
- Department of Life Sciences, Imperial College of London, London, UK
| | - Tengfei Ma
- State Key Laboratory of Herbage Improvement and Grassland Agro‐Ecosystems, Centre for Grassland Microbiome, College of Pastoral Agriculture Science and Technology, Lanzhou University, Lanzhou, Gansu, China
| | - Zongjiong Mai
- Department of Oncology, The Fifth Affiliated Hospital of Sun Yat‐sen University, Zhuhai, Guangdong, China
| | - Jiayuan Mo
- College of Animal Science and Technology, Guangxi University, Nanning, China
| | - Dongze Niu
- National‐Local Joint Engineering Research Center of Biomass Refining and High‐Quality Utilization, Institute of Urban and Rural Mining, Changzhou University, Changzhou, Jiangsu, China
| | - Zhuo Pan
- Department of Pathology, Affiliated Cancer Hospital of Zhengzhou University, Zhengzhou, China
| | - Heyuan Qi
- Institute of Microbiology, Chinese Academy of Sciences, Beijing, China
| | - Zhanyao Shi
- College of Water Sciences, Beijing Normal University, Beijing, China
| | | | - Fuxiang Sun
- New Direction Biotechnology (Tianjin) Co., Ltd, Tianjin, China
| | - Yan Sun
- College of Energy and Environmental Engineering, Hebei Key Laboratory of Air Pollution Cause and Impact, Hebei University of Engineering, Handan, China
| | - Sihui Tian
- Institute of Botany, Chinese Academy of Sciences, Beijing, China
| | - Xiulin Wan
- Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, Guangdong, China
| | - Guoliang Wang
- Institute of Biotechnology, Beijing Academy of Agriculture and Forestry Sciences, Beijing, China
| | - Hongyang Wang
- National Resource Center for Chinese Materia Medica, China Academy of Chinese Medical Sciences, Jiangsu, China
| | - Hongyu Wang
- College of Animal Science, Anhui Science and Technology University, Chuzhou, China
| | - Huanhuan Wang
- State Key Laboratory for Biology of Plant Diseases and Insect Pests, Institute of Plant Protection, Chinese Academy of Agricultural Sciences, Beijing, China
| | - Jing Wang
- State Key Laboratory of Environmental Criteria and Risk Assessment, Chinese Research Academy of Environmental Sciences, Beijing, China
| | - Jun Wang
- China CDC Key Laboratory of Environment and Population Health, National Institute of Environmental Health, Chinese Center for Disease Control and Prevention, Beijing, China
| | - Kang Wang
- College of Animal Science and Technology, Yangzhou University, Yangzhou, Jiangsu, China
| | - Leli Wang
- Key Laboratory of Agro‐Ecological Processes in Subtropical Region, Institute of Subtropical Agriculture, Chinese Academy of Sciences, Changsha, China
| | - Shao‐kun Wang
- Institute of Ecological Conservation and Restoration, Chinese Academy of Forestry, Beijing, China
| | - Xinlong Wang
- Beijing Key Laboratory of Emerging Infectious Diseases, Institute of Infectious Diseases, Beijing Ditan Hospital, Capital Medical University, Beijing, China
| | - Yao Wang
- Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, Guangdong, China
| | - Zufei Xiao
- State Key Laboratory for Ecological Security of Regions and Cities, Institute of Urban Environment, Chinese Academy of Sciences, Xiamen, China
| | - Huichun Xing
- Center of Liver Diseases Division 3, Beijing Ditan Hospital, Capital Medical University, Beijing, China
| | - Yifan Xu
- Center of Liver Diseases Division 3, Beijing Ditan Hospital, Capital Medical University, Beijing, China
| | - Shu‐yan Yan
- State Key Laboratory for Biology of Plant Diseases and Insect Pests, Key Laboratory of Invasive Alien Species Control of Ministry of Agriculture and Rural Affairs, Institute of Plant Protection, Chinese Academy of Agricultural Sciences, Beijing, China
| | - Li Yang
- Sansure Biotech Incorporation, Changsha, Hunan, China
| | - Song Yang
- Center of Liver Diseases Division 3, Beijing Ditan Hospital, Capital Medical University, Beijing, China
| | - Yuanming Yang
- Guangzhou University of Chinese Medicine, Guangzhou, China
| | - Xiaofang Yao
- Key Laboratory of Agro‐Ecological Processes in Subtropical Region, Institute of Subtropical Agriculture, Chinese Academy of Sciences, Changsha, China
| | - Salsabeel Yousuf
- Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, Guangdong, China
| | - Hao Yu
- Institute of Marine Biology and Pharmacology, Ocean College, Zhejiang University, Zhoushan, Zhejiang, China
| | - Yu Lei
- Key Laboratory of Livestock Biology, Northwest A&F University, Yangling, Shaanxi, China
| | - Zhengrong Yuan
- College of Biological Sciences and Technology, Beijing Forestry University, Beijing, China
| | - Meiyin Zeng
- Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, Guangdong, China
| | - Chunfang Zhang
- Institute of Marine Biology and Pharmacology, Ocean College, Zhejiang University, Zhoushan, Zhejiang, China
| | - Chunge Zhang
- CAS Key Laboratory of Pathogenic Microbiology and Immunology, Institute of Microbiology, Chinese Academy of Sciences, Beijing, China
| | - Huimin Zhang
- School of Food Science and Technology, Shihezi University, Shihezi, Xinjiang, China
| | | | - Na Zhang
- College of Biochemical Engineering, Beijing Union University, Beijing, China
| | - Tianyuan Zhang
- Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, Guangdong, China
| | - Yi‐Bo Zhang
- State Key Laboratory for Biology of Plant Diseases and Insect Pests, Key Laboratory of Invasive Alien Species Control of Ministry of Agriculture and Rural Affairs, Institute of Plant Protection, Chinese Academy of Agricultural Sciences, Beijing, China
| | - Yupeng Zhang
- College of Resources and Environmental Sciences, Henan Agricultural University, Zhengzhou, China
| | - Zheng Zhang
- Tea Research Institute, Chinese Academy of Agricultural Sciences, Hangzhou, Zhejiang, China
| | - Mingda Zhou
- College of Environmental Science and Engineering, Tongji University, Shanghai, China
| | - Yuanping Zhou
- Zhanjiang Key Laboratory of Human Microecology and Clinical Translation Research, the Marine Biomedical Research Institute, College of Basic Medicine, Guangdong Medical University, Zhanjiang, Guangdong, China
| | - Chengshuai Zhu
- School of Art and Archaeology of Zhejiang University, Zhejiang, China
| | - Lin Zhu
- State Key Laboratory of Urban Water Resource and Environment, School of Environment, Harbin Institute of Technology, Harbin, China
| | - Yue Zhu
- School of Ecology, Environment and Resources, Guangdong University of Technology, Guangzhou, Guangdong, China
| | - Zhihao Zhu
- Zhanjiang Key Laboratory of Human Microecology and Clinical Translation Research, the Marine Biomedical Research Institute, College of Basic Medicine, Guangdong Medical University, Zhanjiang, Guangdong, China
| | - Hongqin Zou
- Institute of Agricultural Resources and Regional Planning, Chinese Academy of Agricultural Sciences, Beijing, China
| | - Anna Zuo
- School of Traditional Chinese Medicine, Southern Medical University, Guangzhou, Guangdong, China
| | - Wenxuan Dong
- Department of Animal Sciences, Purdue University, West Lafayette, Indiana, USA
| | - Tao Wen
- College of Resource and Environmental Sciences, Nanjing Agricultural University, Nanjing, Jiangsu, China
| | - Shifu Chen
- HaploX Biotechnology, Shenzhen, China
- LifeX Institute, School of Medical Technology, Gannan Medical University, Ganzhou, China
- Faculty of Data Science, City University of Macau, Macau, China
| | - Guoliang Li
- Jiangxi Provincial Key Laboratory of Conservation Biology, College of Forestry, Jiangxi Agricultural University, Nanchang, Jiangxi, China
| | - Yunyun Gao
- Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, Guangdong, China
| | - Yong‐Xin Liu
- Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, Guangdong, China
|
9
|
Sullivan D, Hjörleifsson K, Swarna N, Oakes C, Holley G, Melsted P, Pachter L. Accurate quantification of nascent and mature RNAs from single-cell and single-nucleus RNA-seq. Nucleic Acids Res 2025; 53:gkae1137. [PMID: 39657125 PMCID: PMC11724275 DOI: 10.1093/nar/gkae1137] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/06/2024] [Revised: 10/28/2024] [Accepted: 12/05/2024] [Indexed: 12/14/2024] Open
Abstract
In single-cell and single-nucleus RNA sequencing (RNA-seq), the coexistence of nascent (unprocessed) and mature (processed) messenger RNA (mRNA) poses challenges in accurate read mapping and the interpretation of count matrices. The traditional transcriptome reference, defining the "region of interest" in bulk RNA-seq, restricts its focus to mature mRNA transcripts. This restriction leads to two problems: reads originating outside of the "region of interest" are prone to mismapping within this region, and additionally, such external reads cannot be matched to specific transcript targets. Expanding the "region of interest" to encompass both nascent and mature mRNA transcript targets provides a more comprehensive framework for RNA-seq analysis. Here, we introduce the concept of distinguishing flanking k-mers (DFKs) to improve mapping of sequencing reads. We have developed an algorithm to identify DFKs, which serve as a sophisticated "background filter", enhancing the accuracy of mRNA quantification. This dual strategy of an expanded region of interest coupled with the use of DFKs enhances the precision in quantifying both mature and nascent mRNA molecules, as well as in delineating reads of ambiguous status.
Affiliation(s)
- Delaney K Sullivan
- Division of Biology and Biological Engineering, California Institute of Technology, 1200 E California Blvd, Pasadena, CA 91125, USA
- UCLA-Caltech Medical Scientist Training Program, David Geffen School of Medicine, University of California, Los Angeles, 885 Tiverton Drive, Los Angeles, CA 90095, USA
| | - Kristján Eldjárn Hjörleifsson
- Department of Computing and Mathematical Sciences, California Institute of Technology, 1200 E California Blvd, Pasadena, CA 91125, USA
| | - Nikhila P Swarna
- Division of Biology and Biological Engineering, California Institute of Technology, 1200 E California Blvd, Pasadena, CA 91125, USA
| | - Conrad Oakes
- Division of Biology and Biological Engineering, California Institute of Technology, 1200 E California Blvd, Pasadena, CA 91125, USA
| | - Guillaume Holley
- deCODE Genetics/Amgen Inc., Sturlugata 8, 101 Reykjavík, Iceland
| | - Páll Melsted
- deCODE Genetics/Amgen Inc., Sturlugata 8, 101 Reykjavík, Iceland
- School of Engineering and Natural Sciences, University of Iceland, Sæmundargata 2, 102 Reykjavík, Iceland
| | - Lior Pachter
- Division of Biology and Biological Engineering, California Institute of Technology, 1200 E California Blvd, Pasadena, CA 91125, USA
- Department of Computing and Mathematical Sciences, California Institute of Technology, 1200 E California Blvd, Pasadena, CA 91125, USA
|
10
|
Dubois B, Delitte M, Lengrand S, Bragard C, Legrève A, Debode F. PRONAME: a user-friendly pipeline to process long-read nanopore metabarcoding data by generating high-quality consensus sequences. Frontiers in Bioinformatics 2024; 4:1483255. [PMID: 39758955 PMCID: PMC11695402 DOI: 10.3389/fbinf.2024.1483255] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/19/2024] [Accepted: 11/27/2024] [Indexed: 01/07/2025] Open
Abstract
Background: The study of sample taxonomic composition has evolved from direct observations and labor-intensive morphological studies to different DNA sequencing methodologies. Most of these studies leverage the metabarcoding approach, which involves the amplification of a small taxonomically-informative portion of the genome and its subsequent high-throughput sequencing. Recent advances in sequencing technology brought by Oxford Nanopore Technologies have revolutionized the field, enabling portability, affordable cost and long-read sequencing, therefore leading to a significant increase in taxonomic resolution. However, Nanopore sequencing data exhibit a particular profile, with a higher error rate compared with Illumina sequencing, and existing bioinformatics pipelines for the analysis of such data are scarce and often insufficient, requiring specialized tools to accurately process long-read sequences.
Results: We present PRONAME (PROcessing NAnopore MEtabarcoding data), an open-source, user-friendly pipeline optimized for processing raw Nanopore sequencing data. PRONAME includes precompiled databases for complete 16S sequences (Silva138 and Greengenes2) and a newly developed and curated database dedicated to bacterial 16S-ITS-23S operon sequences. The user can also provide a custom database if desired, therefore enabling the analysis of metabarcoding data for any domain of life. The pipeline significantly improves sequence accuracy, implementing innovative error-correction strategies and taking advantage of the new sequencing chemistry to produce high-quality duplex reads. Evaluations using a mock community have shown that PRONAME delivers consensus sequences demonstrating at least 99.5% accuracy with standard settings (and up to 99.7%), making it a robust tool for genomic analysis of complex multi-species communities.
Conclusion: PRONAME meets the challenges of long-read Nanopore data processing, offering greater accuracy and versatility than existing pipelines. By integrating Nanopore-specific quality filtering, clustering and error correction, PRONAME produces high-precision consensus sequences. This brings the accuracy of Nanopore sequencing close to that of Illumina sequencing, while taking advantage of the benefits of long-read technologies.
Affiliation(s)
- Benjamin Dubois
- Bioengineering Unit, Life Sciences Department, Walloon Agricultural Research Centre, Gembloux, Belgium
| | - Mathieu Delitte
- Earth and Life Institute – Applied Microbiology, Plant Health, UCLouvain, Louvain-la-Neuve, Belgium
| | - Salomé Lengrand
- Earth and Life Institute – Applied Microbiology, Plant Health, UCLouvain, Louvain-la-Neuve, Belgium
| | - Claude Bragard
- Earth and Life Institute – Applied Microbiology, Plant Health, UCLouvain, Louvain-la-Neuve, Belgium
| | - Anne Legrève
- Earth and Life Institute – Applied Microbiology, Plant Health, UCLouvain, Louvain-la-Neuve, Belgium
| | - Frédéric Debode
- Bioengineering Unit, Life Sciences Department, Walloon Agricultural Research Centre, Gembloux, Belgium
|
11
|
Moeckel C, Mareboina M, Konnaris MA, Chan CS, Mouratidis I, Montgomery A, Chantzi N, Pavlopoulos GA, Georgakopoulos-Soares I. A survey of k-mer methods and applications in bioinformatics. Comput Struct Biotechnol J 2024; 23:2289-2303. [PMID: 38840832 PMCID: PMC11152613 DOI: 10.1016/j.csbj.2024.05.025] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/13/2024] [Revised: 05/14/2024] [Accepted: 05/15/2024] [Indexed: 06/07/2024] Open
Abstract
The rapid progression of genomics and proteomics has been driven by the advent of advanced sequencing technologies, large, diverse, and readily available omics datasets, and the evolution of computational data processing capabilities. The vast amount of data generated by these advancements necessitates efficient algorithms to extract meaningful information. K-mers serve as a valuable tool when working with large sequencing datasets, offering several advantages in computational speed and memory efficiency and carrying the potential for intrinsic biological functionality. This review provides an overview of the methods, applications, and significance of k-mers in genomic and proteomic data analyses, as well as the utility of absent sequences, including nullomers and nullpeptides, in disease detection, vaccine development, therapeutics, and forensic science. Therefore, the review highlights the pivotal role of k-mers in addressing current genomic and proteomic problems and underscores their potential for future breakthroughs in research.
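As a minimal illustration of the k-mer idea summarized in this abstract, the Python sketch below counts overlapping k-mers in a DNA string and lists the k-mers of a given length that never occur. It is an assumption-laden toy, not code from the review: the function names `count_kmers` and `find_absent_kmers` and the example sequence are hypothetical, and "absent k-mers at a fixed k" is a simplification of how nullomers are usually defined.

```python
from collections import Counter
from itertools import product


def count_kmers(sequence: str, k: int) -> Counter:
    """Count every overlapping k-mer in a DNA sequence (hypothetical helper)."""
    sequence = sequence.upper()
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def find_absent_kmers(sequence: str, k: int) -> list[str]:
    """List all k-mers over the ACGT alphabet that do not occur in the sequence."""
    observed = count_kmers(sequence, k)
    all_kmers = ("".join(p) for p in product("ACGT", repeat=k))
    return sorted(kmer for kmer in all_kmers if kmer not in observed)


# Toy sequence, chosen only for illustration.
toy = "ACGTACGTTGCA"
counts = count_kmers(toy, 2)
absent = find_absent_kmers(toy, 2)
print(counts.most_common(3))
print(absent)
```

Enumerating all 4^k candidates is only practical for small k; tools that work at genome scale rely on hashing or compact indexes instead.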
Affiliation(s)
- Camille Moeckel
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
| | - Manvita Mareboina
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
| | - Maxwell A. Konnaris
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
| | - Candace S.Y. Chan
- Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, San Francisco, CA, USA
| | - Ioannis Mouratidis
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Huck Institute of the Life Sciences, Penn State University, University Park, Pennsylvania, USA
| | - Austin Montgomery
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
| | - Nikol Chantzi
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
| | | | - Ilias Georgakopoulos-Soares
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Huck Institute of the Life Sciences, Penn State University, University Park, Pennsylvania, USA
|
12
|
Li H, Feng Y, Xu Y, Li T, Li Q, Lin W, Ni W, Yang J, Mao W, Wang Z, Xing H. Characterization of a novel HIV-1 second-generation circulating recombinant form (CRF172_0755) among men who have sex with men in China. J Infect 2024; 89:106345. [PMID: 39489180 DOI: 10.1016/j.jinf.2024.106345] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/23/2024] [Revised: 10/28/2024] [Accepted: 10/29/2024] [Indexed: 11/05/2024]
Affiliation(s)
- Huan Li
- National Key Laboratory of Intelligent Tracking and Forecasting for Infectious Diseases, National Center for AIDS/STD Control and Prevention, Chinese Center for Disease Control and Prevention, Beijing, China
| | - Yi Feng
- National Key Laboratory of Intelligent Tracking and Forecasting for Infectious Diseases, National Center for AIDS/STD Control and Prevention, Chinese Center for Disease Control and Prevention, Beijing, China
| | - Yang Xu
- MGI Tech, Shenzhen 518083, China
| | - Tang Li
- MGI Tech, Shenzhen 518083, China
| | - Qi Li
- National Key Laboratory of Intelligent Tracking and Forecasting for Infectious Diseases, National Center for AIDS/STD Control and Prevention, Chinese Center for Disease Control and Prevention, Beijing, China
| | - Wei Lin
- BGI Research, Hangzhou 310030, China; BGI Hangzhou CycloneSEQ Technology Co., Ltd., Hangzhou 310030, China
| | - Wanqi Ni
- National Key Laboratory of Intelligent Tracking and Forecasting for Infectious Diseases, National Center for AIDS/STD Control and Prevention, Chinese Center for Disease Control and Prevention, Beijing, China
| | | | | | - Zheng Wang
- National Key Laboratory of Intelligent Tracking and Forecasting for Infectious Diseases, National Center for AIDS/STD Control and Prevention, Chinese Center for Disease Control and Prevention, Beijing, China.
| | - Hui Xing
- National Key Laboratory of Intelligent Tracking and Forecasting for Infectious Diseases, National Center for AIDS/STD Control and Prevention, Chinese Center for Disease Control and Prevention, Beijing, China.
|
13
|
Carneiro AD, Schaffer DV. Engineering novel adeno-associated viruses (AAVs) for improved delivery in the nervous system. Curr Opin Chem Biol 2024; 83:102532. [PMID: 39342684 DOI: 10.1016/j.cbpa.2024.102532] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/02/2024] [Revised: 08/27/2024] [Accepted: 09/03/2024] [Indexed: 10/01/2024]
Abstract
Harnessing adeno-associated virus (AAV) vectors for therapeutic gene delivery has emerged as a progressively promising strategy to treat disorders of both the central nervous system (CNS) and peripheral nervous system (PNS), and there are many ongoing clinical trials. However, unique physiological and molecular characteristics of the CNS and PNS pose obstacles to efficient vector delivery, ranging from the blood-brain barrier to the diverse nature of nervous system disorders. Engineering novel AAV capsids may help overcome these ongoing challenges and maximize therapeutic transgene delivery. This article discusses strategies for innovative AAV capsid development, highlighting recent advances. Notably, advances in next generation sequencing and machine learning have sparked new approaches for capsid investigation and engineering. Furthermore, we outline future directions and additional challenges in AAV-mediated gene therapy in the CNS and PNS.
Affiliation(s)
- Ana D Carneiro
- Department of Chemical and Biomolecular Engineering, University of California, Berkeley, CA 94720, USA
| | - David V Schaffer
- Department of Chemical and Biomolecular Engineering, University of California, Berkeley, CA 94720, USA; Department of Bioengineering, University of California, Berkeley, CA 94720, USA; Department of Molecular and Cell Biology, University of California, Berkeley, CA 94720, USA; California Institute for Quantitative Biosciences, University of California, Berkeley, CA 94720, USA; Helen Wills Neuroscience Institute, University of California, Berkeley, CA 94720, USA.
|
14
|
Zhu XT, Sanz-Jimenez P, Ning XT, Tahir Ul Qamar M, Chen LL. Direct RNA sequencing in plants: Practical applications and future perspectives. Plant Communications 2024; 5:101064. [PMID: 39155503 PMCID: PMC11589328 DOI: 10.1016/j.xplc.2024.101064] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/13/2024] [Revised: 07/17/2024] [Accepted: 08/14/2024] [Indexed: 08/20/2024]
Abstract
The transcriptome serves as a bridge that links genomic variation to phenotypic diversity. A vast number of studies using next-generation RNA sequencing (RNA-seq) over the last 2 decades have emphasized the essential roles of the plant transcriptome in response to developmental and environmental conditions, providing numerous insights into the dynamic changes, evolutionary traces, and elaborate regulation of the plant transcriptome. With substantial improvement in accuracy and throughput, direct RNA sequencing (DRS) has emerged as a new and powerful sequencing platform for precise detection of native and full-length transcripts, overcoming many limitations such as read length and PCR bias that are inherent to short-read RNA-seq. Here, we review recent advances in dissecting the complexity and diversity of plant transcriptomes using DRS as the main technological approach, covering many aspects of RNA metabolism, including novel isoforms, poly(A) tails, and RNA modification, and we propose a comprehensive workflow for processing of plant DRS data. Many challenges to the application of DRS in plants, such as the need for machine learning tools tailored to plant transcriptomes, remain to be overcome, and together we outline future biological questions that can be addressed by DRS, such as allele-specific RNA modification. This technology provides convenient support on which the connection of distinct RNA features is tightly built, sustainably refining our understanding of the biological functions of the plant transcriptome.
Affiliation(s)
- Xi-Tong Zhu
- State Key Laboratory for Conservation and Utilization of Subtropical Agro-bioresources, College of Life Science and Technology, Guangxi University, Nanning 530004, China.
| | - Pablo Sanz-Jimenez
- National Key Laboratory of Crop Genetic Improvement, College of Informatics, Huazhong Agricultural University, Wuhan 430070, China
| | - Xiao-Tong Ning
- State Key Laboratory for Conservation and Utilization of Subtropical Agro-bioresources, College of Life Science and Technology, Guangxi University, Nanning 530004, China
| | - Muhammad Tahir Ul Qamar
- Integrative Omics and Molecular Modeling Laboratory, Department of Bioinformatics and Biotechnology, Government College University Faisalabad (GCUF), Faisalabad 38000, Pakistan
| | - Ling-Ling Chen
- State Key Laboratory for Conservation and Utilization of Subtropical Agro-bioresources, College of Life Science and Technology, Guangxi University, Nanning 530004, China.
|
15
|
Smith GJ, van Alen TA, van Kessel MA, Lücker S. Simple, reference-independent assessment to empirically guide correction and polishing of hybrid microbial community metagenomic assembly. PeerJ 2024; 12:e18132. [PMID: 39529629 PMCID: PMC11552494 DOI: 10.7717/peerj.18132] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/18/2023] [Accepted: 08/29/2024] [Indexed: 11/16/2024] Open
Abstract
Hybrid metagenomic assembly of microbial communities, leveraging both long- and short-read sequencing technologies, is becoming an increasingly accessible approach, yet its widespread application faces several challenges. High-quality references may not be available for assembly accuracy comparisons common for benchmarking, and certain aspects of hybrid assembly may benefit from dataset-dependent, empiric guidance rather than the application of a uniform approach. In this study, several simple, reference-free characteristics, particularly coding gene content and read recruitment profiles, were hypothesized to be reliable indicators of assembly quality improvement during iterative error-fixing processes. These characteristics were compared to reference-dependent genome- and gene-centric analyses common for microbial community metagenomic studies. Two laboratory-scale bioreactors were sequenced with short- and long-read platforms, and assembled with commonly used software packages. Following long-read assembly, long-read correction and short-read polishing were iterated up to ten times to resolve errors. These iterative processes were shown to have a substantial effect on gene- and genome-centric community compositions. Simple, reference-free assembly characteristics, specifically changes in gene fragmentation and short-read recruitment, were robustly correlated with advanced analyses common in published comparative studies, and therefore are suitable proxies for hybrid metagenome assembly quality to simplify the identification of the optimal number of correction and polishing iterations. As hybrid metagenomic sequencing approaches will likely remain relevant due to the low added cost of short-read sequencing for differential coverage binning or the ability to access lower abundance community members, it is imperative that users are equipped to estimate assembly quality prior to downstream analyses.
Affiliation(s)
- Garrett J. Smith
- Department of Microbiology, The Ohio State University, Columbus, OH, United States of America
- Department of Microbiology, Radboud Institute for Biological and Environmental Sciences, Radboud University, Nijmegen, Netherlands
- Theo A. van Alen
- Department of Microbiology, Radboud Institute for Biological and Environmental Sciences, Radboud University, Nijmegen, Netherlands
- Maartje A.H.J. van Kessel
- Department of Microbiology, Radboud Institute for Biological and Environmental Sciences, Radboud University, Nijmegen, Netherlands
- Sebastian Lücker
- Department of Microbiology, Radboud Institute for Biological and Environmental Sciences, Radboud University, Nijmegen, Netherlands
16
|
Guizar Amador MF, Darragh K, Liu JW, Dean C, Bogarín D, Pérez-Escobar OA, Serracín Z, Pupulin F, Ramírez SR. The Gongora gibba genome assembly provides new insights into the evolution of floral scent in male euglossine bee-pollinated orchids. G3 (BETHESDA, MD.) 2024; 14:jkae211. [PMID: 39231006 PMCID: PMC11540329 DOI: 10.1093/g3journal/jkae211] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/04/2024] [Accepted: 08/22/2024] [Indexed: 09/06/2024]
Abstract
Orchidaceae is one of the most prominent flowering plant families, with many species exhibiting highly specialized reproductive and ecological adaptations. An estimated 10% of orchid species in the American tropics are pollinated by scent-collecting male euglossine bees; however, to date, there are no published genomes of species within this pollination syndrome. In this study, we present the first draft genome of an epiphytic orchid from the genus Gongora, a representative of the male euglossine bee-pollinated subtribe Stanhopeinae. The 1.83-Gb de novo genome with a scaffold N50 of 1.7 Mb was assembled using short- and long-read sequencing and chromosome capture (Hi-C) information. Over 17,000 genes were annotated, and 82.95% of the genome was identified as repetitive content. Furthermore, we identified and manually annotated 26 terpene synthase genes linked to floral scent biosynthesis and performed a phylogenetic analysis with other published orchid terpene synthase genes. The Gongora gibba genome assembly will serve as the foundation for future research to understand the genetic basis of floral scent biosynthesis and diversification in orchids.
Affiliation(s)
- Kathy Darragh
- Department of Evolution and Ecology, University of California, Davis, Davis, CA 95616, USA
- Jasen W Liu
- Department of Evolution and Ecology, University of California, Davis, Davis, CA 95616, USA
- Cheryl Dean
- Department of Evolution and Ecology, University of California, Davis, Davis, CA 95616, USA
- Diego Bogarín
- Lankester Botanical Garden, University of Costa Rica, P.O. Box 302-7050, Cartago 30109, Costa Rica
- Evolutionary Ecology Group, Naturalis Biodiversity Center, 2333 CR Leiden, The Netherlands
- Oscar A Pérez-Escobar
- Lankester Botanical Garden, University of Costa Rica, P.O. Box 302-7050, Cartago 30109, Costa Rica
- Royal Botanic Gardens, Kew, Richmond, Surrey TW9 3AE, UK
- Zuleika Serracín
- Herbario UCH, Universidad Autónoma de Chiriquí, P.O. Box 0427, David, Chiriquí 0427, Panamá
- Franco Pupulin
- Lankester Botanical Garden, University of Costa Rica, P.O. Box 302-7050, Cartago 30109, Costa Rica
- Santiago R Ramírez
- Department of Evolution and Ecology, University of California, Davis, Davis, CA 95616, USA
- Lankester Botanical Garden, University of Costa Rica, P.O. Box 302-7050, Cartago 30109, Costa Rica
17
|
Kang X, Zhang W, Li Y, Luo X, Schönhuth A. HyLight: Strain aware assembly of low coverage metagenomes. Nat Commun 2024; 15:8665. [PMID: 39375348 PMCID: PMC11458758 DOI: 10.1038/s41467-024-52907-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/21/2023] [Accepted: 09/23/2024] [Indexed: 10/09/2024] Open
Abstract
Different strains of the same species can vary substantially in terms of their spectrum of biomedically relevant phenotypes. Reconstructing the genomes of microbial communities at the level of their strains poses significant challenges, because sequencing errors can obscure strain-specific variants. Next-generation sequencing (NGS) reads are too short to resolve complex genomic regions. Third-generation sequencing (TGS) reads, although longer, are prone to higher error rates or substantially more expensive. Limiting TGS coverage to reduce costs compromises the accuracy of the assemblies. This explains why prior approaches suffer from losses in strain awareness or accuracy, from a tendency toward excessive costs, or from combinations thereof. We introduce HyLight, a metagenome assembly approach that addresses these challenges by exploiting the complementary strengths of TGS and NGS data. HyLight employs strain-resolved overlap graphs (OG) to accurately reconstruct individual strains within microbial communities. Our experiments demonstrate that HyLight produces strain-aware and contiguous assemblies at minimal error content, while significantly reducing costs by using low-coverage TGS data. HyLight achieves an average improvement of 19.05% in preserving strain identity and demonstrates near-complete strain awareness across diverse datasets. In summary, HyLight offers considerable advances in metagenome assembly, insofar as it delivers significantly enhanced strain awareness, contiguity, and accuracy without the typical compromises observed in existing approaches.
Affiliation(s)
- Xiongbin Kang
- College of Biology, Hunan University, Changsha, China
- Genome Data Science, Faculty of Technology, Bielefeld University, Bielefeld, Germany
- Wenhai Zhang
- College of Biology, Hunan University, Changsha, China
- Yichen Li
- College of Computer Science and Electronic Engineering, Hunan University, Changsha, China
- Xiao Luo
- College of Biology, Hunan University, Changsha, China
- Alexander Schönhuth
- Genome Data Science, Faculty of Technology, Bielefeld University, Bielefeld, Germany
18
|
Lok S, Lau TNH, Trost B, Tong AHY, Paton T, Wintle RF, Engstrom MD, Gunn A, Scherer SW. Chromosomal-level reference genome assembly of muskox (Ovibos moschatus) from Banks Island in the Canadian Arctic, a resource for conservation genomics. Sci Rep 2024; 14:21023. [PMID: 39284808 PMCID: PMC11405533 DOI: 10.1038/s41598-024-67270-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/27/2024] [Accepted: 07/09/2024] [Indexed: 09/20/2024] Open
Abstract
The muskox (Ovibos moschatus), an integral component and iconic symbol of arctic biocultural diversity, is under threat by rapid environmental disruptions from climate change. We report a chromosomal-level haploid genome assembly of a muskox from Banks Island in the Canadian Arctic Archipelago. The assembly has a contig N50 of 44.7 Mbp, a scaffold N50 of 112.3 Mbp, a complete representation (100%) of the BUSCO v5.2.2 set of 9225 mammalian marker genes and is anchored to the 24 chromosomes of the muskox. Tabulation of heterozygous single nucleotide variants in our specimen revealed a very low level of genetic diversity, which is consistent with recent reports of the muskox having the lowest genome-wide heterozygosity among the ungulates. While muskox populations are currently showing no overt signs of inbreeding depression, environmental disruptions are expected to strain the genomic resilience of the species. One notable impact of rapid climate change in the Arctic is the spread of emerging infectious and parasitic diseases in the muskox, as exemplified by the range expansion of muskox lungworms, and the recent fatal outbreaks of Erysipelothrix rhusiopathiae, a pathogen normally associated with domestic swine and poultry. As a genomics resource for conservation management of the muskox against existing and emerging disease modalities, we annotated the genes of the major histocompatibility complex on chromosome 2 and performed an initial assessment of the genetic diversity of this complex. This resource is further supported by the annotation of the principal genes of the innate immunity system, genes that are rapidly evolving and under positive selection in the muskox, genes associated with environmental adaptations, and the genes associated with socioeconomic benefits for Arctic communities such as wool (qiviut) attributes. These annotations will benefit muskox management and conservation.
Affiliation(s)
- Si Lok
- The Centre for Applied Genomics, Peter Gilgan Centre for Research and Learning, The Hospital for Sick Children, 686 Bay Street, Rm 13.9713, Suite 03-6577, Toronto, ON, M5G 0A4, Canada
- Program in Genetics and Genome Biology, The Hospital for Sick Children, Toronto, ON, M5G 0A4, Canada
- Timothy N H Lau
- The Centre for Applied Genomics, Peter Gilgan Centre for Research and Learning, The Hospital for Sick Children, 686 Bay Street, Rm 13.9713, Suite 03-6577, Toronto, ON, M5G 0A4, Canada
- Program in Genetics and Genome Biology, The Hospital for Sick Children, Toronto, ON, M5G 0A4, Canada
- Brett Trost
- The Centre for Applied Genomics, Peter Gilgan Centre for Research and Learning, The Hospital for Sick Children, 686 Bay Street, Rm 13.9713, Suite 03-6577, Toronto, ON, M5G 0A4, Canada
- Program in Genetics and Genome Biology, The Hospital for Sick Children, Toronto, ON, M5G 0A4, Canada
- Program in Molecular Medicine, The Hospital for Sick Children, Toronto, ON, M5G 0A4, Canada
- Amy H Y Tong
- Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, ON, M5S 3E1, Canada
- Tara Paton
- The Centre for Applied Genomics, Peter Gilgan Centre for Research and Learning, The Hospital for Sick Children, 686 Bay Street, Rm 13.9713, Suite 03-6577, Toronto, ON, M5G 0A4, Canada
- Program in Genetics and Genome Biology, The Hospital for Sick Children, Toronto, ON, M5G 0A4, Canada
- Richard F Wintle
- The Centre for Applied Genomics, Peter Gilgan Centre for Research and Learning, The Hospital for Sick Children, 686 Bay Street, Rm 13.9713, Suite 03-6577, Toronto, ON, M5G 0A4, Canada
- Program in Genetics and Genome Biology, The Hospital for Sick Children, Toronto, ON, M5G 0A4, Canada
- Mark D Engstrom
- Department of Natural History, Royal Ontario Museum, Toronto, ON, M5S 2C6, Canada
- Stephen W Scherer
- The Centre for Applied Genomics, Peter Gilgan Centre for Research and Learning, The Hospital for Sick Children, 686 Bay Street, Rm 13.9713, Suite 03-6577, Toronto, ON, M5G 0A4, Canada
- Program in Genetics and Genome Biology, The Hospital for Sick Children, Toronto, ON, M5G 0A4, Canada
- McLaughlin Centre, University of Toronto, Toronto, ON, M5G 0A4, Canada
- Department of Molecular Genetics, Faculty of Medicine, University of Toronto, Toronto, ON, M5S 1A8, Canada
19
|
Li H, Durbin R. Genome assembly in the telomere-to-telomere era. Nat Rev Genet 2024; 25:658-670. [PMID: 38649458 DOI: 10.1038/s41576-024-00718-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 02/27/2024] [Indexed: 04/25/2024]
Abstract
Genome sequences largely determine the biology and encode the history of an organism, and de novo assembly - the process of reconstructing the genome sequence of an organism from sequencing reads - has been a central problem in bioinformatics for four decades. Until recently, genomes were typically assembled into fragments of a few megabases at best, but now technological advances in long-read sequencing enable the near-complete assembly of each chromosome - also known as telomere-to-telomere assembly - for many organisms. Here, we review recent progress on assembly algorithms and protocols, with a focus on how to derive near-telomere-to-telomere assemblies. We also discuss the additional developments that will be required to resolve remaining assembly gaps and to assemble non-diploid genomes.
Affiliation(s)
- Heng Li
- Department of Data Science, Dana-Farber Cancer Institute, Boston, MA, USA
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
- Richard Durbin
- Department of Genetics, Cambridge University, Cambridge, UK
20
|
Anthony WE, Allison SD, Broderick CM, Chavez Rodriguez L, Clum A, Cross H, Eloe-Fadrosh E, Evans S, Fairbanks D, Gallery R, Gontijo JB, Jones J, McDermott J, Pett-Ridge J, Record S, Rodrigues JLM, Rodriguez-Reillo W, Shek KL, Takacs-Vesbach T, Blanchard JL. From soil to sequence: filling the critical gap in genome-resolved metagenomics is essential to the future of soil microbial ecology. ENVIRONMENTAL MICROBIOME 2024; 19:56. [PMID: 39095861 PMCID: PMC11295382 DOI: 10.1186/s40793-024-00599-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/12/2024] [Accepted: 07/22/2024] [Indexed: 08/04/2024]
Abstract
Soil microbiomes are heterogeneous, complex microbial communities. Metagenomic analysis is generating vast amounts of data, creating immense challenges in sequence assembly and analysis. Although advances in technology have resulted in the ability to easily collect large amounts of sequence data, soil samples containing thousands of unique taxa are often poorly characterized. These challenges reduce the usefulness of genome-resolved metagenomic (GRM) analysis seen in other fields of microbiology, such as the creation of high quality metagenomic assembled genomes and the adoption of genome scale modeling approaches. The absence of these resources restricts the scale of future research, limiting hypothesis generation and the predictive modeling of microbial communities. Creating publicly available databases of soil MAGs, similar to databases produced for other microbiomes, has the potential to transform scientific insights about soil microbiomes without requiring the computational resources and domain expertise for assembly and binning.
Affiliation(s)
- Steven D Allison
- University of California Irvine, Irvine, CA, USA
- Department of Earth System Science, University of California, Irvine, CA, USA
- Caitlin M Broderick
- W.K. Kellogg Biological Station, Michigan State University, Hickory Corners, MI, USA
- Alicia Clum
- Lawrence Berkeley National Laboratory, Berkeley, CA, USA
- Hugh Cross
- National Ecological Observatory Network - Battelle, Boulder, CO, USA
- Sarah Evans
- W.K. Kellogg Biological Station, Michigan State University, Hickory Corners, MI, USA
- Dawson Fairbanks
- University of California Riverside, Riverside, CA, USA
- The University of Arizona, Tucson, AZ, USA
- Jennifer Jones
- W.K. Kellogg Biological Station, Michigan State University, Hickory Corners, MI, USA
- Jason McDermott
- Pacific Northwest National Laboratory, Richland, WA, 99354, USA
- Jennifer Pett-Ridge
- Lawrence Livermore National Laboratory, Livermore, CA, USA
- Life & Environmental Sciences Department, University of California Merced, Merced, CA, 95343, USA
21
|
Djeghout B, Le-Viet T, Martins LDO, Savva GM, Evans R, Baker D, Page A, Elumogo N, Wain J, Janecko N. Capturing clinically relevant Campylobacter attributes through direct whole genome sequencing of stool. Microb Genom 2024; 10:001284. [PMID: 39213166 PMCID: PMC11570993 DOI: 10.1099/mgen.0.001284] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/20/2024] [Accepted: 07/31/2024] [Indexed: 09/04/2024] Open
Abstract
Campylobacter is the leading bacterial cause of infectious intestinal disease, but the pathogen typically accounts for a very small proportion of the overall stool microbiome in each patient. Diagnosis is even more difficult due to the fastidious nature of Campylobacter in the laboratory setting. This has, in part, driven a change in recent years, from culture-based to rapid PCR-based diagnostic assays which have improved diagnostic detection, whilst creating a knowledge gap in our clinical and epidemiological understanding of Campylobacter genotypes - no isolates to sequence. In this study, direct metagenomic sequencing approaches were used to assess the possibility of replacing genome sequences with metagenome sequences; metagenomic sequencing outputs were used to describe clinically relevant attributes of Campylobacter genotypes. A total of 37 diarrhoeal stool samples with Campylobacter and five samples with an unknown pathogen result were collected and processed with and without filtration, DNA was extracted, and metagenomes were sequenced by short-read sequencing. Culture-based methods were used to validate Campylobacter metagenome-derived genome (MDG) results. Sequence output metrics were assessed for Campylobacter genome quality and accuracy of characterization. Of the 42 samples passing quality checks for analysis, identification of Campylobacter to the genus and species level was dependent on Campylobacter genome read count, coverage and genome completeness. A total of 65% (24/37) of samples were reliably identified to the genus level through Campylobacter MDG, 73% (27/37) by culture and 97% (36/37) by qPCR. The Campylobacter genomes with a genome completeness of over 60% (n=21) were all accurately identified at the species level (100%). Of those, 72% (15/21) were identified to sequence types (STs), and 95% (20/21) accurately identified antimicrobial resistance (AMR) gene determinants. 
Filtration of stool samples enhanced Campylobacter MDG recovery and genome quality metrics compared to the corresponding unfiltered samples, which improved the identification of STs and AMR profiles. The phylogenetic analysis in this study demonstrated the clustering of the metagenome-derived with culture-derived genomes and revealed the reliability of genomes from direct stool sequencing. Furthermore, Campylobacter genome spiking percentages ranging from 0 to 2% total metagenome abundance in the ONT MinION sequencer, configured to adaptive sequencing, exhibited better assembly quality and accurate identification of STs, particularly in the analysis of metagenomes containing 2 and 1% of Campylobacter jejuni genomes. Direct sequencing of Campylobacter from stool samples provides clinically relevant and epidemiologically important genomic information without the reliance on cultured genomes.
Affiliation(s)
- Bilal Djeghout
- Quadram Institute Bioscience, Norwich Research Park, Norwich NR4 7UQ, UK
- Thanh Le-Viet
- Quadram Institute Bioscience, Norwich Research Park, Norwich NR4 7UQ, UK
- George M. Savva
- Quadram Institute Bioscience, Norwich Research Park, Norwich NR4 7UQ, UK
- Rhiannon Evans
- Quadram Institute Bioscience, Norwich Research Park, Norwich NR4 7UQ, UK
- David Baker
- Quadram Institute Bioscience, Norwich Research Park, Norwich NR4 7UQ, UK
- Andrew Page
- Quadram Institute Bioscience, Norwich Research Park, Norwich NR4 7UQ, UK
- Ngozi Elumogo
- Quadram Institute Bioscience, Norwich Research Park, Norwich NR4 7UQ, UK
- Eastern Pathology Alliance, Norfolk and Norwich University Hospital, Norwich NR4 7UY, UK
- John Wain
- Quadram Institute Bioscience, Norwich Research Park, Norwich NR4 7UQ, UK
- Norwich Medical School, University of East Anglia, Norwich NR4 7TJ, UK
- Nicol Janecko
- Quadram Institute Bioscience, Norwich Research Park, Norwich NR4 7UQ, UK
22
|
Shelton WJ, Zandpazandi S, Nix JS, Gokden M, Bauer M, Ryan KR, Wardell CP, Vaske OM, Rodriguez A. Long-read sequencing for brain tumors. Front Oncol 2024; 14:1395985. [PMID: 38915364 PMCID: PMC11194609 DOI: 10.3389/fonc.2024.1395985] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/05/2024] [Accepted: 05/27/2024] [Indexed: 06/26/2024] Open
Abstract
Brain tumors and genomics have a long-standing history given that glioblastoma was the first cancer studied by The Cancer Genome Atlas. The numerous and continuous advances through the decades in sequencing technologies have aided in the advanced molecular characterization of brain tumors for diagnosis, prognosis, and treatment. Since the incorporation of molecular biomarkers into the WHO classification of CNS tumors in 2016, the genomics of brain tumors has been integrated into diagnostic criteria. Long-read sequencing, also known as third-generation sequencing, is an emerging technique that allows for the sequencing of longer DNA segments, leading to improved detection of structural variants and epigenetic modifications. These capabilities are opening a way for better characterization of brain tumors. Here, we present a comprehensive summary of the state of the art of third-generation sequencing as applied to brain tumor diagnosis, prognosis, and treatment. We discuss the advantages and potential new implementations of long-read sequencing into clinical paradigms for neuro-oncology patients.
Affiliation(s)
- William J Shelton
- Department of Neurosurgery, College of Medicine, University of Arkansas for Medical Sciences, Little Rock, AR, United States
- Sara Zandpazandi
- Department of Neurosurgery, Medical University of South Carolina, Charleston, SC, United States
- J Stephen Nix
- Department of Pathology, University of Arkansas for Medical Sciences, Little Rock, AR, United States
- Murat Gokden
- Department of Pathology, University of Arkansas for Medical Sciences, Little Rock, AR, United States
- Michael Bauer
- Department of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, AR, United States
- Katie Rose Ryan
- Department of Biochemistry and Molecular Biology, University of Arkansas for Medical Sciences, Little Rock, AR, United States
- Christopher P Wardell
- Department of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, AR, United States
- Olena Morozova Vaske
- Department of Molecular, Cell and Developmental Biology, University of California Santa Cruz, Santa Cruz, CA, United States
- Analiz Rodriguez
- Department of Neurosurgery, College of Medicine, University of Arkansas for Medical Sciences, Little Rock, AR, United States
23
|
Wang R, Chen J. NmTHC: a hybrid error correction method based on a generative neural machine translation model with transfer learning. BMC Genomics 2024; 25:573. [PMID: 38849740 PMCID: PMC11157743 DOI: 10.1186/s12864-024-10446-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/25/2023] [Accepted: 05/22/2024] [Indexed: 06/09/2024] Open
Abstract
BACKGROUND The single-pass long reads generated by third-generation sequencing technology exhibit a higher error rate. However, circular consensus sequencing (CCS) produces shorter reads. Thus, it is effective to manage the error rate of long reads algorithmically with the help of the homologous high-precision and low-cost short reads from Next Generation Sequencing (NGS) technology. METHODS In this work, a hybrid error correction method (NmTHC) based on a generative neural machine translation model is proposed to automatically capture discrepancies within the aligned regions of long reads and short reads, as well as the contextual relationships within the long reads themselves for error correction. Akin to natural language sequences, the long read can be regarded as a special "genetic language" and be processed with the idea of generative neural networks. The algorithm builds a sequence-to-sequence (seq2seq) framework with a Recurrent Neural Network (RNN) as the core layer. The pre- and post-correction long reads are regarded as the sentences in the source and target language of translation, and the alignment information of long reads with short reads is used to create the special corpus for training. The well-trained model can be used to predict the corrected long read. RESULTS NmTHC outperforms the latest mainstream hybrid error correction methods on real-world datasets from two mainstream platforms, PacBio and Nanopore. Our experimental evaluation results demonstrate that NmTHC can align more bases with the reference genome without any segmenting in the six benchmark datasets, proving that it enhances alignment identity without sacrificing any length advantages of long reads. 
CONCLUSION Consequently, NmTHC reasonably adopts the generative Neural Machine Translation (NMT) model to transform hybrid error correction tasks into machine translation problems and provides a novel perspective for solving long-read error correction problems with the ideas of Natural Language Processing (NLP). More remarkably, the proposed methodology is sequencing-technology-independent and can produce more precise reads.
Affiliation(s)
- Rongshu Wang
- Department of Electronic Engineering, Information School, Yunnan University, Kunming, Yunnan, China
- Jianhua Chen
- Department of Electronic Engineering, Information School, Yunnan University, Kunming, Yunnan, China
24
|
Wattanasombat S, Tongjai S. Easing genomic surveillance: A comprehensive performance evaluation of long-read assemblers across multi-strain mixture data of HIV-1 and Other pathogenic viruses for constructing a user-friendly bioinformatic pipeline. F1000Res 2024; 13:556. [PMID: 38984017 PMCID: PMC11231628 DOI: 10.12688/f1000research.149577.1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 05/14/2024] [Indexed: 07/11/2024] Open
Abstract
Background Determining the appropriate computational requirements and software performance is essential for efficient genomic surveillance. The lack of standardized benchmarking complicates software selection, especially with limited resources. Methods We developed a containerized benchmarking pipeline to evaluate seven long-read assemblers (Canu, GoldRush, MetaFlye, Strainline, HaploDMF, iGDA, and RVHaplo) for viral haplotype reconstruction, using both simulated and experimental Oxford Nanopore sequencing data of HIV-1 and other viruses. Benchmarking was conducted on three computational systems to assess each assembler's performance, utilizing QUAST and BLASTN for quality assessment. Results Our findings show that assembler choice significantly impacts assembly time, with CPU and memory usage having minimal effect. Assembler selection also influences the size of the contigs, with a minimum read length of 2,000 nucleotides required for quality assembly. A 4,000-nucleotide read length improves quality further. Canu was efficient among de novo assemblers but not suitable for multi-strain mixtures, while GoldRush produced only consensus assemblies. Strainline and MetaFlye were suitable for metagenomic sequencing data, with Strainline requiring high memory and MetaFlye operable on low-specification machines. Among reference-based assemblers, iGDA had high error rates, RVHaplo showed the best runtime and accuracy but became ineffective with similar sequences, and HaploDMF, utilizing machine learning, had fewer errors with a slightly longer runtime. Conclusions The HIV-64148 pipeline, containerized using Docker, facilitates easy deployment and offers flexibility to select from a range of assemblers to match computational systems or study requirements. This tool aids in genome assembly and provides valuable information on HIV-1 sequences, enhancing viral evolution monitoring and understanding.
Affiliation(s)
- Sara Wattanasombat
- Department of Microbiology, Faculty of Medicine, Chiang Mai University, Chiang Mai, 50200, Thailand
- Siripong Tongjai
- Department of Microbiology, Faculty of Medicine, Chiang Mai University, Chiang Mai, 50200, Thailand
25
|
Szakállas N, Barták BK, Valcz G, Nagy ZB, Takács I, Molnár B. Can long-read sequencing tackle the barriers, which the next-generation could not? A review. Pathol Oncol Res 2024; 30:1611676. [PMID: 38818014 PMCID: PMC11137202 DOI: 10.3389/pore.2024.1611676] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/10/2024] [Accepted: 04/30/2024] [Indexed: 06/01/2024]
Abstract
The large-scale heterogeneity of genetic diseases necessitated a deeper examination of nucleotide sequence alterations, enhancing the discovery of new targets for drug therapy. The appearance of new sequencing techniques was essential to obtain more interpretable genomic data. In contrast to earlier short reads, longer reads can provide better insight into potentially health-threatening genetic abnormalities. Long reads offer more accurate variant identification and genome assembly methods, indicating advances in nucleotide defect-related studies. In this review, we introduce the historical background of sequencing technologies and show their benefits and limits. Furthermore, we highlight the differences between short- and long-read approaches, including their unique advances and difficulties in methodologies and evaluation. Additionally, we provide a detailed description of the corresponding bioinformatics and the current applications.
Affiliation(s)
- Nikolett Szakállas
- Department of Biological Physics, Faculty of Science, Eötvös Loránd University, Budapest, Hungary
- Barbara K. Barták
- Department of Internal Medicine and Oncology, Faculty of Medicine, Semmelweis University, Budapest, Hungary
- Gábor Valcz
- Department of Internal Medicine and Oncology, Faculty of Medicine, Semmelweis University, Budapest, Hungary
- HUN-REN-SU Translational Extracellular Vesicle Research Group, Budapest, Hungary
- Zsófia B. Nagy
- Department of Internal Medicine and Oncology, Faculty of Medicine, Semmelweis University, Budapest, Hungary
- István Takács
- Department of Internal Medicine and Oncology, Faculty of Medicine, Semmelweis University, Budapest, Hungary
- Béla Molnár
- Department of Internal Medicine and Oncology, Faculty of Medicine, Semmelweis University, Budapest, Hungary
26
|
Kim C, Pongpanich M, Porntaveetus T. Unraveling metagenomics through long-read sequencing: a comprehensive review. J Transl Med 2024; 22:111. [PMID: 38282030 PMCID: PMC10823668 DOI: 10.1186/s12967-024-04917-1] [Citation(s) in RCA: 19] [Impact Index Per Article: 19.0] [Received: 10/15/2023] [Accepted: 01/21/2024] [Indexed: 01/30/2024] Open
Abstract
The study of microbial communities has undergone significant advancements, starting from the initial use of 16S rRNA sequencing to the adoption of shotgun metagenomics. However, a new era has emerged with the advent of long-read sequencing (LRS), which offers substantial improvements over its predecessor, short-read sequencing (SRS). LRS produces reads that are several kilobases long, enabling researchers to obtain more complete and contiguous genomic information, characterize structural variations, and study epigenetic modifications. The current leaders in LRS technologies are Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT), each offering a distinct set of advantages. This review covers the workflow of long-read metagenomics sequencing, including sample preparation (sample collection, sample extraction, and library preparation), sequencing, processing (quality control, assembly, and binning), and analysis (taxonomic annotation and functional annotation). Each section provides a concise outline of the key concept of the methodology, presenting the original concept as well as how it is challenged or modified in the context of LRS. Additionally, each section introduces a range of tools that are compatible with LRS and can be utilized to execute the LRS process. This review aims to present the workflow of metagenomics, highlight the transformative impact of LRS, and provide researchers with a selection of tools suitable for this task.
Affiliation(s)
- Chankyung Kim
- Center of Excellence in Genomics and Precision Dentistry, Department of Physiology, Faculty of Dentistry, Chulalongkorn University, Bangkok, Thailand
- Graduate Program in Bioinformatics and Computational Biology, Faculty of Science, Chulalongkorn University, Bangkok, Thailand
| | - Monnat Pongpanich
- Department of Mathematics and Computer Science, Faculty of Science, Chulalongkorn University, Bangkok, Thailand
- Center of Excellence for Cancer and Inflammation, Chulalongkorn University, Bangkok, Thailand
| | - Thantrira Porntaveetus
- Center of Excellence in Genomics and Precision Dentistry, Department of Physiology, Faculty of Dentistry, Chulalongkorn University, Bangkok, Thailand.
- Graduate Program in Geriatric and Special Patients Care, Faculty of Dentistry, Chulalongkorn University, Bangkok, Thailand.
| |
|
27
|
Heath HD, Peng S, Szmatola T, Bellone RR, Kalbfleisch T, Petersen JL, Finno CJ. A Comprehensive Allele Specific Expression Resource for the Equine Transcriptome. bioRxiv 2024:2023.12.31.573798. [PMID: 38260378 PMCID: PMC10802363 DOI: 10.1101/2023.12.31.573798] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/24/2024]
Abstract
BACKGROUND Allele-specific expression (ASE) analysis provides a nuanced view of cis-regulatory mechanisms affecting gene expression. RESULTS In this work, we introduce and highlight the significance of an equine ASE analysis, containing integrated long- and short-read RNA sequencing data, along with insight from histone modification data, from four healthy Thoroughbreds (2 mares and 2 stallions) across 9 tissues. CONCLUSIONS This valuable publicly accessible resource is poised to facilitate investigations into regulatory variation in equine tissues and foster a deeper understanding of the impact of allelic imbalance in equine health and disease at the molecular level.
|
28
|
Schäfer L, Jehle JA, Kleespies RG, Wennmann JT. A practical guide and Galaxy workflow to avoid inter-plasmidic repeat collapse and false gene loss in Unicycler's hybrid assemblies. Microb Genom 2024; 10:001173. [PMID: 38197876 PMCID: PMC10868617 DOI: 10.1099/mgen.0.001173] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/07/2023] [Accepted: 12/18/2023] [Indexed: 01/11/2024] Open
Abstract
Generating complete, high-quality genome assemblies is key for any downstream analysis, such as comparative genomics. For bacterial genome assembly, various algorithms and fully automated pipelines exist, which are free-of-charge and easily accessible. However, these assembly tools often cannot unambiguously resolve a bacterial genome, for example due to the presence of sequence repeat structures on the chromosome or on plasmids. Then, a more sophisticated approach and/or manual curation is needed. Such modifications can be challenging, especially for non-bioinformaticians, because they are generally not considered as a straightforward process. In this study, we propose a standardized approach for manual genome completion focusing on the popular hybrid assembly pipeline Unicycler. The provided Galaxy workflow addresses two weaknesses in Unicycler's hybrid assemblies: (i) collapse of inter-plasmidic repeats and (ii) false loss of single-copy sequences. To demonstrate and validate how to detect and resolve these assembly errors, we use two genomes from the Bacillus cereus group. By applying the proposed pipeline following an automated assembly, the genome sequence quality can be significantly improved.
Affiliation(s)
- Lea Schäfer
- Julius Kühn Institute (JKI) – Federal Research Centre for Cultivated Plants, Institute for Biological Control, Schwabenheimer Str. 101, 69221 Dossenheim, Germany
| | - Johannes A. Jehle
- Julius Kühn Institute (JKI) – Federal Research Centre for Cultivated Plants, Institute for Biological Control, Schwabenheimer Str. 101, 69221 Dossenheim, Germany
| | - Regina G. Kleespies
- Julius Kühn Institute (JKI) – Federal Research Centre for Cultivated Plants, Institute for Biological Control, Schwabenheimer Str. 101, 69221 Dossenheim, Germany
| | - Jörg T. Wennmann
- Julius Kühn Institute (JKI) – Federal Research Centre for Cultivated Plants, Institute for Biological Control, Schwabenheimer Str. 101, 69221 Dossenheim, Germany
| |
|
29
|
Guo Y, Feng X, Li H. Evaluation of haplotype-aware long-read error correction with hifieval. Bioinformatics 2023; 39:btad631. [PMID: 37851384 PMCID: PMC10612404 DOI: 10.1093/bioinformatics/btad631] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/05/2023] [Revised: 09/18/2023] [Accepted: 10/17/2023] [Indexed: 10/19/2023] Open
Abstract
SUMMARY The PacBio High-Fidelity (HiFi) sequencing technology produces long reads with >99% accuracy. It has enabled the development of a new generation of de novo sequence assemblers, which all have sequencing error correction (EC) as the first step. As HiFi is a new data type, this critical step has not been evaluated before. Here, we introduced hifieval, a new command-line tool for measuring over- and under-corrections produced by EC algorithms. We assessed the accuracy of the EC components of existing HiFi assemblers on the CHM13 and the HG002 datasets and further investigated the performance of EC methods in challenging regions such as homopolymer regions, centromeric regions, and segmental duplications. Hifieval will help HiFi assemblers to improve EC and assembly quality in the long run. AVAILABILITY AND IMPLEMENTATION The source code is available at https://github.com/magspho/hifieval.
Affiliation(s)
- Yujie Guo
- Department of Data Science, Dana-Farber Cancer Institute, Boston, MA 02215, United States
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA 02215, United States
| | - Xiaowen Feng
- Department of Data Science, Dana-Farber Cancer Institute, Boston, MA 02215, United States
| | - Heng Li
- Department of Data Science, Dana-Farber Cancer Institute, Boston, MA 02215, United States
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA 02215, United States
| |
|
30
|
Lee C, Polo RO, Zaheer R, Van Domselaar G, Zovoilis A, McAllister TA. Evaluation of metagenomic assembly methods for the detection and characterization of antimicrobial resistance determinants and associated mobilizable elements. J Microbiol Methods 2023; 213:106815. [PMID: 37699502 DOI: 10.1016/j.mimet.2023.106815] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 07/11/2023] [Revised: 08/31/2023] [Accepted: 08/31/2023] [Indexed: 09/14/2023]
Abstract
Antimicrobial resistance genes (ARGs) can be transferred between members of a bacterial population by mobile genetic elements (MGE). Understanding the risk of these transfer events is important in monitoring and predicting antimicrobial resistance (AMR), especially in the context of a One Health Continuum. However, there is no universally accepted method for detection of ARGs and MGEs, and especially for determining their linkages. This study used publicly available shotgun metagenomic DNA short-read (Illumina, 100 bp paired-end) sequence data from samples across the One Health Continuum (including beef cattle composite feces from feedlots, catch basin water at feedlots, agricultural soil from feedlot manured surrounding fields, and urban/municipal sewage influent from two municipal wastewater treatment plants) to develop a workflow to identify and associate ARGs and MGEs. ARG- and MGE-based targeted-assemblies with available short-read data were unable to meet this analysis goal. In contrast, de novo assembly of contigs provided enough sequence context to associate ARGs and MGEs, without compromising discovery rate. However, to estimate the relative abundance of these elements, unassembled sequence data must still be used.
Affiliation(s)
- Catrione Lee
- Lethbridge Research and Development Centre, Agriculture and Agri-Food Canada, Government of Canada, 5403 1st Avenue South, Lethbridge, AB T1J 4B1, Canada; Department of Chemistry and Biochemistry, University of Lethbridge, 4401 University Drive West, Lethbridge, AB T3M 2L7, Canada
| | - Rodrigo Ortega Polo
- Lethbridge Research and Development Centre, Agriculture and Agri-Food Canada, Government of Canada, 5403 1st Avenue South, Lethbridge, AB T1J 4B1, Canada
| | - Rahat Zaheer
- Lethbridge Research and Development Centre, Agriculture and Agri-Food Canada, Government of Canada, 5403 1st Avenue South, Lethbridge, AB T1J 4B1, Canada
| | - Gary Van Domselaar
- National Microbiology Laboratory, Public Health Agency of Canada, Government of Canada, 1015 Arlington Street, Winnipeg, MB R3E 3R2, Canada
| | - Athanasios Zovoilis
- Department of Chemistry and Biochemistry, University of Lethbridge, 4401 University Drive West, Lethbridge, AB T3M 2L7, Canada
| | - Tim A McAllister
- Lethbridge Research and Development Centre, Agriculture and Agri-Food Canada, Government of Canada, 5403 1st Avenue South, Lethbridge, AB T1J 4B1, Canada.
| |
|
31
|
Wang J, Veldsman WP, Fang X, Huang Y, Xie X, Lyu A, Zhang L. Benchmarking multi-platform sequencing technologies for human genome assembly. Brief Bioinform 2023; 24:bbad300. [PMID: 37594299 DOI: 10.1093/bib/bbad300] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 02/25/2023] [Revised: 07/12/2023] [Accepted: 07/26/2023] [Indexed: 08/19/2023] Open
Abstract
Genome assembly is a computational technique that involves piecing together deoxyribonucleic acid (DNA) fragments generated by sequencing technologies to create a comprehensive and precise representation of the entire genome. Generating a high-quality human reference genome is a crucial prerequisite for comprehending human biology, and it is also vital for downstream genomic variation analysis. Many efforts have been made over the past few decades to create a complete and gapless reference genome for humans by using a diverse range of advanced sequencing technologies. Several available tools are aimed at enhancing the quality of haploid and diploid human genome assemblies, which include contig assembly, polishing of contig errors, scaffolding and variant phasing. Selecting the appropriate tools and technologies remains a daunting task, even though several studies have investigated the pros and cons of different assembly strategies. The goal of this paper was to benchmark various strategies for human genome assembly by combining sequencing technologies and tools on two publicly available samples (NA12878 and NA24385) from Genome in a Bottle. We then compared their performances in terms of continuity, accuracy, completeness, variant calling and phasing. We observed that PacBio HiFi long-reads are the optimal choice for generating an assembly with low base errors. On the other hand, we were able to produce the most continuous contigs with Oxford Nanopore long-reads, but they may require further polishing to improve on quality. We recommend using short-reads rather than long-reads themselves to improve the base accuracy of contigs from Oxford Nanopore long-reads. Hi-C is the best choice for chromosome-level scaffolding because it can capture the longest-range DNA connectedness compared to 10× linked-reads and Bionano optical maps. However, a combination of multiple technologies can be used to further improve the quality and completeness of genome assembly. For human diploid genome assembly, hifiasm is the best tool when PacBio HiFi and Hi-C data are used. Looking to the future, we expect that further advancements in human diploid assemblers will leverage the power of PacBio HiFi reads and other technologies with long-range DNA connectedness to enable the generation of high-quality, chromosome-level and haplotype-resolved human genome assemblies.
Affiliation(s)
- Jingjing Wang
- Department of Computer Science, Hong Kong Baptist University, Kowloon Tong, Hong Kong, China
| | - Werner Pieter Veldsman
- Department of Computer Science, Hong Kong Baptist University, Kowloon Tong, Hong Kong, China
| | | | | | | | - Aiping Lyu
- School of Chinese Medicine, Hong Kong Baptist University, Kowloon Tong, Hong Kong, China
| | - Lu Zhang
- Department of Computer Science, Hong Kong Baptist University, Kowloon Tong, Hong Kong, China
- Institute for Research and Continuing Education, Hong Kong Baptist University, Shenzhen, China
| |
|
32
|
van Dijk EL, Naquin D, Gorrichon K, Jaszczyszyn Y, Ouazahrou R, Thermes C, Hernandez C. Genomics in the long-read sequencing era. Trends Genet 2023; 39:649-671. [PMID: 37230864 DOI: 10.1016/j.tig.2023.04.006] [Citation(s) in RCA: 53] [Impact Index Per Article: 26.5] [Received: 02/10/2023] [Revised: 04/21/2023] [Accepted: 04/25/2023] [Indexed: 05/27/2023]
Abstract
Long-read sequencing (LRS) technologies have provided extremely powerful tools to explore genomes. While in the early years these methods suffered technical limitations, they have recently made significant progress in terms of read length, throughput, and accuracy, and bioinformatics tools have strongly improved. Here, we aim to review the current status of LRS technologies, the development of novel methods, and the impact on genomics research. We will explore the most impactful recent findings made possible by these technologies, focusing on high-resolution sequencing of genomes and transcriptomes and the direct detection of DNA and RNA modifications. We will also discuss how LRS methods promise a more comprehensive understanding of human genetic variation, transcriptomics, and epigenetics for the coming years.
Affiliation(s)
- Erwin L van Dijk
- Université Paris-Saclay, CEA, CNRS, Institute for Integrative Biology of the Cell (I2BC), 91198, Gif-sur-Yvette, France.
| | - Delphine Naquin
- Université Paris-Saclay, CEA, CNRS, Institute for Integrative Biology of the Cell (I2BC), 91198, Gif-sur-Yvette, France
| | - Kévin Gorrichon
- National Center of Human Genomics Research (CNRGH), 91000 Évry-Courcouronnes, France
| | - Yan Jaszczyszyn
- Université Paris-Saclay, CEA, CNRS, Institute for Integrative Biology of the Cell (I2BC), 91198, Gif-sur-Yvette, France
| | - Rania Ouazahrou
- Université Paris-Saclay, CEA, CNRS, Institute for Integrative Biology of the Cell (I2BC), 91198, Gif-sur-Yvette, France
| | - Claude Thermes
- Université Paris-Saclay, CEA, CNRS, Institute for Integrative Biology of the Cell (I2BC), 91198, Gif-sur-Yvette, France
| | - Céline Hernandez
- Université Paris-Saclay, CEA, CNRS, Institute for Integrative Biology of the Cell (I2BC), 91198, Gif-sur-Yvette, France
| |
|
33
|
Ojala T, Häkkinen AE, Kankuri E, Kankainen M. Current concepts, advances, and challenges in deciphering the human microbiota with metatranscriptomics. Trends Genet 2023; 39:686-702. [PMID: 37365103 DOI: 10.1016/j.tig.2023.05.004] [Citation(s) in RCA: 16] [Impact Index Per Article: 8.0] [Received: 12/02/2022] [Revised: 05/24/2023] [Accepted: 05/25/2023] [Indexed: 06/28/2023]
Abstract
Metatranscriptomics refers to the analysis of the collective microbial transcriptome of a sample. Its increased utilization for the characterization of human-associated microbial communities has enabled the discovery of many disease-state related microbial activities. Here, we review the principles of metatranscriptomics-based analysis of human-associated microbial samples. We describe strengths and weaknesses of popular sample preparation, sequencing, and bioinformatics approaches and summarize strategies for their use. We then discuss how human-associated microbial communities have recently been examined and how their characterization may change. We conclude that metatranscriptomics insights into human microbiotas under health and disease have not only expanded our knowledge on human health, but also opened avenues for rational antimicrobial drug use and disease management.
Affiliation(s)
- Teija Ojala
- Department of Pharmacology, Faculty of Medicine, University of Helsinki, Helsinki, Finland
| | | | - Esko Kankuri
- Department of Pharmacology, Faculty of Medicine, University of Helsinki, Helsinki, Finland
| | - Matti Kankainen
- Hematology Research Unit, University of Helsinki, Helsinki, Finland; Laboratory of Genetics, HUS Diagnostic Center, Hospital District of Helsinki and Uusimaa (HUS), Helsinki, Finland.
| |
|
34
|
Ruiz JL, Reimering S, Escobar-Prieto JD, Brancucci NMB, Echeverry DF, Abdi AI, Marti M, Gómez-Díaz E, Otto TD. From contigs towards chromosomes: automatic improvement of long read assemblies (ILRA). Brief Bioinform 2023; 24:bbad248. [PMID: 37406192 PMCID: PMC10359078 DOI: 10.1093/bib/bbad248] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 02/10/2023] [Revised: 05/24/2023] [Accepted: 06/16/2023] [Indexed: 07/07/2023] Open
Abstract
Recent advances in long read technologies not only enable large consortia to aim to sequence all eukaryotes on Earth, but they also allow individual laboratories to sequence their species of interest with relatively low investment. Long read technologies embody the promise of overcoming scaffolding problems associated with repeats and low complexity sequences, but the number of contigs often far exceeds the number of chromosomes and they may contain many insertion and deletion errors around homopolymer tracts. To overcome these issues, we have implemented the ILRA pipeline to correct long read-based assemblies. Contigs are first reordered, renamed, merged, circularized, or filtered if erroneous or contaminated. Illumina short reads are used subsequently to correct homopolymer errors. We successfully tested our approach by improving the genome sequences of Homo sapiens, Trypanosoma brucei, and Leptosphaeria spp., and by generating four novel Plasmodium falciparum assemblies from field samples. We found that correcting homopolymer tracts reduced the number of genes incorrectly annotated as pseudogenes, but an iterative approach seems to be required to correct more sequencing errors. In summary, we describe and benchmark the performance of our new tool, which improved the quality of novel long read assemblies up to 1 Gbp. The pipeline is available at GitHub: https://github.com/ThomasDOtto/ILRA.
Affiliation(s)
- José Luis Ruiz
- Instituto de Parasitología y Biomedicina López-Neyra (IPBLN), Consejo Superior de Investigaciones Científicas, 18016, Granada, Spain
| | - Susanne Reimering
- Department for Computational Biology of Infection Research, Helmholtz Centre for Infection Research, Braunschweig, Germany
| | | | - Nicolas M B Brancucci
- School of Infection & Immunity, MVLS, University of Glasgow, Glasgow, UK
- Department of Medical Parasitology and Infection Biology, Swiss Tropical and Public Health Institute, 4123 Allschwil, Switzerland
- University of Basel, 4001 Basel, Switzerland
| | - Diego F Echeverry
- Centro Internacional de Entrenamiento e Investigaciones Médicas (CIDEIM), Cali, Colombia
- Departamento de Microbiología, Facultad de Salud, Universidad del Valle, Cali, Colombia
| | | | - Matthias Marti
- School of Infection & Immunity, MVLS, University of Glasgow, Glasgow, UK
| | - Elena Gómez-Díaz
- Instituto de Parasitología y Biomedicina López-Neyra (IPBLN), Consejo Superior de Investigaciones Científicas, 18016, Granada, Spain
| | - Thomas D Otto
- School of Infection & Immunity, MVLS, University of Glasgow, Glasgow, UK
| |
|
35
|
Karikari B, Lemay MA, Belzile F. k-mer-Based Genome-Wide Association Studies in Plants: Advances, Challenges, and Perspectives. Genes (Basel) 2023; 14:1439. [PMID: 37510343 PMCID: PMC10379394 DOI: 10.3390/genes14071439] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/13/2023] [Revised: 07/04/2023] [Accepted: 07/07/2023] [Indexed: 07/30/2023] Open
Abstract
Genome-wide association studies (GWAS) have allowed the discovery of marker-trait associations in crops over recent decades. However, their power is hampered by a number of limitations, with the key one among them being an overreliance on single-nucleotide polymorphisms (SNPs) as molecular markers. Indeed, SNPs represent only one type of genetic variation and are usually derived from alignment to a single genome assembly that may be poorly representative of the population under study. To overcome this, k-mer-based GWAS approaches have recently been developed. k-mer-based GWAS provide a universal way to assess variation due to SNPs, insertions/deletions, and structural variations without having to specifically detect and genotype these variants. In addition, k-mer-based analyses can be used in species that lack a reference genome. However, the use of k-mers for GWAS presents challenges such as data size and complexity, lack of standard tools, and potential detection of false associations. Nevertheless, efforts are being made to overcome these challenges and a general analysis workflow has started to emerge. We identify the priorities for k-mer-based GWAS in years to come, notably in the development of user-friendly programs for their analysis and approaches for linking significant k-mers to sequence variation.
Affiliation(s)
- Benjamin Karikari
- Département de Phytologie, Université Laval, Quebec City, QC G1V 0A6, Canada
- Institut de Biologie Intégrative et des Systèmes (IBIS), Université Laval, Quebec City, QC G1V 0A6, Canada
- Department of Agricultural Biotechnology, Faculty of Agriculture, Food and Consumer Sciences, University for Development Studies, Tamale P.O. Box TL 1882, Ghana
| | - Marc-André Lemay
- Département de Phytologie, Université Laval, Quebec City, QC G1V 0A6, Canada
- Institut de Biologie Intégrative et des Systèmes (IBIS), Université Laval, Quebec City, QC G1V 0A6, Canada
| | - François Belzile
- Département de Phytologie, Université Laval, Quebec City, QC G1V 0A6, Canada
- Institut de Biologie Intégrative et des Systèmes (IBIS), Université Laval, Quebec City, QC G1V 0A6, Canada
| |
|
36
|
Gable SM, Mendez JM, Bushroe NA, Wilson A, Byars MI, Tollis M. The State of Squamate Genomics: Past, Present, and Future of Genome Research in the Most Speciose Terrestrial Vertebrate Order. Genes (Basel) 2023; 14:1387. [PMID: 37510292 PMCID: PMC10379679 DOI: 10.3390/genes14071387] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 06/06/2023] [Revised: 06/28/2023] [Accepted: 06/29/2023] [Indexed: 07/30/2023] Open
Abstract
Squamates include more than 11,000 extant species of lizards, snakes, and amphisbaenians, and display a dazzling diversity of phenotypes across their over 200-million-year evolutionary history on Earth. Here, we introduce and define squamates (Order Squamata) and review the history and promise of genomic investigations into the patterns and processes governing squamate evolution, given recent technological advances in DNA sequencing, genome assembly, and evolutionary analysis. We survey the most recently available whole genome assemblies for squamates, including the taxonomic distribution of available squamate genomes, and assess their quality metrics and usefulness for research. We then focus on disagreements in squamate phylogenetic inference, how methods of high-throughput phylogenomics affect these inferences, and demonstrate the promise of whole genomes to settle or sustain persistent phylogenetic arguments for squamates. We review the role transposable elements play in vertebrate evolution, methods of transposable element annotation and analysis, and further demonstrate that through the understanding of the diversity, abundance, and activity of transposable elements in squamate genomes, squamates can be an ideal model for the evolution of genome size and structure in vertebrates. We discuss how squamate genomes can contribute to other areas of biological research such as venom systems, studies of phenotypic evolution, and sex determination. Because they represent more than 30% of the living species of amniote, squamates deserve a genome consortium on par with recent efforts for other amniotes (i.e., mammals and birds) that aim to sequence most of the extant families in a clade.
Affiliation(s)
- Simone M Gable
- School of Informatics, Computing, and Cyber Systems, Northern Arizona University, Flagstaff, AZ 86011, USA
| | - Jasmine M Mendez
- School of Informatics, Computing, and Cyber Systems, Northern Arizona University, Flagstaff, AZ 86011, USA
| | - Nicholas A Bushroe
- School of Informatics, Computing, and Cyber Systems, Northern Arizona University, Flagstaff, AZ 86011, USA
| | - Adam Wilson
- School of Informatics, Computing, and Cyber Systems, Northern Arizona University, Flagstaff, AZ 86011, USA
| | - Michael I Byars
- School of Informatics, Computing, and Cyber Systems, Northern Arizona University, Flagstaff, AZ 86011, USA
| | - Marc Tollis
- School of Informatics, Computing, and Cyber Systems, Northern Arizona University, Flagstaff, AZ 86011, USA
| |
|
37
|
Boßelmann CM, Leu C, Lal D. Technological and computational approaches to detect somatic mosaicism in epilepsy. Neurobiol Dis 2023:106208. [PMID: 37343892 DOI: 10.1016/j.nbd.2023.106208] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/05/2023] [Revised: 06/03/2023] [Accepted: 06/16/2023] [Indexed: 06/23/2023] Open
Abstract
Lesional epilepsy is a common and severe disease frequently associated with malformations of cortical development, including focal cortical dysplasia and hemimegalencephaly. Recent advances in sequencing and variant calling technologies have identified several genetic causes, including both short/single nucleotide and structural somatic variation. In this review, we aim to provide a comprehensive overview of the methodological advancements in this field while highlighting the unresolved technological and computational challenges that persist, including ultra-low variant allele fractions in bulk tissue, low availability of paired control samples, spatial variability of mutational burden within the lesion, and the issue of false-positive calls and validation procedures. Information from genetic testing in focal epilepsy may be integrated into clinical care to inform histopathological diagnosis, postoperative prognosis, and candidate precision therapies.
Affiliation(s)
- Christian M Boßelmann
- Genomic Medicine Institute, Lerner Research Institute, Cleveland Clinic, Cleveland, OH, USA; Epilepsy Center, Neurological Institute, Cleveland Clinic, Cleveland, OH, USA
| | - Costin Leu
- Genomic Medicine Institute, Lerner Research Institute, Cleveland Clinic, Cleveland, OH, USA; Department of Clinical and Experimental Epilepsy, Institute of Neurology, University College London, London, UK.
| | - Dennis Lal
- Genomic Medicine Institute, Lerner Research Institute, Cleveland Clinic, Cleveland, OH, USA; Epilepsy Center, Neurological Institute, Cleveland Clinic, Cleveland, OH, USA; Stanley Center for Psychiatric Research, Broad Institute of Harvard and M.I.T., Cambridge, MA, USA; Cologne Center for Genomics (CCG), University of Cologne, Cologne, Germany
| |
|
38
|
Mastrorosa FK, Miller DE, Eichler EE. Applications of long-read sequencing to Mendelian genetics. Genome Med 2023; 15:42. [PMID: 37316925 PMCID: PMC10266321 DOI: 10.1186/s13073-023-01194-3] [Citation(s) in RCA: 36] [Impact Index Per Article: 18.0] [Received: 12/08/2022] [Accepted: 05/18/2023] [Indexed: 06/16/2023] Open
Abstract
Advances in clinical genetic testing, including the introduction of exome sequencing, have uncovered the molecular etiology for many rare and previously unsolved genetic disorders, yet more than half of individuals with a suspected genetic disorder remain unsolved after complete clinical evaluation. A precise genetic diagnosis may guide clinical treatment plans, allow families to make informed care decisions, and permit individuals to participate in N-of-1 trials; thus, there is high interest in developing new tools and techniques to increase the solve rate. Long-read sequencing (LRS) is a promising technology for both increasing the solve rate and decreasing the amount of time required to make a precise genetic diagnosis. Here, we summarize current LRS technologies, give examples of how they have been used to evaluate complex genetic variation and identify missing variants, and discuss future clinical applications of LRS. As costs continue to decrease, LRS will find additional utility in the clinical space, fundamentally changing how pathological variants are discovered and eventually acting as a single data source that can be interrogated multiple times for clinical service.
Affiliation(s)
| | - Danny E Miller
- Division of Genetic Medicine, Department of Pediatrics, University of Washington and Seattle Children's Hospital, Seattle, WA, 98195, USA
- Department of Laboratory Medicine and Pathology, University of Washington, Seattle, WA, 98195, USA
- Brotman Baty Institute for Precision Medicine, University of Washington, Seattle, WA, 98195, USA
| | - Evan E Eichler
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, 98195, USA.
- Howard Hughes Medical Institute, University of Washington, Seattle, WA, 98195, USA.
| |
|
39
|
Ezoe A, Iuchi S, Sakurai T, Aso Y, Tokunaga H, Vu AT, Utsumi Y, Takahashi S, Tanaka M, Ishida J, Ishitani M, Seki M. Fully sequencing the cassava full-length cDNA library reveals unannotated transcript structures and alternative splicing events in regions with a high density of single nucleotide variations, insertions-deletions, and heterozygous sequences. Plant Mol Biol 2023; 112:33-45. [PMID: 37014509 DOI: 10.1007/s11103-023-01346-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/10/2022] [Accepted: 02/27/2023] [Indexed: 05/09/2023]
Abstract
The primary transcript structure provides critical insights into protein diversity, transcriptional modification, and functions. Cassava transcript structures are highly diverse because of alternative splicing (AS) events and high heterozygosity. To precisely determine and characterize transcript structures, fully sequencing cloned transcripts is the most reliable method. However, cassava annotations were mainly determined according to fragmentation-based sequencing analyses (e.g., EST and short-read RNA-seq). In this study, we sequenced the cassava full-length cDNA library, which included rare transcripts. We obtained 8,628 non-redundant fully sequenced transcripts and detected 615 unannotated AS events and 421 unannotated loci. The different protein sequences resulting from the unannotated AS events tended to have diverse functional domains, implying that unannotated AS contributes to the truncation of functional domains. The unannotated loci tended to be derived from orphan genes, implying that the loci may be associated with cassava-specific traits. Unexpectedly, individual cassava transcripts were more likely to have multiple AS events than Arabidopsis transcripts, suggestive of the regulated interactions between cassava splicing-related complexes. We also observed that the unannotated loci and/or AS events were commonly in regions with abundant single nucleotide variations, insertions-deletions, and heterozygous sequences. These findings reflect the utility of completely sequenced FLcDNA clones for overcoming cassava-specific annotation-related problems to elucidate transcript structures. Our work provides researchers with transcript structural details that are useful for annotating highly diverse and unique transcripts and alternative splicing events.
Affiliation(s)
- Akihiro Ezoe
  - Plant Genomic Network Research Team, RIKEN Center for Sustainable Resource Science, Yokohama, Kanagawa, 230-0045, Japan
- Satoshi Iuchi
  - Experimental Plant Division, RIKEN BioResource Research Center, Tsukuba, Ibaraki, 305-0074, Japan
- Tetsuya Sakurai
  - Multidisciplinary Science Cluster, Interdisciplinary Science Unit, Kochi University, Nankoku, Kochi, 783-8502, Japan
- Yukie Aso
  - Experimental Plant Division, RIKEN BioResource Research Center, Tsukuba, Ibaraki, 305-0074, Japan
- Hiroki Tokunaga
  - Plant Genomic Network Research Team, RIKEN Center for Sustainable Resource Science, Yokohama, Kanagawa, 230-0045, Japan
  - Tropical Agriculture Research Front, Japan International Research Center for Agricultural Sciences, Ishigaki, Okinawa, 907-0002, Japan
- Anh Thu Vu
  - Plant Genomic Network Research Team, RIKEN Center for Sustainable Resource Science, Yokohama, Kanagawa, 230-0045, Japan
- Yoshinori Utsumi
  - Plant Genomic Network Research Team, RIKEN Center for Sustainable Resource Science, Yokohama, Kanagawa, 230-0045, Japan
- Satoshi Takahashi
  - Plant Genomic Network Research Team, RIKEN Center for Sustainable Resource Science, Yokohama, Kanagawa, 230-0045, Japan
  - Plant Epigenome Regulation Laboratory, RIKEN Cluster for Pioneering Research, 2-1 Hirosawa, Wako, Saitama, 351-0198, Japan
- Maho Tanaka
  - Plant Genomic Network Research Team, RIKEN Center for Sustainable Resource Science, Yokohama, Kanagawa, 230-0045, Japan
  - Plant Epigenome Regulation Laboratory, RIKEN Cluster for Pioneering Research, 2-1 Hirosawa, Wako, Saitama, 351-0198, Japan
- Junko Ishida
  - Plant Genomic Network Research Team, RIKEN Center for Sustainable Resource Science, Yokohama, Kanagawa, 230-0045, Japan
  - Plant Epigenome Regulation Laboratory, RIKEN Cluster for Pioneering Research, 2-1 Hirosawa, Wako, Saitama, 351-0198, Japan
- Manabu Ishitani
  - International Center for Tropical Agriculture (CIAT), Km 17, Recta Cali-Palmira Apartado Aéreo 6713, Cali, Colombia
- Motoaki Seki
  - Plant Genomic Network Research Team, RIKEN Center for Sustainable Resource Science, Yokohama, Kanagawa, 230-0045, Japan
  - Plant Epigenome Regulation Laboratory, RIKEN Cluster for Pioneering Research, 2-1 Hirosawa, Wako, Saitama, 351-0198, Japan
  - Kihara Institute for Biological Research, Yokohama City University, 641-12 Maioka-cho, Totsuka-ku, Yokohama, Kanagawa, 244-0813, Japan
|
40
|
Mejias-Gomez O, Madsen AV, Pedersen LE, Kristensen P, Goletz S. Eliminating OFF-frame clones in randomized gene libraries: An improved split β-lactamase enrichment system. N Biotechnol 2023; 75:13-20. [PMID: 36889578 DOI: 10.1016/j.nbt.2023.03.002] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/30/2023] [Revised: 02/20/2023] [Accepted: 03/04/2023] [Indexed: 03/08/2023]
Abstract
Large, randomized libraries are a key technology for many biotechnological applications. While genetic diversity is the main parameter most libraries direct their resources on, less focus is devoted to ensuring functional IN-frame expression. This study describes a faster and more efficient system based on a split β-lactamase complementation for removal of OFF-frame clones and increase of functional diversity, suitable for construction of randomized libraries. The gene of interest is inserted between two fragments of the β-lactamase gene, conferring resistance to β-lactam drugs only upon expression of an inserted IN-frame gene without stop codons or frameshifts. The preinduction-free system was capable of eliminating OFF-frame clones in starting mixtures of as little as 1% IN-frame clones and enriching to about 70% IN-frame clones, even when their starting rate was as low as 0.001%. The curation system was verified by constructing a single-domain antibody phage display library using trinucleotide phosphoramidites for randomizing a complementary determining region, while eliminating OFF-frame clones and maximizing functional diversity.
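As an illustrative aside, the selection criterion described in this abstract (a clone survives only if its insert is IN-frame, with no frameshift and no stop codon) can be mimicked in silico. The following is a minimal Python sketch under stated assumptions and is not the authors' method, which is a wet-lab β-lactamase complementation. The function name `is_in_frame` is hypothetical, and the check assumes the standard genetic code stop codons TAA, TAG and TGA read from the first base of the insert.

```python
# Hypothetical in silico analogue of IN-frame selection (illustration only).
# Assumption: the insert is read from its first base, standard stop codons.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def is_in_frame(insert: str) -> bool:
    """Return True if the insert keeps the downstream reading frame intact.

    An insert passes when its length is a non-zero multiple of three
    (no frameshift) and none of its in-frame codons is a stop codon.
    """
    seq = insert.upper()
    if len(seq) == 0 or len(seq) % 3 != 0:
        return False  # a frameshift would disrupt the downstream fragment
    codons = (seq[i:i + 3] for i in range(0, len(seq), 3))
    return not any(codon in STOP_CODONS for codon in codons)


if __name__ == "__main__":
    library = ["ATGGCTGCA", "ATGGCTGC", "ATGTAAGCA"]
    kept = [clone for clone in library if is_in_frame(clone)]
    print(kept)
```

In this sketch a stop triplet that occurs out of frame (for example the `TAA` spanning two codons in `ATGTTAAGC`) does not cause rejection, which mirrors the idea that only in-frame stops interrupt expression.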
Affiliation(s)
- Oscar Mejias-Gomez
  - Department of Biotechnology and Biomedicine, Section for Protein Science and Biotherapeutics, Technical University of Denmark, Kongens Lyngby, Denmark
- Andreas V Madsen
  - Department of Biotechnology and Biomedicine, Section for Protein Science and Biotherapeutics, Technical University of Denmark, Kongens Lyngby, Denmark
- Lasse E Pedersen
  - Department of Biotechnology and Biomedicine, Section for Protein Science and Biotherapeutics, Technical University of Denmark, Kongens Lyngby, Denmark
- Peter Kristensen
  - Department of Chemistry and Bioscience, Section for Bioscience and Engineering, Aalborg University, Aalborg, Denmark
- Steffen Goletz
  - Department of Biotechnology and Biomedicine, Section for Protein Science and Biotherapeutics, Technical University of Denmark, Kongens Lyngby, Denmark
|
41
|
Mak QXC, Wick RR, Holt JM, Wang JR. Polishing De Novo Nanopore Assemblies of Bacteria and Eukaryotes With FMLRC2. Mol Biol Evol 2023; 40:7069220. [PMID: 36869750 PMCID: PMC10015616 DOI: 10.1093/molbev/msad048] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Received: 10/14/2022] [Revised: 01/20/2023] [Accepted: 02/21/2023] [Indexed: 03/05/2023] Open
Abstract
As the accuracy and throughput of nanopore sequencing improve, it is increasingly common to perform long-read first de novo genome assemblies followed by polishing with accurate short reads. We briefly introduce FMLRC2, the successor to the original FM-index Long Read Corrector (FMLRC), and illustrate its performance as a fast and accurate de novo assembly polisher for both bacterial and eukaryotic genomes.
Affiliation(s)
- Q X Charles Mak
  - Department of Biological Sciences, National University of Singapore, Singapore, Singapore
- Ryan R Wick
  - Centre for Pathogen Genomics, University of Melbourne, Melbourne, Australia
- Jeremy R Wang
  - Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
|
42
|
Firtina C, Park J, Alser M, Kim JS, Cali D, Shahroodi T, Ghiasi N, Singh G, Kanellopoulos K, Alkan C, Mutlu O. BLEND: a fast, memory-efficient and accurate mechanism to find fuzzy seed matches in genome analysis. NAR Genom Bioinform 2023; 5:lqad004. [PMID: 36685727 PMCID: PMC9853099 DOI: 10.1093/nargab/lqad004] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/29/2022] [Revised: 12/16/2022] [Accepted: 01/10/2023] [Indexed: 01/22/2023] Open
Abstract
Generating the hash values of short subsequences, called seeds, enables quickly identifying similarities between genomic sequences by matching seeds with a single lookup of their hash values. However, these hash values can be used only for finding exact-matching seeds as the conventional hashing methods assign distinct hash values for different seeds, including highly similar seeds. Finding only exact-matching seeds causes either (i) increasing the use of the costly sequence alignment or (ii) limited sensitivity. We introduce BLEND, the first efficient and accurate mechanism that can identify both exact-matching and highly similar seeds with a single lookup of their hash values, called fuzzy seed matches. BLEND (i) utilizes a technique called SimHash, that can generate the same hash value for similar sets, and (ii) provides the proper mechanisms for using seeds as sets with the SimHash technique to find fuzzy seed matches efficiently. We show the benefits of BLEND when used in read overlapping and read mapping. For read overlapping, BLEND is faster by 2.4×-83.9× (on average 19.3×), has a lower memory footprint by 0.9×-14.1× (on average 3.8×), and finds higher quality overlaps leading to accurate de novo assemblies than the state-of-the-art tool, minimap2. For read mapping, BLEND is faster by 0.8×-4.1× (on average 1.7×) than minimap2. Source code is available at https://github.com/CMU-SAFARI/BLEND.
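To make the SimHash idea in this abstract concrete, the following is a minimal Python sketch of a generic SimHash over the k-mers of a seed. It is an illustration under assumptions, not BLEND's implementation (see the repository linked above for that): the helper names `kmers`, `hash64`, `simhash` and `hamming`, the choice of BLAKE2b as the per-item hash, and the use of overlapping k-mers as the items of a seed are all hypothetical choices made here.

```python
import hashlib


def kmers(seq: str, k: int) -> list[str]:
    """Split a seed into its overlapping k-mers (the items to be hashed)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def hash64(item: str) -> int:
    """Ordinary 64-bit hash of one item; distinct items get unrelated values."""
    digest = hashlib.blake2b(item.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def simhash(items: list[str], bits: int = 64) -> int:
    """Combine per-item hashes into one value by a bitwise majority vote.

    Each item votes +1 or -1 on every bit position; the output bit is 1
    where the votes sum to a positive number. Collections that share most
    of their items therefore tend to agree on most output bits.
    """
    counts = [0] * bits
    for item in items:
        h = hash64(item)
        for b in range(bits):
            counts[b] += 1 if (h >> b) & 1 else -1
    value = 0
    for b in range(bits):
        if counts[b] > 0:
            value |= 1 << b
    return value


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two hash values differ."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    seed_a = "ACGTACGTTGCAACGGTACGATTACA"
    seed_b = "ACGTACGTTGCAACGGTACGATTACC"  # differs from seed_a in the last base
    ha = simhash(kmers(seed_a, 8))
    hb = simhash(kmers(seed_b, 8))
    print(f"{ha:016x} {hb:016x} differing bits: {hamming(ha, hb)}")
```

With a conventional hash, the two seeds above would receive unrelated values; with a majority-vote hash, the differing bits are expected to be few because only the k-mers covering the changed base vote differently. Whether two similar seeds collide exactly depends on how many items they share and on the number of output bits, which is a design trade-off rather than a guarantee.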
Affiliation(s)
- Jisung Park
  - ETH Zurich, Zurich 8092, Switzerland
  - POSTECH, Pohang 37673, Republic of Korea
- Can Alkan
  - Bilkent University, Ankara 06800, Turkey
|
43
|
Liang C, Wagstaff J, Aharony N, Schmit V, Manheim D. Managing the Transition to Widespread Metagenomic Monitoring: Policy Considerations for Future Biosurveillance. Health Secur 2023; 21:34-45. [PMID: 36629860 PMCID: PMC9940815 DOI: 10.1089/hs.2022.0029] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 01/12/2023] Open
Abstract
The technological possibilities and future public health importance of metagenomic sequencing have received extensive attention, but there has been little discussion about the policy and regulatory issues that need to be addressed if metagenomic sequencing is adopted as a key technology for biosurveillance. In this article, we introduce metagenomic monitoring as a possible path to eventually replacing current infectious disease monitoring models. Many key enablers are technological, whereas others are not. We therefore highlight key policy challenges and implementation questions that need to be addressed for "widespread metagenomic monitoring" to be possible. Policymakers must address pitfalls like fragmentation of the technological base, private capture of benefits, privacy concerns, the usefulness of the system during nonpandemic times, and how the future systems will enable better response. If these challenges are addressed, the technological and public health promise of metagenomic sequencing can be realized.
Affiliation(s)
- Chelsea Liang
  - Chelsea Liang is an Independent Researcher, University of New South Wales, School of Biotechnology and Biomolecular Sciences, Sydney, Australia
- James Wagstaff
  - James Wagstaff, PhD, is a Research Fellow, Future of Humanity Institute, University of Oxford, Oxford, UK
- Noga Aharony
  - Noga Aharony, MS, is a PhD Student, Department of Systems Biology, Columbia University, New York, NY
- Virginia Schmit
  - Virginia Schmit, PhD, is Director of Research, 1DaySooner, DE, and a Policy Specialist, National Institute of Allergy and Infectious Diseases, Bethesda, MD
- David Manheim
  - David Manheim, PhD, is Head of Policy and Research, ALTER, Rehovot, Israel; Lead Researcher, 1DaySooner, Claymont, DE; and Visiting Researcher, Humanities and Arts Department, Technion – Israel Institute of Technology, Haifa, Israel. Address correspondence to: David B. Manheim, 8734 First Avenue, Silver Spring, MD 20910
|
44
|
Nguyen TV, Vander Jagt CJ, Wang J, Daetwyler HD, Xiang R, Goddard ME, Nguyen LT, Ross EM, Hayes BJ, Chamberlain AJ, MacLeod IM. In it for the long run: perspectives on exploiting long-read sequencing in livestock for population scale studies of structural variants. Genet Sel Evol 2023; 55:9. [PMID: 36721111 PMCID: PMC9887926 DOI: 10.1186/s12711-023-00783-5] [Citation(s) in RCA: 11] [Impact Index Per Article: 5.5] [Received: 09/27/2022] [Accepted: 01/23/2023] [Indexed: 02/02/2023] Open
Abstract
Studies have demonstrated that structural variants (SV) play a substantial role in the evolution of species and have an impact on Mendelian traits in the genome. However, unlike small variants (< 50 bp), it has been challenging to accurately identify and genotype SV at the population scale using short-read sequencing. Long-read sequencing technologies are becoming competitively priced and can address several of the disadvantages of short-read sequencing for the discovery and genotyping of SV. In livestock species, analysis of SV at the population scale still faces challenges due to the lack of resources, high costs, technological barriers, and computational limitations. In this review, we summarize recent progress in the characterization of SV in the major livestock species, the obstacles that still need to be overcome, as well as the future directions in this growing field. It seems timely that research communities pool resources to build global population-scale long-read sequencing consortiums for the major livestock species for which the application of genomic tools has become cost-effective.
Affiliation(s)
- Tuan V. Nguyen
  - Agriculture Victoria, AgriBio, Centre for AgriBioscience, Bundoora, VIC 3083 Australia
- Jianghui Wang
  - Agriculture Victoria, AgriBio, Centre for AgriBioscience, Bundoora, VIC 3083 Australia
- Hans D. Daetwyler
  - Agriculture Victoria, AgriBio, Centre for AgriBioscience, Bundoora, VIC 3083 Australia
  - School of Applied Systems Biology, La Trobe University, Bundoora, VIC 3083 Australia
- Ruidong Xiang
  - Agriculture Victoria, AgriBio, Centre for AgriBioscience, Bundoora, VIC 3083 Australia
  - Faculty of Veterinary & Agricultural Science, The University of Melbourne, Parkville, VIC 3052 Australia
- Michael E. Goddard
  - Agriculture Victoria, AgriBio, Centre for AgriBioscience, Bundoora, VIC 3083 Australia
  - Faculty of Veterinary & Agricultural Science, The University of Melbourne, Parkville, VIC 3052 Australia
- Loan T. Nguyen
  - Queensland Alliance for Agriculture and Food Innovation, University of Queensland, St Lucia, QLD 4072 Australia
- Elizabeth M. Ross
  - Queensland Alliance for Agriculture and Food Innovation, University of Queensland, St Lucia, QLD 4072 Australia
- Ben J. Hayes
  - Queensland Alliance for Agriculture and Food Innovation, University of Queensland, St Lucia, QLD 4072 Australia
- Amanda J. Chamberlain
  - Agriculture Victoria, AgriBio, Centre for AgriBioscience, Bundoora, VIC 3083 Australia
  - School of Applied Systems Biology, La Trobe University, Bundoora, VIC 3083 Australia
- Iona M. MacLeod
  - Agriculture Victoria, AgriBio, Centre for AgriBioscience, Bundoora, VIC 3083 Australia
|
45
|
Hassan S, Bahar R, Johan MF, Mohamed Hashim EK, Abdullah WZ, Esa E, Abdul Hamid FS, Zulkafli Z. Next-Generation Sequencing (NGS) and Third-Generation Sequencing (TGS) for the Diagnosis of Thalassemia. Diagnostics (Basel) 2023; 13:diagnostics13030373. [PMID: 36766477 PMCID: PMC9914462 DOI: 10.3390/diagnostics13030373] [Citation(s) in RCA: 19] [Impact Index Per Article: 9.5] [Received: 12/07/2022] [Revised: 01/11/2023] [Accepted: 01/16/2023] [Indexed: 01/20/2023] Open
Abstract
Thalassemia is one of the most heterogeneous diseases, with more than a thousand mutation types recorded worldwide. Molecular diagnosis of thalassemia by conventional PCR-based DNA analysis is time- and resource-consuming owing to the phenotype variability, disease complexity, and molecular diagnostic test limitations. Moreover, genetic counseling must be backed-up by an extensive diagnosis of the thalassemia-causing phenotype and the possible genetic modifiers. Data coming from advanced molecular techniques such as targeted sequencing by next-generation sequencing (NGS) and third-generation sequencing (TGS) are more appropriate and valuable for DNA analysis of thalassemia. While NGS is superior at variant calling to TGS thanks to its lower error rates, the longer reads nature of the TGS permits haplotype-phasing that is superior for variant discovery on the homologous genes and CNV calling. The emergence of many cutting-edge machine learning-based bioinformatics tools has improved the accuracy of variant and CNV calling. Constant improvement of these sequencing and bioinformatics will enable precise thalassemia detections, especially for the CNV and the homologous HBA and HBG genes. In conclusion, laboratory transiting from conventional DNA analysis to NGS or TGS and following the guidelines towards a single assay will contribute to a better diagnostics approach of thalassemia.
Affiliation(s)
- Syahzuwan Hassan
  - Department of Hematology, School of Medical Sciences, Health Campus, Universiti Sains Malaysia, Kubang Kerian 16150, Malaysia
  - Institute for Medical Research, Shah Alam 40170, Malaysia
- Rosnah Bahar
  - Department of Hematology, School of Medical Sciences, Health Campus, Universiti Sains Malaysia, Kubang Kerian 16150, Malaysia
- Muhammad Farid Johan
  - Department of Hematology, School of Medical Sciences, Health Campus, Universiti Sains Malaysia, Kubang Kerian 16150, Malaysia
- Wan Zaidah Abdullah
  - Department of Hematology, School of Medical Sciences, Health Campus, Universiti Sains Malaysia, Kubang Kerian 16150, Malaysia
- Ezalia Esa
  - Institute for Medical Research, Shah Alam 40170, Malaysia
- Zefarina Zulkafli
  - Department of Hematology, School of Medical Sciences, Health Campus, Universiti Sains Malaysia, Kubang Kerian 16150, Malaysia
  - Correspondence:
|
46
|
Zhou Y, Lauschke VM. Challenges Related to the Use of Next-Generation Sequencing for the Optimization of Drug Therapy. Handb Exp Pharmacol 2023; 280:237-260. [PMID: 35792943 DOI: 10.1007/164_2022_596] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/15/2023]
Abstract
Over the last decade, next-generation sequencing (NGS) methods have become increasingly used in various areas of human genomics. In routine clinical care, their use is already implemented in oncology to profile the mutational landscape of a tumor, as well as in rare disease diagnostics. However, its utilization in pharmacogenomics is largely lagging behind. Recent population-scale genome data has revealed that human pharmacogenes carry a plethora of rare genetic variations that are not interrogated by conventional array-based profiling methods and it is estimated that these variants could explain around 30% of the genetically encoded functional pharmacogenetic variability. To interpret the impact of such variants on drug response a multitude of computational tools have been developed, but, while there have been major advancements, it remains to be shown whether their accuracy is sufficient to improve personalized pharmacogenetic recommendations in robust trials. In addition, conventional short-read sequencing methods face difficulties in the interrogation of complex pharmacogenes and high NGS test costs require stringent evaluations of cost-effectiveness to decide about reimbursement by national healthcare programs. Here, we illustrate current challenges and discuss future directions toward the clinical implementation of NGS to inform genotype-guided decision-making.
Affiliation(s)
- Yitian Zhou
  - Department of Physiology and Pharmacology, Karolinska Institutet, Stockholm, Sweden
- Volker M Lauschke
  - Department of Physiology and Pharmacology, Karolinska Institutet, Stockholm, Sweden
  - Dr Margarete Fischer-Bosch Institute of Clinical Pharmacology, Stuttgart, Germany
  - University of Tuebingen, Tuebingen, Germany
|
47
|
Li Q, Yan B, Lam TW, Luo R. Assembly-free discovery of human novel sequences using long reads. DNA Res 2022; 29:dsac039. [PMID: 36308393 PMCID: PMC9700288 DOI: 10.1093/dnares/dsac039] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 05/07/2022] [Revised: 10/19/2022] [Accepted: 10/27/2022] [Indexed: 09/10/2024] Open
Abstract
DNA sequences that are absent in the human reference genome are classified as novel sequences. The discovery of these missed sequences is crucial for exploring the genomic diversity of populations and understanding the genetic basis of human diseases. However, various DNA lengths of reads generated from different sequencing technologies can significantly affect the results of novel sequences. In this work, we designed an assembly-free novel sequence (AF-NS) approach to identify novel sequences from Oxford Nanopore Technology long reads. Among the newly detected sequences using AF-NS, more than 95% were omitted from those using long-read assemblers and 85% were not present in short reads of Illumina. We identified the common novel sequences among all the samples and revealed their association with the binding motifs of transcription factors. Regarding the placements of the novel sequences, we found about 70% enriched in repeat regions and generated 430 for one specific subpopulation that might be related to their evolution. Our study demonstrates the advance of the assembly-free approach to capture more novel sequences over other assembler based methods. Combining the long-read data with powerful analytical methods can be a robust way to improve the completeness of novel sequences.
Affiliation(s)
- Qiuhui Li
  - Department of Computer Science, The University of Hong Kong, Hong Kong, China
- Bin Yan
  - Department of Computer Science, The University of Hong Kong, Hong Kong, China
- Tak-Wah Lam
  - Department of Computer Science, The University of Hong Kong, Hong Kong, China
- Ruibang Luo
  - Department of Computer Science, The University of Hong Kong, Hong Kong, China
|
48
|
Rayamajhi N, Cheng CHC, Catchen JM. Evaluating Illumina-, Nanopore-, and PacBio-based genome assembly strategies with the bald notothen, Trematomus borchgrevinki. G3 (Bethesda, Md.) 2022; 12:jkac192. [PMID: 35904764 PMCID: PMC9635638 DOI: 10.1093/g3journal/jkac192] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Received: 03/21/2022] [Accepted: 07/18/2022] [Indexed: 11/16/2022]
Abstract
For any genome-based research, a robust genome assembly is required. De novo assembly strategies have evolved with changes in DNA sequencing technologies and have been through at least 3 phases: (1) short-read only, (2) short- and long-read hybrid, and (3) long-read only assemblies. Each of the phases has its own error model. We hypothesized that hidden short-read scaffolding errors and erroneous long-read contigs degrade the quality of short- and long-read hybrid assemblies. We assembled the genome of Trematomus borchgrevinki from data generated during each of the 3 phases and assessed the quality problems we encountered. We developed strategies such as k-mer-assembled region replacement, parameter optimization, and long-read sampling to address the error models. We demonstrated that a k-mer-based strategy improved short-read assemblies as measured by Benchmarking Universal Single-Copy Ortholog while mate-pair libraries introduced hidden scaffolding errors and perturbed Benchmarking Universal Single-Copy Ortholog scores. Furthermore, we found that although hybrid assemblies can generate higher contiguity they tend to suffer from lower quality. In addition, we found long-read-only assemblies can be optimized for contiguity by subsampling length-restricted raw reads. Our results indicate that long-read contig assembly is the current best choice and that assemblies from phase I and phase II were of lower quality.
Affiliation(s)
- Niraj Rayamajhi
  - Department of Evolution, Ecology, and Behavior, University of Illinois, Urbana-Champaign, Champaign, IL 61801, USA
- Chi-Hing Christina Cheng
  - Department of Evolution, Ecology, and Behavior, University of Illinois, Urbana-Champaign, Champaign, IL 61801, USA
- Julian M Catchen
  - Department of Evolution, Ecology, and Behavior, University of Illinois, Urbana-Champaign, Champaign, IL 61801, USA
|
49
|
Library adaptors with integrated reference controls improve the accuracy and reliability of nanopore sequencing. Nat Commun 2022; 13:6437. [PMID: 36307482 PMCID: PMC9616880 DOI: 10.1038/s41467-022-34028-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/03/2021] [Accepted: 10/11/2022] [Indexed: 12/25/2022] Open
Abstract
Library adaptors are short oligonucleotides that are attached to RNA and DNA samples in preparation for next-generation sequencing (NGS). Adaptors can also include additional functional elements, such as sample indexes and unique molecular identifiers, to improve library analysis. Here, we describe Control Library Adaptors, termed CAPTORs, that measure the accuracy and reliability of NGS. CAPTORs can be integrated within the library preparation of RNA and DNA samples, and their encoded information is retrieved during sequencing. We show how CAPTORs can measure the accuracy of nanopore sequencing, evaluate the quantitative performance of metagenomic and RNA sequencing, and improve normalisation between samples. CAPTORs can also be customised for clinical diagnoses, correcting systematic sequencing errors and improving the diagnosis of pathogenic BRCA1/2 variants in breast cancer. CAPTORs are a simple and effective method to increase the accuracy and reliability of NGS, enabling comparisons between samples, reagents and laboratories, and supporting the use of nanopore sequencing for clinical diagnosis.
|
50
|
Srinivas M, O’Sullivan O, Cotter PD, van Sinderen D, Kenny JG. The Application of Metagenomics to Study Microbial Communities and Develop Desirable Traits in Fermented Foods. Foods 2022; 11:3297. [PMID: 37431045 PMCID: PMC9601669 DOI: 10.3390/foods11203297] [Citation(s) in RCA: 15] [Impact Index Per Article: 5.0] [Received: 09/04/2022] [Revised: 10/11/2022] [Accepted: 10/19/2022] [Indexed: 11/18/2022] Open
Abstract
The microbial communities present within fermented foods are diverse and dynamic, producing a variety of metabolites responsible for the fermentation processes, imparting characteristic organoleptic qualities and health-promoting traits, and maintaining microbiological safety of fermented foods. In this context, it is crucial to study these microbial communities to characterise fermented foods and the production processes involved. High Throughput Sequencing (HTS)-based methods such as metagenomics enable microbial community studies through amplicon and shotgun sequencing approaches. As the field constantly develops, sequencing technologies are becoming more accessible, affordable and accurate with a further shift from short read to long read sequencing being observed. Metagenomics is enjoying wide-spread application in fermented food studies and in recent years is also being employed in concert with synthetic biology techniques to help tackle problems with the large amounts of waste generated in the food sector. This review presents an introduction to current sequencing technologies and the benefits of their application in fermented foods.
Affiliation(s)
- Meghana Srinivas
  - Food Biosciences Department, Teagasc Food Research Centre, Moorepark, P61 C996 Cork, Ireland
  - APC Microbiome Ireland, University College Cork, T12 CY82 Cork, Ireland
  - School of Microbiology, University College Cork, T12 CY82 Cork, Ireland
- Orla O’Sullivan
  - Food Biosciences Department, Teagasc Food Research Centre, Moorepark, P61 C996 Cork, Ireland
  - APC Microbiome Ireland, University College Cork, T12 CY82 Cork, Ireland
  - VistaMilk SFI Research Centre, Fermoy, P61 C996 Cork, Ireland
- Paul D. Cotter
  - Food Biosciences Department, Teagasc Food Research Centre, Moorepark, P61 C996 Cork, Ireland
  - APC Microbiome Ireland, University College Cork, T12 CY82 Cork, Ireland
  - VistaMilk SFI Research Centre, Fermoy, P61 C996 Cork, Ireland
- Douwe van Sinderen
  - APC Microbiome Ireland, University College Cork, T12 CY82 Cork, Ireland
  - School of Microbiology, University College Cork, T12 CY82 Cork, Ireland
- John G. Kenny
  - Food Biosciences Department, Teagasc Food Research Centre, Moorepark, P61 C996 Cork, Ireland
  - APC Microbiome Ireland, University College Cork, T12 CY82 Cork, Ireland
  - VistaMilk SFI Research Centre, Fermoy, P61 C996 Cork, Ireland
|