1
Schulz T, Medvedev P. ESKEMAP: exact sketch-based read mapping. Algorithms Mol Biol 2024; 19:19. [PMID: 38704605 PMCID: PMC11069465 DOI: 10.1186/s13015-024-00261-7] [Received: 11/01/2023] [Accepted: 03/19/2024] [Indexed: 05/06/2024] Open
Abstract
BACKGROUND Given a sequencing read, the broad goal of read mapping is to find the location(s) in the reference genome that have a "similar sequence". Traditionally, "similar sequence" was defined as having a high alignment score and read mappers were viewed as heuristic solutions to this well-defined problem. For sketch-based mappers, however, there has not been a problem formulation to capture what problem an exact sketch-based mapping algorithm should solve. Moreover, there is no sketch-based method that can find all possible mapping positions for a read above a certain score threshold. RESULTS In this paper, we formulate the problem of read mapping at the level of sequence sketches. We give an exact dynamic programming algorithm that finds all hits above a given similarity threshold. It runs in $O(|t| + |p| + \ell^2)$ time and $O(\ell \log \ell)$ space, where $|t|$ is the number of k-mers inside the sketch of the reference, $|p|$ is the number of k-mers inside the read's sketch and $\ell$ is the number of times that k-mers from the pattern sketch occur in the sketch of the text. We evaluate our algorithm's performance in mapping long reads to the T2T assembly of human chromosome Y, where ampliconic regions make it desirable to find all good mapping positions. At a level of precision equivalent to that of minimap2, the recall of our algorithm is 0.88, compared to only 0.76 for minimap2.
Affiliation(s)
- Tizian Schulz
- Faculty of Technology and Center for Biotechnology (CeBiTec), Bielefeld University, Bielefeld, Germany.
- Bielefeld Institute for Bioinformatics Infrastructure (BIBI), Bielefeld University, Bielefeld, Germany.
- Graduate School "Digital Infrastructure for the Life Sciences" (DILS), Bielefeld University, Bielefeld, Germany.
- Paul Medvedev
- Department of Computer Science and Engineering, The Pennsylvania State University, University Park, USA.
- Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, USA.
- Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, USA.
2
Li H, Durbin R. Genome assembly in the telomere-to-telomere era. Nat Rev Genet 2024:10.1038/s41576-024-00718-w. [PMID: 38649458 DOI: 10.1038/s41576-024-00718-w] [Accepted: 02/27/2024] [Indexed: 04/25/2024]
Abstract
Genome sequences largely determine the biology and encode the history of an organism, and de novo assembly - the process of reconstructing the genome sequence of an organism from sequencing reads - has been a central problem in bioinformatics for four decades. Until recently, genomes were typically assembled into fragments of a few megabases at best, but now technological advances in long-read sequencing enable the near-complete assembly of each chromosome - also known as telomere-to-telomere assembly - for many organisms. Here, we review recent progress on assembly algorithms and protocols, with a focus on how to derive near-telomere-to-telomere assemblies. We also discuss the additional developments that will be required to resolve remaining assembly gaps and to assemble non-diploid genomes.
Affiliation(s)
- Heng Li
- Department of Data Science, Dana-Farber Cancer Institute, Boston, MA, USA.
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA.
- Richard Durbin
- Department of Genetics, Cambridge University, Cambridge, UK.
3
Almeida MV, Blumer M, Yuan CU, Sierra P, Price JL, Quah FX, Friman A, Dallaire A, Vernaz G, Putman ALK, Smith AM, Joyce DA, Butter F, Haase AD, Durbin R, Santos ME, Miska EA. Dynamic co-evolution of transposable elements and the piRNA pathway in African cichlid fishes. bioRxiv 2024:2024.04.01.587621. [PMID: 38617250 PMCID: PMC11014572 DOI: 10.1101/2024.04.01.587621] [Indexed: 04/16/2024]
Abstract
East African cichlid fishes have diversified in an explosive fashion, but the (epi)genetic basis of the phenotypic diversity of these fishes remains largely unknown. Although transposable elements (TEs) have been associated with phenotypic variation in cichlids, little is known about their transcriptional activity and epigenetic silencing. Here, we describe dynamic patterns of TE expression in African cichlid gonads and during early development. Orthology inference revealed an expansion of piwil1 genes in Lake Malawi cichlids, likely driven by PiggyBac TEs. The expanded piwil1 copies have signatures of positive selection and retain amino acid residues essential for catalytic activity. Furthermore, the gonads of African cichlids express a Piwi-interacting RNA (piRNA) pathway that targets TEs. We define the genomic sites of piRNA production in African cichlids and find divergence in closely related species, in line with fast evolution of piRNA-producing loci. Our findings suggest dynamic co-evolution of TEs and host silencing pathways in the African cichlid radiations. We propose that this co-evolution has contributed to cichlid genomic diversity.
Affiliation(s)
- Miguel Vasconcelos Almeida
- Department of Biochemistry, University of Cambridge, Tennis Court Road, Cambridge, CB2 1GA, UK
- Wellcome/CRUK Gurdon Institute, University of Cambridge, Tennis Court Road, Cambridge, CB2 1QN, UK
- Moritz Blumer
- Department of Genetics, University of Cambridge, Downing Street, Cambridge, CB2 3EH, UK
- These authors contributed equally
- Chengwei Ulrika Yuan
- Department of Biochemistry, University of Cambridge, Tennis Court Road, Cambridge, CB2 1GA, UK
- Wellcome/CRUK Gurdon Institute, University of Cambridge, Tennis Court Road, Cambridge, CB2 1QN, UK
- Department of Genetics, University of Cambridge, Downing Street, Cambridge, CB2 3EH, UK
- These authors contributed equally
- Pío Sierra
- Department of Genetics, University of Cambridge, Downing Street, Cambridge, CB2 3EH, UK
- Jonathan L. Price
- Department of Biochemistry, University of Cambridge, Tennis Court Road, Cambridge, CB2 1GA, UK
- Wellcome/CRUK Gurdon Institute, University of Cambridge, Tennis Court Road, Cambridge, CB2 1QN, UK
- Fu Xiang Quah
- Department of Biochemistry, University of Cambridge, Tennis Court Road, Cambridge, CB2 1GA, UK
- Department of Genetics, University of Cambridge, Downing Street, Cambridge, CB2 3EH, UK
- Aleksandr Friman
- National Institute of Diabetes and Digestive and Kidney Diseases, National Institutes of Health, Bethesda, MD 20892, USA
- Biophysics Graduate Program, Institute for Physical Science and Technology, University of Maryland, College Park, Maryland 20742, USA
- Alexandra Dallaire
- Department of Biochemistry, University of Cambridge, Tennis Court Road, Cambridge, CB2 1GA, UK
- Wellcome/CRUK Gurdon Institute, University of Cambridge, Tennis Court Road, Cambridge, CB2 1QN, UK
- Comparative Fungal Biology, Royal Botanic Gardens Kew, Jodrell Laboratory, Richmond TW9 3DS, UK
- Grégoire Vernaz
- Wellcome/CRUK Gurdon Institute, University of Cambridge, Tennis Court Road, Cambridge, CB2 1QN, UK
- Department of Genetics, University of Cambridge, Downing Street, Cambridge, CB2 3EH, UK
- Present address: Zoological Institute, Department of Environmental Sciences, University of Basel, Vesalgasse 1, Basel, 4051, Switzerland
- Audrey L. K. Putman
- Department of Biochemistry, University of Cambridge, Tennis Court Road, Cambridge, CB2 1GA, UK
- Wellcome/CRUK Gurdon Institute, University of Cambridge, Tennis Court Road, Cambridge, CB2 1QN, UK
- Department of Genetics, University of Cambridge, Downing Street, Cambridge, CB2 3EH, UK
- Alan M. Smith
- School of Natural Sciences, University of Hull, Hull, HU6 7RX, UK
- Domino A. Joyce
- School of Natural Sciences, University of Hull, Hull, HU6 7RX, UK
- Falk Butter
- Institute of Molecular Biology (IMB), Quantitative Proteomics, Ackermannweg 4, Mainz, 55128, Germany
- Institute of Molecular Virology and Cell Biology, Friedrich-Loeffler-Institute, Südufer, Greifswald, 17493, Germany
- Astrid D. Haase
- National Institute of Diabetes and Digestive and Kidney Diseases, National Institutes of Health, Bethesda, MD 20892, USA
- Richard Durbin
- Department of Genetics, University of Cambridge, Downing Street, Cambridge, CB2 3EH, UK
- Wellcome Sanger Institute, Tree of Life, Wellcome Genome Campus, Hinxton, CB10 1SA, UK
- M. Emília Santos
- Department of Zoology, University of Cambridge, Downing Street, Cambridge, CB2 3EJ, UK
- Eric A. Miska
- Department of Biochemistry, University of Cambridge, Tennis Court Road, Cambridge, CB2 1GA, UK
- Wellcome/CRUK Gurdon Institute, University of Cambridge, Tennis Court Road, Cambridge, CB2 1QN, UK
- Wellcome Sanger Institute, Tree of Life, Wellcome Genome Campus, Hinxton, CB10 1SA, UK
4
Joe S, Park JL, Kim J, Kim S, Park JH, Yeo MK, Lee D, Yang JO, Kim SY. Comparison of structural variant callers for massive whole-genome sequence data. BMC Genomics 2024; 25:318. [PMID: 38549092 PMCID: PMC10976732 DOI: 10.1186/s12864-024-10239-9] [Received: 07/11/2023] [Accepted: 03/18/2024] [Indexed: 04/01/2024] Open
Abstract
BACKGROUND Detecting structural variations (SVs) at the population level using next-generation sequencing (NGS) requires substantial computational resources and processing time. Here, we compared the performances of 11 SV callers: Delly, Manta, GridSS, Wham, Sniffles, Lumpy, SvABA, Canvas, CNVnator, MELT, and INSurVeyor. These SV callers have been recently published and have been widely employed for processing massive whole-genome sequencing datasets. We evaluated the accuracy, sequence depth, running time, and memory usage of the SV callers. RESULTS Notably, several callers exhibited better calling performance for deletions than for duplications, inversions, and insertions. Among the SV callers, Manta identified deletion SVs with better performance and efficient computing resources, and both Manta and MELT demonstrated relatively good precision regarding calling insertions. We confirmed that the copy number variation callers, Canvas and CNVnator, exhibited better performance in identifying long duplications as they employ the read-depth approach. Finally, we also verified the genotypes inferred from each SV caller using a phased long-read assembly dataset, and Manta showed the highest concordance in terms of the deletions and insertions. CONCLUSIONS Our findings provide a comprehensive understanding of the accuracy and computational efficiency of SV callers, thereby facilitating integrative analysis of SV profiles in diverse large-scale genomic datasets.
Grants
- NRF-2020M3E5D708517212, 2020M3A9I6A0103605713 Ministry of Science and ICT, South Korea
- NTIS-1711170620 KRIBB Research Initiative Program
Affiliation(s)
- Soobok Joe
- Korea Bioinformation Center (KOBIC), Korea Research Institute of Bioscience and Biotechnology (KRIBB), Daejeon, 34141, Republic of Korea
- Jong-Lyul Park
- Aging Convergence Research Center, Korea Research Institute of Bioscience and Biotechnology (KRIBB), Daejeon, 34141, Republic of Korea
- Department of Functional Genomics, University of Science and Technology (UST), 34113, Daejeon, Republic of Korea
- Jun Kim
- Department of Convergent Bioscience and Informatics, College of Bioscience and Biotechnology, Chungnam National University, Daejeon, 34134, Republic of Korea
- Sangok Kim
- Korea Bioinformation Center (KOBIC), Korea Research Institute of Bioscience and Biotechnology (KRIBB), Daejeon, 34141, Republic of Korea
- Ji-Hwan Park
- Korea Bioinformation Center (KOBIC), Korea Research Institute of Bioscience and Biotechnology (KRIBB), Daejeon, 34141, Republic of Korea
- Department of Bioscience, University of Science and Technology (UST), Daejeon, 34113, Republic of Korea
- Min-Kyung Yeo
- Department of Pathology, Chungnam National University School of Medicine, Daejeon, 35015, Republic of Korea
- Dongyoon Lee
- Korea Bioinformation Center (KOBIC), Korea Research Institute of Bioscience and Biotechnology (KRIBB), Daejeon, 34141, Republic of Korea
- Jin Ok Yang
- Korea Bioinformation Center (KOBIC), Korea Research Institute of Bioscience and Biotechnology (KRIBB), Daejeon, 34141, Republic of Korea.
- Department of Bio and Brain Engineering, Korea Advanced Institute of Science and Technology (KAIST), Daejeon, 34141, Republic of Korea.
- Seon-Young Kim
- Korea Bioinformation Center (KOBIC), Korea Research Institute of Bioscience and Biotechnology (KRIBB), Daejeon, 34141, Republic of Korea.
- Department of Bioscience, University of Science and Technology (UST), Daejeon, 34113, Republic of Korea.
5
Ermini L, Driguez P. The Application of Long-Read Sequencing to Cancer. Cancers (Basel) 2024; 16:1275. [PMID: 38610953 PMCID: PMC11011098 DOI: 10.3390/cancers16071275] [Received: 02/23/2024] [Revised: 03/20/2024] [Accepted: 03/21/2024] [Indexed: 04/14/2024] Open
Abstract
Cancer is a multifaceted disease arising from numerous genomic aberrations that have been identified as a result of advancements in sequencing technologies. While next-generation sequencing (NGS), which uses short reads, has transformed cancer research and diagnostics, it is limited by read length. Third-generation sequencing (TGS), led by the Pacific Biosciences and Oxford Nanopore Technologies platforms, employs long-read sequences, which have marked a paradigm shift in cancer research. Cancer genomes often harbour complex events, and TGS, with its ability to span large genomic regions, has facilitated their characterisation, providing a better understanding of how complex rearrangements affect cancer initiation and progression. TGS has also characterised the entire transcriptome of various cancers, revealing cancer-associated isoforms that could serve as biomarkers or therapeutic targets. Furthermore, TGS has advanced cancer research by improving genome assemblies, detecting complex variants, and providing a more complete picture of transcriptomes and epigenomes. This review focuses on TGS and its growing role in cancer research. We investigate its advantages and limitations, providing a rigorous scientific analysis of its use in detecting previously hidden aberrations missed by NGS. This promising technology holds immense potential for both research and clinical applications, with far-reaching implications for cancer diagnosis and treatment.
Affiliation(s)
- Luca Ermini
- NORLUX Neuro-Oncology Laboratory, Department of Cancer Research, Luxembourg Institute of Health, L-1210 Luxembourg, Luxembourg
- Patrick Driguez
- Bioscience Core Lab, King Abdullah University of Science and Technology, Thuwal 23955-6900, Saudi Arabia
6
Mo C, Wang H, Wei M, Zeng Q, Zhang X, Fei Z, Zhang Y, Kong Q. Complete genome assembly provides a high-quality skeleton for pan-NLRome construction in melon. Plant J 2024. [PMID: 38430487 DOI: 10.1111/tpj.16705] [Received: 11/10/2023] [Revised: 02/16/2024] [Accepted: 02/22/2024] [Indexed: 03/03/2024]
Abstract
Melon (Cucumis melo L.), being under intensive domestication and selective breeding, displays an abundant phenotypic diversity. Wild germplasm with tolerance to stress represents an untapped genetic resource for discovery of disease-resistance genes. To comprehensively characterize resistance genes in melon, we generate a telomere-to-telomere (T2T) and gap-free genome of wild melon accession PI511890 (C. melo var. chito) with a total length of 375.0 Mb and a contig N50 of 31.24 Mb. The complete genome allows us to dissect genome architecture and identify resistance gene analogs. We construct a pan-NLRome using seven melon genomes, which include 208 variable and 18 core nucleotide-binding leucine-rich repeat receptors (NLRs). Multiple disease-related transcriptome analyses indicate that most up-regulated NLRs induced by pathogens are shell or cloud NLRs. The T2T gap-free assembly and the pan-NLRome not only serve as essential resources for genomic studies and molecular breeding of melon but also provide insights into the genome architecture and NLR diversity.
Affiliation(s)
- Changjuan Mo
- National Key Laboratory for Germplasm Innovation and Utilization of Horticultural Crops, College of Horticulture and Forestry Sciences, Huazhong Agricultural University, Wuhan, 430070, China
- Haiyan Wang
- National Key Laboratory for Germplasm Innovation and Utilization of Horticultural Crops, College of Horticulture and Forestry Sciences, Huazhong Agricultural University, Wuhan, 430070, China
- Minghua Wei
- National Key Laboratory for Germplasm Innovation and Utilization of Horticultural Crops, College of Horticulture and Forestry Sciences, Huazhong Agricultural University, Wuhan, 430070, China
- Qingguo Zeng
- National Key Laboratory for Germplasm Innovation and Utilization of Horticultural Crops, College of Horticulture and Forestry Sciences, Huazhong Agricultural University, Wuhan, 430070, China
- Xuejun Zhang
- Hami-melon Research Center, Xinjiang Academy of Agricultural Sciences, Urumqi, 830091, China
- Yongbing Zhang
- Hami-melon Research Center, Xinjiang Academy of Agricultural Sciences, Urumqi, 830091, China
- Qiusheng Kong
- National Key Laboratory for Germplasm Innovation and Utilization of Horticultural Crops, College of Horticulture and Forestry Sciences, Huazhong Agricultural University, Wuhan, 430070, China
7
Rajput J, Chandra G, Jain C. Co-linear chaining on pangenome graphs. Algorithms Mol Biol 2024; 19:4. [PMID: 38279113 DOI: 10.1186/s13015-024-00250-w] [Received: 11/21/2023] [Accepted: 01/02/2024] [Indexed: 01/28/2024] Open
Abstract
Pangenome reference graphs are useful in genomics because they compactly represent the genetic diversity within a species, a capability that linear references lack. However, efficiently aligning sequences to these graphs with complex topology and cycles can be challenging. Seed-chain-extend alignment algorithms use co-linear chaining as a standard technique to identify a good cluster of exact seed matches that can be combined to form an alignment. Recent works show how the co-linear chaining problem can be efficiently solved for acyclic pangenome graphs by exploiting their small width and how incorporating gap cost in the scoring function improves alignment accuracy. However, it remains an open question how to effectively generalize these techniques to general pangenome graphs, which contain cycles. Here we present the first practical formulation and an exact algorithm for co-linear chaining on cyclic pangenome graphs. We rigorously prove the correctness and computational complexity of the proposed algorithm. We evaluate the empirical performance of our algorithm by aligning simulated long reads from the human genome to a cyclic pangenome graph constructed from 95 publicly available haplotype-resolved human genome assemblies. While the existing heuristic-based algorithms are faster, the proposed algorithm provides a significant advantage in terms of accuracy. Implementation: https://github.com/at-cg/PanAligner.
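As an illustration of the co-linear chaining idea mentioned in this abstract, the following is a minimal sketch only. It assumes a plain linear reference rather than a cyclic pangenome graph, ignores gap costs, and uses a quadratic-time dynamic program; it is not the algorithm of the paper or of PanAligner. The function name `chain_anchors` and the anchor layout `(ref_start, read_start, length)` are hypothetical.

```python
def chain_anchors(anchors):
    """Return (score, chain) for the best co-linear chain of anchors.

    Each anchor is a hypothetical tuple (ref_start, read_start, length)
    describing an exact seed match. Two anchors may follow each other
    only if the first ends before the second starts in BOTH sequences.
    The score of a chain is the total matched length (no gap cost).
    """
    anchors = sorted(anchors)              # order by reference start
    n = len(anchors)
    if n == 0:
        return 0, []
    score = [a[2] for a in anchors]        # best chain score ending at j
    prev = [-1] * n                        # back-pointer for traceback
    for j in range(n):
        ref_j, read_j, len_j = anchors[j]
        for i in range(j):
            ref_i, read_i, len_i = anchors[i]
            colinear = ref_i + len_i <= ref_j and read_i + len_i <= read_j
            if colinear and score[i] + len_j > score[j]:
                score[j] = score[i] + len_j
                prev[j] = i
    end = max(range(n), key=score.__getitem__)
    best = score[end]
    chain = []
    while end != -1:                       # follow back-pointers
        chain.append(anchors[end])
        end = prev[end]
    return best, chain[::-1]


if __name__ == "__main__":
    demo = [(0, 0, 5), (10, 8, 4), (6, 20, 3), (20, 15, 6)]
    print(chain_anchors(demo))
```

In the demo, the anchor `(6, 20, 3)` is skipped because it lies far ahead in the read and cannot be followed by the later anchors without going backwards in the read.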
Affiliation(s)
- Jyotshna Rajput
- Department of Computational and Data Sciences, Indian Institute of Science, Bangalore, 560012, Karnataka, India
- Ghanshyam Chandra
- Department of Computational and Data Sciences, Indian Institute of Science, Bangalore, 560012, Karnataka, India
- Chirag Jain
- Department of Computational and Data Sciences, Indian Institute of Science, Bangalore, 560012, Karnataka, India.
8
Wei ZG, Zhang XD, Fan XG, Qian Y, Liu F, Wu FX. pathMap: a path-based mapping tool for long noisy reads with high sensitivity. Brief Bioinform 2024; 25:bbae107. [PMID: 38517696 PMCID: PMC10959152 DOI: 10.1093/bib/bbae107] [Received: 07/05/2023] [Revised: 12/25/2023] [Accepted: 02/28/2024] [Indexed: 03/24/2024] Open
Abstract
With the rapid development of single-molecule sequencing (SMS) technologies, the output read length is continuously increasing. Mapping such reads onto a reference genome is one of the most fundamental tasks in sequence analysis. Mapping sensitivity is becoming a major concern since high sensitivity can detect more aligned regions on the reference and obtain more aligned bases, which are useful for downstream analysis. In this study, we present pathMap, a novel k-mer graph-based mapper that is specifically designed for mapping SMS reads with high sensitivity. By viewing the alignment chain as a path containing as many anchors as possible in the matched k-mer graph, pathMap treats chaining as a path selection problem in the directed graph. pathMap iteratively searches for the longest path among the remaining nodes, so that more high-quality candidate chains can be detected and aligned. Compared to other state-of-the-art mapping methods such as minimap2 and Winnowmap2, experimental results on simulated and real-life datasets demonstrate that pathMap obtains at least 11.50% more mapped chains than its closest competitor and increases the mapping sensitivity by 17.28% and 13.84% of bases over the next-best mapper for Pacific Biosciences and Oxford Nanopore sequencing data, respectively. In addition, pathMap is more robust to sequence errors and more sensitive to species- and strain-specific identification of pathogens using MinION reads.
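The "iteratively search the longest path in the remaining nodes" step can be pictured with the following minimal sketch. It assumes the matched k-mer graph is already available as a directed acyclic graph whose nodes are given in topological order, and it measures path length by node count; these choices, and the names `longest_path` and `iterative_chains`, are illustrative assumptions and not pathMap's actual data structures or scoring.

```python
def longest_path(nodes, succ):
    """Longest path (by node count) in a DAG restricted to `nodes`.

    `nodes` must be listed in topological order; `succ` maps a node to
    its successors. Both are hypothetical inputs for this sketch.
    """
    alive = set(nodes)
    length = {v: 1 for v in nodes}         # longest path starting at v
    nxt = {v: None for v in nodes}         # next node on that path
    for v in reversed(nodes):              # successors are settled first
        for u in succ.get(v, ()):
            if u in alive and length[u] + 1 > length[v]:
                length[v] = length[u] + 1
                nxt[v] = u
    start = max(nodes, key=length.__getitem__)
    path = []
    while start is not None:
        path.append(start)
        start = nxt[start]
    return path


def iterative_chains(nodes, succ, min_len=2):
    """Repeatedly extract the longest path, then drop its nodes."""
    remaining = list(nodes)
    chains = []
    while remaining:
        path = longest_path(remaining, succ)
        if len(path) < min_len:            # nothing chain-like is left
            break
        chains.append(path)
        used = set(path)
        remaining = [v for v in remaining if v not in used]
    return chains


if __name__ == "__main__":
    order = ["a", "b", "c", "d", "e", "f"]
    edges = {"a": ["b", "d"], "b": ["c"], "d": ["e"], "e": ["f"]}
    print(iterative_chains(order, edges))
```

Removing the nodes of each extracted path before the next search is what lets a second, shorter candidate chain surface instead of being shadowed by the best one.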
Affiliation(s)
- Ze-Gang Wei
- School of Physics and Opto-Electronics Technology, Baoji University of Arts and Sciences, Baoji, 721016, China
- Division of Biomedical Engineering, Department of Computer Science and Department of Mechanical Engineering, University of Saskatchewan, Saskatoon, SK S7N 5A9, Canada
- Xiao-Dan Zhang
- School of Physics and Opto-Electronics Technology, Baoji University of Arts and Sciences, Baoji, 721016, China
- Xing-Guo Fan
- School of Physics and Opto-Electronics Technology, Baoji University of Arts and Sciences, Baoji, 721016, China
- Yu Qian
- School of Physics and Opto-Electronics Technology, Baoji University of Arts and Sciences, Baoji, 721016, China
- Fei Liu
- School of Physics and Opto-Electronics Technology, Baoji University of Arts and Sciences, Baoji, 721016, China
- Fang-Xiang Wu
- Division of Biomedical Engineering, Department of Computer Science and Department of Mechanical Engineering, University of Saskatchewan, Saskatoon, SK S7N 5A9, Canada
9
Hall MB, Coin LJM. Pangenome databases improve host removal and mycobacteria classification from clinical metagenomic data. Gigascience 2024; 13:giae010. [PMID: 38573185 PMCID: PMC10993716 DOI: 10.1093/gigascience/giae010] [Received: 09/18/2023] [Revised: 01/10/2024] [Accepted: 02/27/2024] [Indexed: 04/05/2024] Open
Abstract
BACKGROUND Culture-free real-time sequencing of clinical metagenomic samples promises both rapid pathogen detection and antimicrobial resistance profiling. However, this approach introduces the risk of patient DNA leakage. To mitigate this risk, we need near-comprehensive removal of human DNA sequences at the point of sequencing, typically involving the use of resource-constrained devices. Existing benchmarks have largely focused on the use of standardized databases and largely ignored the computational requirements of depletion pipelines as well as the impact of human genome diversity. RESULTS We benchmarked host removal pipelines on simulated and artificial real Illumina and Nanopore metagenomic samples. We found that construction of a custom kraken database containing diverse human genomes results in the best balance of accuracy and computational resource usage. In addition, we benchmarked pipelines using kraken and minimap2 for taxonomic classification of Mycobacterium reads using standard and custom databases. With a database representative of the Mycobacterium genus, both tools obtained improved specificity and sensitivity, compared to the standard databases for classification of Mycobacterium tuberculosis. Computational efficiency of these custom databases was superior to most standard approaches, allowing them to be executed on a laptop device. CONCLUSIONS Customized pangenome databases provide the best balance of accuracy and computational efficiency when compared to standard databases for the task of human read removal and M. tuberculosis read classification from metagenomic samples. Such databases allow for execution on a laptop, without sacrificing accuracy, an especially important consideration in low-resource settings. We make all customized databases and pipelines freely available.
Affiliation(s)
- Michael B Hall
- Department of Microbiology and Immunology, Peter Doherty Institute for Infection and Immunity, The University of Melbourne, Melbourne, 3000 Victoria, Australia
- Lachlan J M Coin
- Department of Microbiology and Immunology, Peter Doherty Institute for Infection and Immunity, The University of Melbourne, Melbourne, 3000 Victoria, Australia
10
Valerio F, Twort VG, Duplouy A. Screening Host Genomic Data for Wolbachia Infections. Methods Mol Biol 2024; 2739:251-274. [PMID: 38006557 DOI: 10.1007/978-1-0716-3553-7_16] [Indexed: 11/27/2023]
Abstract
Less than a decade ago, the production of Wolbachia genomic assemblies was tedious, time-consuming, and expensive. The production of Wolbachia genomic DNA free of contamination from host DNA, as required for Wolbachia-targeted sequencing, was then only possible after the amplification and extraction of a large amount of clonal Wolbachia DNA. However, as an endosymbiotic bacterium, Wolbachia does not grow outside the host cell environment, and large-scale recovery of the bacteria required mass rearing of their host, preferably clones of a single individual to avoid strain genetic diversity, or amplification of cell cultures infected with a single Wolbachia strain. Bacterial DNA could be separated from host DNA based on genomic size. Nowadays, the production of full Wolbachia genomes does not require the physical isolation of the bacterial strains from their respective hosts, and the bacterium is often sequenced as a by-catch of host genomic projects. Here, we provide a step-by-step protocol to (1) identify whether host genome projects contain reads from associated Wolbachia and (2) isolate/retrieve the Wolbachia reads from the rest of the sequenced material. We hope this simple protocol will support many projects aiming at studying diverse Wolbachia genome assemblies.
Affiliation(s)
- Federica Valerio
- Insect Symbiosis Ecology and Evolution, Organismal and Evolutionary Biology Research Program, Faculty of Biological and Environmental Sciences, University of Helsinki, Helsinki, Finland
- Research Centre for Ecological Changes, University of Helsinki, Helsinki, Finland
- Victoria G Twort
- The Finnish Museum of Natural History, Luomus, University of Helsinki, Helsinki, Finland
- Anne Duplouy
- Insect Symbiosis Ecology and Evolution, Organismal and Evolutionary Biology Research Program, Faculty of Biological and Environmental Sciences, University of Helsinki, Helsinki, Finland.
- Research Centre for Ecological Changes, University of Helsinki, Helsinki, Finland.
11
Chen NC, Paulin LF, Sedlazeck FJ, Koren S, Phillippy AM, Langmead B. Improved sequence mapping using a complete reference genome and lift-over. Nat Methods 2024; 21:41-49. [PMID: 38036856 DOI: 10.1038/s41592-023-02069-6] [Received: 04/27/2022] [Accepted: 10/09/2023] [Indexed: 12/02/2023]
Abstract
Complete, telomere-to-telomere (T2T) genome assemblies promise improved analyses and the discovery of new variants, but many essential genomic resources remain associated with older reference genomes. Thus, there is a need to translate genomic features and read alignments between references. Here we describe a method called levioSAM2 that performs fast and accurate lift-over between assemblies using a whole-genome map. In addition to enabling the use of several references, we demonstrate that aligning reads to a high-quality reference (for example, T2T-CHM13) and lifting to an older reference (for example, Genome Reference Consortium (GRC)h38) improves the accuracy of the resulting variant calls on the old reference. By leveraging the quality improvements of T2T-CHM13, levioSAM2 reduces small and structural variant calling errors compared with GRC-based mapping using real short- and long-read datasets. Performance is especially improved for a set of complex medically relevant genes, where the GRC references are of lower quality.
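To make the notion of lift-over concrete, here is a minimal sketch of translating a single coordinate between two assemblies. It assumes the whole-genome map has been reduced to a sorted list of ungapped, same-strand blocks `(src_start, dst_start, length)` on one chromosome; this block layout and the function name `lift_position` are hypothetical simplifications and do not reflect levioSAM2's interface or its handling of read alignments.

```python
from bisect import bisect_right


def lift_position(blocks, pos):
    """Translate `pos` from source to destination coordinates.

    `blocks` is a list of (src_start, dst_start, length) tuples sorted
    by src_start and non-overlapping in the source assembly (an assumed,
    simplified stand-in for a whole-genome map). Returns None when the
    position falls in a region with no counterpart in the destination.
    """
    starts = [b[0] for b in blocks]
    i = bisect_right(starts, pos) - 1      # last block starting <= pos
    if i < 0:
        return None
    src_start, dst_start, length = blocks[i]
    if pos < src_start + length:
        return dst_start + (pos - src_start)
    return None                            # pos lies in a gap between blocks


if __name__ == "__main__":
    demo_blocks = [(0, 0, 100), (100, 120, 50), (160, 170, 40)]
    for p in (50, 110, 155, 160):
        print(p, "->", lift_position(demo_blocks, p))
```

Positions that fall between blocks return `None`, which mirrors the practical situation in which part of one assembly has no aligned counterpart in the other.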
Affiliation(s)
- Nae-Chyun Chen
  - Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
- Luis F Paulin
  - Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA
- Fritz J Sedlazeck
  - Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA
  - Department of Computer Science, Rice University, Houston, TX, USA
- Sergey Koren
  - Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
- Adam M Phillippy
  - Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
- Ben Langmead
  - Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
12
Hoang M, Marçais G, Kingsford C. Density and Conservation Optimization of the Generalized Masked-Minimizer Sketching Scheme. J Comput Biol 2024; 31:2-20. [PMID: 37975802 PMCID: PMC10794853 DOI: 10.1089/cmb.2023.0212] [Indexed: 11/19/2023] Open
Abstract
Minimizers and syncmers are sketching methods that sample representative k-mer seeds from a long string. The minimizer scheme guarantees a well-spread k-mer sketch (high coverage) while seeking to minimize the sketch size (low density). The syncmer scheme yields sketches that are more robust to base substitutions (high conservation) on random sequences, but do not have the coverage guarantee of minimizers. These sketching metrics are generally adversarial to one another, especially in the context of sketch optimization for a specific sequence, and thus are difficult to achieve simultaneously. The parameterized syncmer scheme was recently introduced as a generalization of syncmers with more flexible sampling rules and empirically better coverage than the original syncmer variants. However, no approach exists to optimize parameterized syncmers. To address this shortcoming, we introduce a new scheme called masked minimizers that generalizes minimizers in a manner analogous to how parameterized syncmers generalize syncmers and allows us to extend existing optimization techniques developed for minimizers. This results in a practical algorithm to optimize the masked minimizer scheme with respect to both density and conservation. We evaluate the optimization algorithm on various benchmark genomes and show that our algorithm finds sketches that are overall more compact, well-spread, and robust to substitutions than those found by previous methods. Our implementation is released at https://github.com/Kingsford-Group/maskedminimizer. This new technique will enable more efficient and robust genomic analyses in the many settings where minimizers and syncmers are used.
Affiliation(s)
- Minh Hoang
  - Department of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA
- Guillaume Marçais
  - Department of Computational Biology, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA
- Carl Kingsford
  - Department of Computational Biology, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA
13
LoTempio J, Delot E, Vilain E. Benchmarking long-read genome sequence alignment tools for human genomics applications. PeerJ 2023; 11:e16515. [PMID: 38130927 PMCID: PMC10734412 DOI: 10.7717/peerj.16515] [Received: 07/25/2023] [Accepted: 11/02/2023] [Indexed: 12/23/2023] Open
Abstract
BACKGROUND The utility of long-read genome sequencing platforms has been shown in many fields including whole genome assembly, metagenomics, and amplicon sequencing. Less clear is the applicability of long reads to reference-guided human genomics, which is the foundation of genomic medicine. Here, we benchmark available platform-agnostic alignment tools on datasets from nanopore and single-molecule real-time platforms to understand their suitability in producing a genome representation. RESULTS For this study, we leveraged publicly available data from sample NA12878 generated on Oxford Nanopore and sample NA24385 on Pacific Biosciences platforms. We employed state-of-the-art sequence alignment tools including GraphMap2, long-read aligner (LRA), Minimap2, CoNvex Gap-cost alignMents for Long Reads (NGMLR), and Winnowmap2. Minimap2 and Winnowmap2 were computationally lightweight enough for use at scale, while GraphMap2 was not. NGMLR took a long time and required many resources, but produced alignments each time. LRA was fast, but only worked on Pacific Biosciences data. The tools disagreed widely on which reads to leave unaligned, affecting the end genome coverage and the number of discoverable breakpoints. No alignment tool independently resolved all large structural variants (1,001-100,000 base pairs) present in the Database of Genome Variants (DGV) for sample NA12878 or the truthset for NA24385. CONCLUSIONS These results suggest a combined approach is needed for long-read sequencing (LRS) alignments for human genomics. Specifically, leveraging alignments from three tools will be more effective in generating a complete picture of genomic variability. It should be best practice to use an analysis pipeline that generates alignments with both Minimap2 and Winnowmap2 as they are lightweight and yield different views of the genome. Depending on the question at hand, the data available, and the time constraints, NGMLR and LRA are good options for a third tool.
If computational resources and time are not a factor for a given case or experiment, NGMLR will provide another view, and another chance to resolve a case. LRA, while fast, did not work on the nanopore data for our cluster, but PacBio results were promising in that those computations completed faster than Minimap2. Due to its significant burden on computational resources and slow run time, GraphMap2 is not an ideal tool for exploration of a whole human genome generated on a long-read sequencing platform.
Affiliation(s)
- Jonathan LoTempio
  - Institute for Clinical and Translational Science, University of California, Irvine, CA, United States of America
  - International Research Laboratory (IRL2006) “Epigenetics, Data, Politics (EpiDaPo)”, Centre National de la Recherche Scientifique, Washington, DC, United States of America
- Emmanuele Delot
  - Center for Genetic Medicine Research, Children’s National Hospital, Washington, DC, United States of America
  - Department of Genomics and Precision Medicine, George Washington University, Washington, DC, United States of America
- Eric Vilain
  - Institute for Clinical and Translational Science, University of California, Irvine, CA, United States of America
  - International Research Laboratory (IRL2006) “Epigenetics, Data, Politics (EpiDaPo)”, Centre National de la Recherche Scientifique, Washington, DC, United States of America
14
Zhu F, Yin ZT, Zhao QS, Sun YX, Jie YC, Smith J, Yang YZ, Burt DW, Hincke M, Zhang ZD, Yuan MD, Kaufman J, Sun CJ, Li JY, Shao LW, Yang N, Hou ZC. A chromosome-level genome assembly for the Silkie chicken resolves complete sequences for key chicken metabolic, reproductive, and immunity genes. Commun Biol 2023; 6:1233. [PMID: 38057566 PMCID: PMC10700341 DOI: 10.1038/s42003-023-05619-y] [Received: 08/03/2022] [Accepted: 11/21/2023] [Indexed: 12/08/2023] Open
Abstract
A set of high-quality pan-genomes would help identify important genes that are still hidden/incomplete in bird reference genomes. In an attempt to address these issues, we have assembled a de novo chromosome-level reference genome of the Silkie (Gallus gallus domesticus), which is an important avian model for unique traits, like fibromelanosis, with unclear genetic foundation. This Silkie genome includes the complete genomic sequences of well-known, but unresolved, evolutionarily, endocrinologically, and immunologically important genes, including leptin, ovocleidin-17, and tumor-necrosis factor-α. The gap-less and manually annotated MHC (major histocompatibility complex) region possesses 38 recently identified genes, with differentially regulated genes recovered in response to pathogen challenges. We also provide whole-genome methylation and genetic variation maps, and resolve a complex genetic region that may contribute to fibromelanosis in these animals. Finally, we experimentally show leptin binding to the identified leptin receptor in chicken, confirming an active leptin ligand-receptor system. The Silkie genome assembly not only provides a rich data resource for avian genome studies, but also lays a foundation for further functional validation of resolved genes.
Affiliation(s)
- Feng Zhu
  - National Engineering Laboratory for Animal Breeding and Key Laboratory of Animal Genetics, Breeding and Reproduction, MARA; College of Animal Science and Technology, China Agricultural University, No. 2 Yuanmingyuan West Rd, 100193, Beijing, China
- Zhong-Tao Yin
  - National Engineering Laboratory for Animal Breeding and Key Laboratory of Animal Genetics, Breeding and Reproduction, MARA; College of Animal Science and Technology, China Agricultural University, No. 2 Yuanmingyuan West Rd, 100193, Beijing, China
- Qiang-Sen Zhao
  - National Engineering Laboratory for Animal Breeding and Key Laboratory of Animal Genetics, Breeding and Reproduction, MARA; College of Animal Science and Technology, China Agricultural University, No. 2 Yuanmingyuan West Rd, 100193, Beijing, China
- Yun-Xiao Sun
  - National Engineering Laboratory for Animal Breeding and Key Laboratory of Animal Genetics, Breeding and Reproduction, MARA; College of Animal Science and Technology, China Agricultural University, No. 2 Yuanmingyuan West Rd, 100193, Beijing, China
- Yu-Chen Jie
  - National Engineering Laboratory for Animal Breeding and Key Laboratory of Animal Genetics, Breeding and Reproduction, MARA; College of Animal Science and Technology, China Agricultural University, No. 2 Yuanmingyuan West Rd, 100193, Beijing, China
- Jacqueline Smith
  - The Roslin Institute & R(D)SVS, University of Edinburgh, Easter Bush, Midlothian, EH25 9RG, UK
- Yu-Ze Yang
  - Beijing General Station of Animal Husbandry, 100101, Beijing, China
- David W Burt
  - The Roslin Institute & R(D)SVS, University of Edinburgh, Easter Bush, Midlothian, EH25 9RG, UK
  - The University of Queensland, St. Lucia, QLD, 4072, Australia
- Maxwell Hincke
  - Department of Cellular and Molecular Medicine, Department of Innovation in Medical Education, Faculty of Medicine, University of Ottawa, 451 Smyth Road, Ottawa, K1H 8M5, Canada
- Zi-Ding Zhang
  - College of Biological Sciences, China Agricultural University, 100193, Beijing, China
- Meng-Di Yuan
  - College of Biological Sciences, China Agricultural University, 100193, Beijing, China
- Jim Kaufman
  - Institute for Immunology and Infection Research, University of Edinburgh, Edinburgh, EH9 3FL, UK
  - Department of Pathology, University of Cambridge, Cambridge, CB2 1QP, UK
- Cong-Jiao Sun
  - National Engineering Laboratory for Animal Breeding and Key Laboratory of Animal Genetics, Breeding and Reproduction, MARA; College of Animal Science and Technology, China Agricultural University, No. 2 Yuanmingyuan West Rd, 100193, Beijing, China
- Jun-Ying Li
  - National Engineering Laboratory for Animal Breeding and Key Laboratory of Animal Genetics, Breeding and Reproduction, MARA; College of Animal Science and Technology, China Agricultural University, No. 2 Yuanmingyuan West Rd, 100193, Beijing, China
- Li-Wa Shao
  - National Engineering Laboratory for Animal Breeding and Key Laboratory of Animal Genetics, Breeding and Reproduction, MARA; College of Animal Science and Technology, China Agricultural University, No. 2 Yuanmingyuan West Rd, 100193, Beijing, China
- Ning Yang
  - National Engineering Laboratory for Animal Breeding and Key Laboratory of Animal Genetics, Breeding and Reproduction, MARA; College of Animal Science and Technology, China Agricultural University, No. 2 Yuanmingyuan West Rd, 100193, Beijing, China
- Zhuo-Cheng Hou
  - National Engineering Laboratory for Animal Breeding and Key Laboratory of Animal Genetics, Breeding and Reproduction, MARA; College of Animal Science and Technology, China Agricultural University, No. 2 Yuanmingyuan West Rd, 100193, Beijing, China
  - Sanya Institute of China Agricultural University, Beijing, China
15
Jia P, Dong L, Yang X, Wang B, Bush SJ, Wang T, Lin J, Wang S, Zhao X, Xu T, Che Y, Dang N, Ren L, Zhang Y, Wang X, Liang F, Wang Y, Ruan J, Xia H, Zheng Y, Shi L, Lv Y, Wang J, Ye K. Haplotype-resolved assemblies and variant benchmark of a Chinese Quartet. Genome Biol 2023; 24:277. [PMID: 38049885 PMCID: PMC10694985 DOI: 10.1186/s13059-023-03116-3] [Received: 08/25/2023] [Accepted: 11/21/2023] [Indexed: 12/06/2023] Open
Abstract
BACKGROUND Recent state-of-the-art sequencing technologies enable the investigation of challenging regions in the human genome and expand the scope of variant benchmarking datasets. Herein, we sequence a Chinese Quartet, comprising two monozygotic twin daughters and their biological parents, using four short and long sequencing platforms (Illumina, BGI, PacBio, and Oxford Nanopore Technology). RESULTS The long reads from the monozygotic twin daughters are phased into paternal and maternal haplotypes using the parent-child genetic map and for each haplotype. We also use long reads to generate haplotype-resolved whole-genome assemblies with completeness and continuity exceeding that of GRCh38. Using this Quartet, we comprehensively catalogue the human variant landscape, generating a dataset of 3,962,453 SNVs, 886,648 indels (< 50 bp), 9,726 large deletions (≥ 50 bp), 15,600 large insertions (≥ 50 bp), 40 inversions, 31 complex structural variants, and 68 de novo mutations which are shared between the monozygotic twin daughters. Variants underrepresented in previous benchmarks owing to their complexity (including those located at long repeat regions, complex structural variants, and de novo mutations) are systematically examined in this study. CONCLUSIONS In summary, this study provides high-quality haplotype-resolved assemblies and a comprehensive set of benchmarking resources for two Chinese monozygotic twin samples which, relative to existing benchmarks, offers expanded genomic coverage and insight into complex variant categories.
Affiliation(s)
- Peng Jia
  - National Local Joint Engineering Research Center for Precision Surgery & Regenerative Medicine, Center for Mathematical Medical, The First Affiliated Hospital of Xi'an Jiaotong University, Xi'an, 710061, China
  - School of Automation Science and Engineering, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, 710049, China
  - MOE Key Lab for Intelligent Networks & Networks Security, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, 710049, China
- Lianhua Dong
  - National Institute of Metrology, Beijing, 100029, China
- Xiaofei Yang
  - MOE Key Lab for Intelligent Networks & Networks Security, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, 710049, China
  - School of Computer Science and Technology, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, 710049, China
  - Genome Institute, The First Affiliated Hospital of Xi'an Jiaotong University, Xi'an, 710061, China
- Bo Wang
  - School of Automation Science and Engineering, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, 710049, China
  - MOE Key Lab for Intelligent Networks & Networks Security, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, 710049, China
- Stephen J Bush
  - School of Automation Science and Engineering, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, 710049, China
- Tingjie Wang
  - National Local Joint Engineering Research Center for Precision Surgery & Regenerative Medicine, Center for Mathematical Medical, The First Affiliated Hospital of Xi'an Jiaotong University, Xi'an, 710061, China
  - School of Automation Science and Engineering, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, 710049, China
  - MOE Key Lab for Intelligent Networks & Networks Security, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, 710049, China
- Jiadong Lin
  - School of Automation Science and Engineering, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, 710049, China
  - MOE Key Lab for Intelligent Networks & Networks Security, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, 710049, China
- Songbo Wang
  - School of Automation Science and Engineering, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, 710049, China
  - MOE Key Lab for Intelligent Networks & Networks Security, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, 710049, China
- Xixi Zhao
  - National Local Joint Engineering Research Center for Precision Surgery & Regenerative Medicine, Center for Mathematical Medical, The First Affiliated Hospital of Xi'an Jiaotong University, Xi'an, 710061, China
  - MOE Key Lab for Intelligent Networks & Networks Security, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, 710049, China
  - School of Computer Science and Technology, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, 710049, China
- Tun Xu
  - School of Automation Science and Engineering, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, 710049, China
  - MOE Key Lab for Intelligent Networks & Networks Security, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, 710049, China
- Yizhuo Che
  - School of Automation Science and Engineering, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, 710049, China
  - MOE Key Lab for Intelligent Networks & Networks Security, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, 710049, China
- Ningxin Dang
  - Genome Institute, The First Affiliated Hospital of Xi'an Jiaotong University, Xi'an, 710061, China
- Luyao Ren
  - State Key Laboratory of Genetic Engineering, Human Phenome Institute, School of Life Sciences and Shanghai Cancer Center, Fudan University, Shanghai, 200438, China
- Yujing Zhang
  - National Institute of Metrology, Beijing, 100029, China
- Xia Wang
  - National Institute of Metrology, Beijing, 100029, China
- Fan Liang
  - GrandOmics Biosciences, Beijing, 100089, China
- Yang Wang
  - GrandOmics Biosciences, Beijing, 100089, China
- Jue Ruan
  - Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, 518120, China
- Han Xia
  - School of Automation Science and Engineering, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, 710049, China
- Yuanting Zheng
  - State Key Laboratory of Genetic Engineering, Human Phenome Institute, School of Life Sciences and Shanghai Cancer Center, Fudan University, Shanghai, 200438, China
- Leming Shi
  - State Key Laboratory of Genetic Engineering, Human Phenome Institute, School of Life Sciences and Shanghai Cancer Center, Fudan University, Shanghai, 200438, China
- Yi Lv
  - National Local Joint Engineering Research Center for Precision Surgery & Regenerative Medicine, Center for Mathematical Medical, The First Affiliated Hospital of Xi'an Jiaotong University, Xi'an, 710061, China
- Jing Wang
  - National Institute of Metrology, Beijing, 100029, China
- Kai Ye
  - National Local Joint Engineering Research Center for Precision Surgery & Regenerative Medicine, Center for Mathematical Medical, The First Affiliated Hospital of Xi'an Jiaotong University, Xi'an, 710061, China
  - School of Automation Science and Engineering, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, 710049, China
  - MOE Key Lab for Intelligent Networks & Networks Security, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, 710049, China
  - Genome Institute, The First Affiliated Hospital of Xi'an Jiaotong University, Xi'an, 710061, China
  - School of Life Science and Technology, Xi'an Jiaotong University, Xi'an 710049, China
  - Faculty of Science, Leiden University, Leiden, 2311EZ, The Netherlands
16
Makova KD, Pickett BD, Harris RS, Hartley GA, Cechova M, Pal K, Nurk S, Yoo D, Li Q, Hebbar P, McGrath BC, Antonacci F, Aubel M, Biddanda A, Borchers M, Bomberg E, Bouffard GG, Brooks SY, Carbone L, Carrel L, Carroll A, Chang PC, Chin CS, Cook DE, Craig SJ, de Gennaro L, Diekhans M, Dutra A, Garcia GH, Grady PG, Green RE, Haddad D, Hallast P, Harvey WT, Hickey G, Hillis DA, Hoyt SJ, Jeong H, Kamali K, Kosakovsky Pond SL, LaPolice TM, Lee C, Lewis AP, Loh YHE, Masterson P, McCoy RC, Medvedev P, Miga KH, Munson KM, Pak E, Paten B, Pinto BJ, Potapova T, Rhie A, Rocha JL, Ryabov F, Ryder OA, Sacco S, Shafin K, Shepelev VA, Slon V, Solar SJ, Storer JM, Sudmant PH, Sweetalana, Sweeten A, Tassia MG, Thibaud-Nissen F, Ventura M, Wilson MA, Young AC, Zeng H, Zhang X, Szpiech ZA, Huber CD, Gerton JL, Yi SV, Schatz MC, Alexandrov IA, Koren S, O’Neill RJ, Eichler E, Phillippy AM. The Complete Sequence and Comparative Analysis of Ape Sex Chromosomes. bioRxiv 2023:2023.11.30.569198. [PMID: 38077089 PMCID: PMC10705393 DOI: 10.1101/2023.11.30.569198] [Indexed: 12/24/2023]
Abstract
Apes possess two sex chromosomes: the male-specific Y and the X shared by males and females. The Y chromosome is crucial for male reproduction, with deletions linked to infertility. The X chromosome carries genes vital for reproduction and cognition. Variation in mating patterns and brain function among great apes suggests corresponding differences in their sex chromosome structure and evolution. However, due to their highly repetitive nature and incomplete reference assemblies, ape sex chromosomes have been challenging to study. Here, using the state-of-the-art experimental and computational methods developed for the telomere-to-telomere (T2T) human genome, we produced gapless, complete assemblies of the X and Y chromosomes for five great apes (chimpanzee, bonobo, gorilla, Bornean and Sumatran orangutans) and a lesser ape, the siamang gibbon. These assemblies completely resolved ampliconic, palindromic, and satellite sequences, including the entire centromeres, allowing us to untangle the intricacies of ape sex chromosome evolution. We found that, compared to the X, ape Y chromosomes vary greatly in size and have low alignability and high levels of structural rearrangements. This divergence on the Y arises from the accumulation of lineage-specific ampliconic regions and palindromes (which are shared more broadly among species on the X) and from the abundance of transposable elements and satellites (which have a lower representation on the X). Our analysis of Y chromosome genes revealed lineage-specific expansions of multi-copy gene families and signatures of purifying selection. In summary, the Y exhibits dynamic evolution, while the X is more stable. Finally, mapping short-read sequencing data from >100 great ape individuals revealed the patterns of diversity and selection on their sex chromosomes, demonstrating the utility of these reference assemblies for studies of great ape evolution.
These complete sex chromosome assemblies are expected to further inform conservation genetics of nonhuman apes, all of which are endangered species.
Affiliation(s)
- Brandon D. Pickett
  - National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
- Monika Cechova
  - University of California Santa Cruz, Santa Cruz, CA, USA
- Karol Pal
  - Penn State University, University Park, PA, USA
- Sergey Nurk
  - National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
- DongAhn Yoo
  - University of Washington School of Medicine, Seattle, WA, USA
- Qiuhui Li
  - Johns Hopkins University, Baltimore, MD, USA
- Prajna Hebbar
  - University of California Santa Cruz, Santa Cruz, CA, USA
- Erich Bomberg
  - University of Münster, Münster, Germany
  - MPI for Developmental Biology, Tübingen, Germany
- Gerard G. Bouffard
  - National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
- Shelise Y. Brooks
  - National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
- Lucia Carbone
  - Oregon Health & Science University, Portland, OR, USA
  - Oregon National Primate Research Center, Hillsboro, OR, USA
- Laura Carrel
  - Penn State University School of Medicine, Hershey, PA, USA
- Chen-Shan Chin
  - Foundation of Biological Data Sciences, Belmont, CA, USA
- Mark Diekhans
  - University of California Santa Cruz, Santa Cruz, CA, USA
- Amalia Dutra
  - National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
- Gage H. Garcia
  - University of Washington School of Medicine, Seattle, WA, USA
- Diana Haddad
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
- Pille Hallast
  - The Jackson Laboratory for Genomic Medicine, Farmington, CT, USA
- Glenn Hickey
  - University of California Santa Cruz, Santa Cruz, CA, USA
- David A. Hillis
  - University of California Santa Barbara, Santa Barbara, CA, USA
- Hyeonsoo Jeong
  - University of Washington School of Medicine, Seattle, WA, USA
- Charles Lee
  - The Jackson Laboratory for Genomic Medicine, Farmington, CT, USA
- Patrick Masterson
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
- Karen H. Miga
  - University of California Santa Cruz, Santa Cruz, CA, USA
- Evgenia Pak
  - National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
- Benedict Paten
  - University of California Santa Cruz, Santa Cruz, CA, USA
- Arang Rhie
  - National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
- Fedor Ryabov
  - Masters Program in National Research University Higher School of Economics, Moscow, Russia
- Samuel Sacco
  - University of California Santa Cruz, Santa Cruz, CA, USA
- Steven J. Solar
  - National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
- Sweetalana
  - Penn State University, University Park, PA, USA
- Alex Sweeten
  - National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
  - Johns Hopkins University, Baltimore, MD, USA
- Françoise Thibaud-Nissen
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
- Alice C. Young
  - National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
- Xinru Zhang
  - Penn State University, University Park, PA, USA
- Soojin V. Yi
  - University of California Santa Barbara, Santa Barbara, CA, USA
- Sergey Koren
  - National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
- Evan Eichler
  - University of Washington School of Medicine, Seattle, WA, USA
  - Howard Hughes Medical Institute, University of Washington, Seattle, WA, USA
- Adam M. Phillippy
  - National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
17
Zheng H, Marçais G, Kingsford C. Creating and Using Minimizer Sketches in Computational Genomics. J Comput Biol 2023; 30:1251-1276. [PMID: 37646787 DOI: 10.1089/cmb.2023.0094] [Indexed: 09/01/2023] Open
Abstract
Processing large data sets has become an essential part of computational genomics. Greatly increased availability of sequence data from multiple sources has fueled breakthroughs in genomics and related fields but has led to computational challenges in processing large sequencing experiments. The minimizer sketch is a popular method for sequence sketching that underlies core steps in computational genomics such as read mapping, sequence assembly, k-mer counting, and more. In most applications, minimizer sketches are constructed using one of a few classical approaches. More recently, efforts have been put into building minimizer sketches with desirable properties compared with the classical constructions. In this survey, we review the history of the minimizer sketch, the theories developed around the concept, and the plethora of applications taking advantage of such sketches. We aim to provide readers with a comprehensive picture of the research landscape involving minimizer sketches, in anticipation of better fusion of theory and application in the future.
Affiliation(s)
- Hongyu Zheng
  - Computer Science Department, Princeton University, Princeton, New Jersey, USA
- Guillaume Marçais
  - Computational Biology Department, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA
- Carl Kingsford
  - Computational Biology Department, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA
18
Daida K, Funayama M, Billingsley KJ, Malik L, Miano-Burkhardt A, Leonard HL, Makarious MB, Iwaki H, Ding J, Gibbs JR, Ishiguro M, Yoshino H, Ogaki K, Oyama G, Nishioka K, Nonaka R, Akamatsu W, Blauwendraat C, Hattori N. Long-Read Sequencing Resolves a Complex Structural Variant in PRKN Parkinson's Disease. Mov Disord 2023; 38:2249-2257. [PMID: 37926948 PMCID: PMC10843047 DOI: 10.1002/mds.29610] [Received: 06/27/2023] [Revised: 08/17/2023] [Accepted: 09/11/2023] [Indexed: 11/07/2023] Open
Abstract
BACKGROUND Parkin RBR E3 ubiquitin-protein ligase (PRKN) mutations are the most common cause of young onset and autosomal recessive Parkinson's disease (PD). PRKN is located in FRA6E, which is one of the common fragile sites in the human genome, making this region prone to structural variants. However, complex structural variants such as inversions of PRKN are seldom reported, suggesting that there are potentially unrevealed complex pathogenic PRKN structural variants. OBJECTIVES To identify complex structural variants in PRKN using long-read sequencing. METHODS We investigated the genetic cause of monozygotic twins presenting with a young onset dystonia-parkinsonism using targeted sequencing, whole exome sequencing, multiple ligation probe amplification, and long-read sequencing. We assessed the presence and frequency of complex inversions overlapping PRKN using whole-genome sequencing data of Accelerating Medicines Partnership Parkinson's disease (AMP-PD) and United Kingdom (UK)-Biobank datasets. RESULTS Multiple ligation probe amplification identified a heterozygous exon three deletion in PRKN and long-read sequencing identified a large novel inversion spanning over 7 Mb, including a large part of the coding DNA sequence of PRKN. We could diagnose the affected subjects as compound heterozygous carriers of PRKN. We analyzed whole genome sequencing data of 43,538 participants of the UK-Biobank and 4941 participants of the AMP-PD datasets. Nine inversions in the UK-Biobank and two in AMP PD were identified and were considered potentially damaging and likely to affect PRKN expression. CONCLUSIONS This is the first report describing a large 7 Mb inversion involving breakpoints outside of PRKN. This study highlights the importance of using long-read sequencing for structural variant analysis in unresolved young-onset PD cases. © 2023 The Authors. Movement Disorders published by Wiley Periodicals LLC on behalf of International Parkinson and Movement Disorder Society. 
This article has been contributed to by U.S. Government employees and their work is in the public domain in the USA.
Affiliation(s)
- Kensuke Daida
- Integrative Neurogenomics Unit, Laboratory of Neurogenetics, National Institute on Aging, National Institutes of Health, Bethesda, MD, USA
- Center for Alzheimer’s and Related Dementias (CARD), National Institute on Aging and National Institute of Neurological Disorders and Stroke, National Institutes of Health, Bethesda, MD, USA
- Department of Neurology, Faculty of Medicine, Juntendo University, Hongo, Tokyo, Japan
- Manabu Funayama
- Department of Neurology, Faculty of Medicine, Juntendo University, Hongo, Tokyo, Japan
- Research Institute for Diseases of Old Age, Graduate School of Medicine, Juntendo University, Tokyo, Japan
- Kimberley J Billingsley
- Molecular Genetics Section, Laboratory of Neurogenetics, National Institute on Aging, National Institutes of Health, Bethesda, MD, USA
- Laksh Malik
- Center for Alzheimer’s and Related Dementias (CARD), National Institute on Aging and National Institute of Neurological Disorders and Stroke, National Institutes of Health, Bethesda, MD, USA
- Abigail Miano-Burkhardt
- Molecular Genetics Section, Laboratory of Neurogenetics, National Institute on Aging, National Institutes of Health, Bethesda, MD, USA
- Hampton L. Leonard
- Center for Alzheimer’s and Related Dementias (CARD), National Institute on Aging and National Institute of Neurological Disorders and Stroke, National Institutes of Health, Bethesda, MD, USA
- Molecular Genetics Section, Laboratory of Neurogenetics, National Institute on Aging, National Institutes of Health, Bethesda, MD, USA
- Data Tecnica International LLC, Washington, DC, USA
- German Center for Neurodegenerative Diseases (DZNE), Tübingen, Germany
- Mary B. Makarious
- Molecular Genetics Section, Laboratory of Neurogenetics, National Institute on Aging, National Institutes of Health, Bethesda, MD, USA
- Department of Clinical and Movement Neurosciences, UCL Queen Square Institute of Neurology, London, UK, WC1N 3BG
- UCL Movement Disorders Centre, University College London, London, UK, WC1N 3BG
- Hirotaka Iwaki
- Center for Alzheimer’s and Related Dementias (CARD), National Institute on Aging and National Institute of Neurological Disorders and Stroke, National Institutes of Health, Bethesda, MD, USA
- Data Tecnica International LLC, Washington, DC, USA
- Jinhui Ding
- Computational Biology Group, Laboratory of Neurogenetics, National Institute on Aging, NIH, Porter Neuroscience Research Center, Bethesda, MD, USA
- J. Raphael Gibbs
- Computational Biology Group, Laboratory of Neurogenetics, National Institute on Aging, NIH, Porter Neuroscience Research Center, Bethesda, MD, USA
- Mayu Ishiguro
- Department of Neurology, Faculty of Medicine, Juntendo University, Hongo, Tokyo, Japan
- Hiroyo Yoshino
- Research Institute for Diseases of Old Age, Graduate School of Medicine, Juntendo University, Tokyo, Japan
- Kotaro Ogaki
- Department of Neurology, Faculty of Medicine, Juntendo University, Hongo, Tokyo, Japan
- Genko Oyama
- Department of Neurology, Faculty of Medicine, Juntendo University, Hongo, Tokyo, Japan
- Kenya Nishioka
- Department of Neurology, Juntendo Tokyo Koto Geriatric Medical Center, 3-3-20 Shinsuna, Koto-ku, Tokyo 136-0075
- Risa Nonaka
- Department of Neurology, Faculty of Medicine, Juntendo University, Hongo, Tokyo, Japan
- Department of Clinical Data of Parkinson’s Disease, Graduate School of Medicine, Juntendo University, Tokyo, Japan
- Center for Genomic and Regenerative Medicine, Graduate School of Medicine, Juntendo University, Tokyo, Japan
- Wado Akamatsu
- Center for Genomic and Regenerative Medicine, Graduate School of Medicine, Juntendo University, Tokyo, Japan
- Cornelis Blauwendraat
- Integrative Neurogenomics Unit, Laboratory of Neurogenetics, National Institute on Aging, National Institutes of Health, Bethesda, MD, USA
- Center for Alzheimer’s and Related Dementias (CARD), National Institute on Aging and National Institute of Neurological Disorders and Stroke, National Institutes of Health, Bethesda, MD, USA
- Nobutaka Hattori
- Department of Neurology, Faculty of Medicine, Juntendo University, Hongo, Tokyo, Japan
- Research Institute for Diseases of Old Age, Graduate School of Medicine, Juntendo University, Tokyo, Japan
- Neurodegenerative Disorders Collaborative Laboratory, RIKEN Center for Brain Science, Wako, Saitama, Japan
19
He Y, Chu Y, Guo S, Hu J, Li R, Zheng Y, Ma X, Du Z, Zhao L, Yu W, Xue J, Bian W, Yang F, Chen X, Zhang P, Wu R, Ma Y, Shao C, Chen J, Wang J, Li J, Wu J, Hu X, Long Q, Jiang M, Ye H, Song S, Li G, Wei Y, Xu Y, Ma Y, Chen Y, Wang K, Bao J, Xi W, Wang F, Ni W, Zhang M, Yu Y, Li S, Kang Y, Gao Z. T2T-YAO: A Telomere-to-telomere Assembled Diploid Reference Genome for Han Chinese. Genomics Proteomics Bioinformatics 2023; 21:1085-1100. [PMID: 37595788 DOI: 10.1016/j.gpb.2023.08.001] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/18/2023] [Revised: 08/01/2023] [Accepted: 08/08/2023] [Indexed: 08/20/2023]
Abstract
Since its initial release in 2001, the human reference genome has undergone continuous improvement in quality, and the recently released telomere-to-telomere (T2T) version - T2T-CHM13 - reaches its highest level of continuity and accuracy after 20 years of effort by working on a simplified, nearly homozygous genome of a hydatidiform mole cell line. Here, to provide an authentic complete diploid human genome reference for the Han Chinese, the largest population in the world, we assembled the genome of a male Han Chinese individual, T2T-YAO, which includes T2T assemblies of all the 22 + X + M and 22 + Y chromosomes in both haploids. The quality of T2T-YAO is much better than those of all currently available diploid assemblies, and its haploid version, T2T-YAO-hp, generated by selecting the better assembly for each autosome, reaches the top quality of fewer than one error per 29.5 Mb, even higher than that of T2T-CHM13. Derived from an individual living in the aboriginal region of the Han population, T2T-YAO shows clear ancestry and potential genetic continuity from the ancient ancestors. Each haplotype of T2T-YAO possesses ~330-Mb exclusive sequences, ~3100 unique genes, and tens of thousands of nucleotide and structural variations as compared with CHM13, highlighting the necessity of a population-stratified reference genome. The construction of T2T-YAO, an accurate and authentic representative of the Chinese population, would enable precise delineation of genomic variations and advance our understandings in the hereditability of diseases and phenotypes, especially within the context of the unique variations of the Chinese population.
Affiliation(s)
- Yukun He
- Department of Respiratory and Critical Care Medicine, Peking University People's Hospital, Beijing 100044, China
- Yanan Chu
- CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing 100101, China
- Shuming Guo
- Linfen Clinical Medicine Research Center, Linfen 041000, China; Institute of Chest and Lung Diseases, Shanxi Medical University, Taiyuan 030001, China
- Jiang Hu
- GrandOmics Biosciences Co., Ltd, Wuhan 430076, China
- Ran Li
- Department of Respiratory and Critical Care Medicine, Peking University People's Hospital, Beijing 100044, China; Beijing Key Laboratory of Genome and Precision Medicine Technologies, Beijing 100101, China
- Yali Zheng
- Department of Respiratory, Critical Care and Sleep Medicine, Xiang'an Hospital of Xiamen University, School of Medicine, Xiamen University, Xiamen 361101, China
- Xinqian Ma
- Department of Respiratory and Critical Care Medicine, Peking University People's Hospital, Beijing 100044, China; Beijing Key Laboratory of Genome and Precision Medicine Technologies, Beijing 100101, China
- Zhenglin Du
- Institute of PSI Genomics, Wenzhou 325024, China
- Lili Zhao
- Department of Respiratory and Critical Care Medicine, Peking University People's Hospital, Beijing 100044, China; Beijing Key Laboratory of Genome and Precision Medicine Technologies, Beijing 100101, China
- Wenyi Yu
- Department of Respiratory and Critical Care Medicine, Peking University People's Hospital, Beijing 100044, China; Beijing Key Laboratory of Genome and Precision Medicine Technologies, Beijing 100101, China
- Jianbo Xue
- Department of Respiratory and Critical Care Medicine, Peking University People's Hospital, Beijing 100044, China; Beijing Key Laboratory of Genome and Precision Medicine Technologies, Beijing 100101, China
- Wenjie Bian
- Department of Respiratory and Critical Care Medicine, Peking University People's Hospital, Beijing 100044, China; Beijing Key Laboratory of Genome and Precision Medicine Technologies, Beijing 100101, China
- Feifei Yang
- Department of Respiratory and Critical Care Medicine, Peking University People's Hospital, Beijing 100044, China; Beijing Key Laboratory of Genome and Precision Medicine Technologies, Beijing 100101, China
- Xi Chen
- Department of Respiratory and Critical Care Medicine, Peking University People's Hospital, Beijing 100044, China; Beijing Key Laboratory of Genome and Precision Medicine Technologies, Beijing 100101, China
- Pingan Zhang
- Department of Respiratory and Critical Care Medicine, Peking University People's Hospital, Beijing 100044, China; Beijing Key Laboratory of Genome and Precision Medicine Technologies, Beijing 100101, China
- Rihan Wu
- Department of Respiratory and Critical Care Medicine, Peking University People's Hospital, Beijing 100044, China; Beijing Key Laboratory of Genome and Precision Medicine Technologies, Beijing 100101, China
- Yifan Ma
- Department of Respiratory and Critical Care Medicine, Peking University People's Hospital, Beijing 100044, China; Beijing Key Laboratory of Genome and Precision Medicine Technologies, Beijing 100101, China
- Changjun Shao
- CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing 100101, China
- Jing Chen
- CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing 100101, China
- Jian Wang
- CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing 100101, China
- Jiwei Li
- Department of Respiratory, Critical Care and Sleep Medicine, Xiang'an Hospital of Xiamen University, School of Medicine, Xiamen University, Xiamen 361101, China
- Jing Wu
- Department of Respiratory, Critical Care and Sleep Medicine, Xiang'an Hospital of Xiamen University, School of Medicine, Xiamen University, Xiamen 361101, China
- Xiaoyi Hu
- Department of Respiratory, Critical Care and Sleep Medicine, Xiang'an Hospital of Xiamen University, School of Medicine, Xiamen University, Xiamen 361101, China
- Qiuyue Long
- Department of Respiratory, Critical Care and Sleep Medicine, Xiang'an Hospital of Xiamen University, School of Medicine, Xiamen University, Xiamen 361101, China
- Mingzheng Jiang
- Department of Respiratory, Critical Care and Sleep Medicine, Xiang'an Hospital of Xiamen University, School of Medicine, Xiamen University, Xiamen 361101, China
- Hongli Ye
- Department of Respiratory, Critical Care and Sleep Medicine, Xiang'an Hospital of Xiamen University, School of Medicine, Xiamen University, Xiamen 361101, China
- Shixu Song
- Department of Respiratory, Critical Care and Sleep Medicine, Xiang'an Hospital of Xiamen University, School of Medicine, Xiamen University, Xiamen 361101, China
- Guangyao Li
- Linfen Clinical Medicine Research Center, Linfen 041000, China
- Yue Wei
- Linfen Clinical Medicine Research Center, Linfen 041000, China
- Yu Xu
- Beijing Jishuitan Hospital, Capital Medical University, Beijing 100035, China
- Yanliang Ma
- Department of Respiratory and Critical Care Medicine, Peking University People's Hospital, Beijing 100044, China; Beijing Key Laboratory of Genome and Precision Medicine Technologies, Beijing 100101, China
- Yanwen Chen
- Department of Respiratory and Critical Care Medicine, Peking University People's Hospital, Beijing 100044, China; Beijing Key Laboratory of Genome and Precision Medicine Technologies, Beijing 100101, China
- Keqiang Wang
- Department of Respiratory and Critical Care Medicine, Peking University People's Hospital, Beijing 100044, China; Beijing Key Laboratory of Genome and Precision Medicine Technologies, Beijing 100101, China
- Jing Bao
- Department of Respiratory and Critical Care Medicine, Peking University People's Hospital, Beijing 100044, China; Beijing Key Laboratory of Genome and Precision Medicine Technologies, Beijing 100101, China
- Wen Xi
- Department of Respiratory and Critical Care Medicine, Peking University People's Hospital, Beijing 100044, China; Beijing Key Laboratory of Genome and Precision Medicine Technologies, Beijing 100101, China
- Fang Wang
- Department of Respiratory and Critical Care Medicine, Peking University People's Hospital, Beijing 100044, China; Beijing Key Laboratory of Genome and Precision Medicine Technologies, Beijing 100101, China
- Wentao Ni
- Department of Respiratory and Critical Care Medicine, Peking University People's Hospital, Beijing 100044, China; Beijing Key Laboratory of Genome and Precision Medicine Technologies, Beijing 100101, China
- Moqin Zhang
- Department of Respiratory and Critical Care Medicine, Peking University People's Hospital, Beijing 100044, China; Beijing Key Laboratory of Genome and Precision Medicine Technologies, Beijing 100101, China
- Yan Yu
- Department of Respiratory and Critical Care Medicine, Peking University People's Hospital, Beijing 100044, China; Beijing Key Laboratory of Genome and Precision Medicine Technologies, Beijing 100101, China
- Shengnan Li
- Department of Respiratory and Critical Care Medicine, Peking University People's Hospital, Beijing 100044, China; Beijing Key Laboratory of Genome and Precision Medicine Technologies, Beijing 100101, China
- Yu Kang
- CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing 100101, China; University of Chinese Academy of Sciences, Beijing 100490, China.
- Zhancheng Gao
- Department of Respiratory and Critical Care Medicine, Peking University People's Hospital, Beijing 100044, China; Institute of Chest and Lung Diseases, Shanxi Medical University, Taiyuan 030001, China; Beijing Key Laboratory of Genome and Precision Medicine Technologies, Beijing 100101, China.
20
Bartoš O, Klimešová B, Volfová K, Chmel M, Dresler J, Pajer P, Kabíčková H, Adamík P, Modrý D, Fučíková AM, Votýpka J. Two novel Bartonella (sub)species isolated from edible dormice (Glis glis): hints of cultivation stress-induced genomic changes. Front Microbiol 2023; 14:1289671. [PMID: 38033559 PMCID: PMC10684924 DOI: 10.3389/fmicb.2023.1289671] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/06/2023] [Accepted: 10/23/2023] [Indexed: 12/02/2023]
Abstract
Bartonelloses are neglected emerging infectious diseases caused by facultatively intracellular bacteria transmitted between vertebrate hosts by various arthropod vectors. The highest diversity of Bartonella species has been identified in rodents. Within this study we focused on the edible dormouse (Glis glis), a rodent with unique life-history traits that often enters households and whose possible role in the epidemiology of Bartonella infections had been previously unknown. We identified and cultivated two distinct Bartonella (sub)species significantly diverging from previously described species, which were characterized using growth characteristics, biochemical tests, and various molecular techniques including also proteomics. Two novel (sub)species were described: Bartonella grahamii subsp. shimonis subsp. nov. and Bartonella gliris sp. nov. We sequenced two individual strains per each described (sub)species. During exploratory genomic analyses comparing two genotypes ultimately belonging to the same species, both factually and most importantly even spatiotemporally, we noticed unexpectedly significant structural variation between them. We found that most of the detected structural variants could be explained either by prophage excision or integration. Based on a detailed study of one such event, we argue that prophage deletion represents the most probable explanation of the observed phenomena. Moreover, in one strain of Bartonella grahamii subsp. shimonis subsp. nov. we identified a deletion related to Bartonella Adhesin A, a major pathogenicity factor that modulates bacteria-host interactions. Altogether, our results suggest that even a limited number of passages induced sufficient selective pressure to promote significant changes at the level of the genome.
Affiliation(s)
- Oldřich Bartoš
- Military Health Institute, Military Medical Agency, Prague, Czechia
- Běla Klimešová
- Department of Parasitology, Faculty of Science, Charles University, Prague, Czechia
- Karolina Volfová
- Department of Parasitology, Faculty of Science, Charles University, Prague, Czechia
- Martin Chmel
- Military Health Institute, Military Medical Agency, Prague, Czechia
- Department of Infectious Diseases, First Faculty of Medicine, Charles University and Military University Hospital Prague, Prague, Czechia
- Jiří Dresler
- Military Health Institute, Military Medical Agency, Prague, Czechia
- Petr Pajer
- Military Health Institute, Military Medical Agency, Prague, Czechia
- Hana Kabíčková
- Military Health Institute, Military Medical Agency, Prague, Czechia
- Peter Adamík
- Department of Zoology, Faculty of Science, Palacký University, Olomouc, Czechia
- Museum of Natural History, Olomouc, Czechia
- David Modrý
- Institute of Parasitology, Biology Centre, Czech Academy of Sciences, České Budějovice, Czechia
- Department of Botany and Zoology, Faculty of Science, Masaryk University, Brno, Czechia
- Department of Veterinary Sciences/CINeZ, Faculty of Agrobiology, Food and Natural Resources, Czech University of Life Sciences Prague, Prague, Czechia
- Jan Votýpka
- Department of Parasitology, Faculty of Science, Charles University, Prague, Czechia
- Institute of Parasitology, Biology Centre, Czech Academy of Sciences, České Budějovice, Czechia
21
Chandra G, Jain C. Gap-Sensitive Colinear Chaining Algorithms for Acyclic Pangenome Graphs. J Comput Biol 2023; 30:1182-1197. [PMID: 37902967 DOI: 10.1089/cmb.2023.0186] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/01/2023]
Abstract
A pangenome graph can serve as a better reference for genomic studies because it allows a compact representation of multiple genomes within a species. Aligning sequences to a graph is critical for pangenome-based resequencing. The seed-chain-extend heuristic works by finding short exact matches between a sequence and a graph. In this heuristic, colinear chaining helps identify a good cluster of exact matches that can be combined to form an alignment. Colinear chaining algorithms have been extensively studied for aligning two sequences with various gap costs, including linear, concave, and convex cost functions. However, extending these algorithms for sequence-to-graph alignment presents significant challenges. Recently, Makinen et al. introduced a sparse dynamic programming framework that exploits the small path cover property of acyclic pangenome graphs, enabling efficient chaining. However, this framework does not consider gap costs, limiting its practical effectiveness. We address this limitation by developing novel problem formulations and provably good chaining algorithms that support a variety of gap cost functions. These functions are carefully designed to enable fast chaining algorithms whose time requirements are parameterized in terms of the size of the minimum path cover. Through an empirical evaluation, we demonstrate the superior performance of our algorithm compared with existing aligners. When mapping simulated long reads to a pangenome graph comprising 95 human haplotypes, we achieved 98.7% precision while leaving <2% of reads unmapped.
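For readers unfamiliar with the chaining step mentioned in this abstract, the following is a minimal sketch of colinear chaining between two linear sequences, not the graph-based algorithm of the cited paper. It assumes anchors are given as (query_start, ref_start, length) tuples of exact matches with length at least 1, uses a simple quadratic-time dynamic program, and charges a linear gap cost on the difference between the query gap and the reference gap. The function name `chain_anchors` and the parameter `gap_penalty` are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

# An anchor is an exact match: (query_start, ref_start, length).
Anchor = Tuple[int, int, int]


def chain_anchors(anchors: List[Anchor],
                  gap_penalty: float = 0.5) -> Tuple[float, List[Anchor]]:
    """Return the best-scoring colinear chain and its score.

    Illustrative O(n^2) dynamic program over anchors sorted by query start.
    A chain gains the length of every anchor it contains and pays
    gap_penalty times the difference between the query-side gap and the
    reference-side gap for each pair of consecutive anchors.
    """
    if not anchors:
        return 0.0, []

    order = sorted(anchors)
    n = len(order)
    score = [float(a[2]) for a in order]  # best chain score ending at anchor j
    prev = [-1] * n                       # predecessor index for traceback

    for j in range(n):
        qj, rj, lj = order[j]
        for i in range(j):
            qi, ri, li = order[i]
            # Anchor i may precede j only if it ends before j starts
            # on both the query and the reference (no overlap).
            if qi + li <= qj and ri + li <= rj:
                gap = abs((qj - (qi + li)) - (rj - (ri + li)))
                candidate = score[i] + lj - gap_penalty * gap
                if candidate > score[j]:
                    score[j] = candidate
                    prev[j] = i

    # Trace back from the anchor with the highest chain score.
    best = max(range(n), key=score.__getitem__)
    chain = []
    k = best
    while k != -1:
        chain.append(order[k])
        k = prev[k]
    chain.reverse()
    return score[best], chain


if __name__ == "__main__":
    example = [(0, 0, 5), (6, 6, 5), (12, 30, 5), (13, 13, 4)]
    print(chain_anchors(example))
```

In this toy input the anchor (12, 30, 5) lies far off the diagonal shared by the other three, so the gap cost keeps it out of the best chain. Practical mappers replace the quadratic inner loop with sparse dynamic programming or a bounded look-back window.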
Affiliation(s)
- Ghanshyam Chandra
- Department of Computational and Data Sciences, Indian Institute of Science Bengaluru, India
- Chirag Jain
- Department of Computational and Data Sciences, Indian Institute of Science Bengaluru, India
22
Zeng T, He Z, He J, Lv W, Huang S, Li J, Zhu L, Wan S, Zhou W, Yang Z, Zhang Y, Luo C, He J, Wang C, Wang L. The telomere-to-telomere gap-free reference genome of wild blueberry (Vaccinium duclouxii) provides its high soluble sugar and anthocyanin accumulation. Hortic Res 2023; 10:uhad209. [PMID: 38023474 PMCID: PMC10681038 DOI: 10.1093/hr/uhad209] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/30/2023] [Revised: 10/19/2023] [Indexed: 12/01/2023]
Abstract
Vaccinium duclouxii, endemic to southwestern China, is a berry-producing shrub or small tree belonging to the Ericaceae family, with high nutritive, medicinal, and ornamental value, abundant germplasm resources, and good edible properties. In addition, V. duclouxii exhibits strong tolerance to adverse environmental conditions, making it a promising candidate for research and offering wide-ranging possibilities for utilization. However, the lack of V. duclouxii genome sequence has hampered its development and utilization. Here, a high-quality telomere-to-telomere genome sequence of V. duclouxii was de novo assembled and annotated. All of 12 chromosomes were assembled into gap-free single contigs, providing the highest integrity and quality assembly reported so far for blueberry. The V. duclouxii genome is 573.67 Mb, which encodes 41 953 protein-coding genes. Combining transcriptomics and metabolomics analyses, we have uncovered the molecular mechanisms involved in sugar and acid accumulation and anthocyanin biosynthesis in V. duclouxii. This provides essential molecular information for further research on the quality of V. duclouxii. Moreover, the high-quality telomere-to-telomere assembly of the V. duclouxii genome will provide insights into the genomic evolution of Vaccinium and support advancements in blueberry genetics and molecular breeding.
Affiliation(s)
- Tuo Zeng
- School of Life Sciences, Guizhou Normal University, Guiyang 550000, China
- Zhijiao He
- Institute of Alpine Economic Plant, Yunnan Academy of Agricultural Sciences, Lijiang 674199, Yunnan, China
- Jiefang He
- School of Life Sciences, Guizhou Normal University, Guiyang 550000, China
- Wei Lv
- School of Life Sciences, Guizhou Normal University, Guiyang 550000, China
- Shixiang Huang
- School of Life Sciences, Guizhou Normal University, Guiyang 550000, China
- Jiawen Li
- School of Advanced Agricultural Sciences, Peking University, 100871 Beijing, China
- Liyong Zhu
- National Key Laboratory for Germplasm Innovation & Utilization of Horticultural Crops, College of Horticulture & Forestry Sciences, Huazhong Agricultural University, Wuhan 430070, China
- Shuang Wan
- Wuhan Benagen Technology Co., Ltd, Wuhan 430070, China
- Wanfei Zhou
- National Key Laboratory for Germplasm Innovation & Utilization of Horticultural Crops, College of Horticulture & Forestry Sciences, Huazhong Agricultural University, Wuhan 430070, China
- Zhengsong Yang
- Institute of Alpine Economic Plant, Yunnan Academy of Agricultural Sciences, Lijiang 674199, Yunnan, China
- Yatao Zhang
- School of Life Sciences, Guizhou Normal University, Guiyang 550000, China
- Chong Luo
- School of Life Sciences, Guizhou Normal University, Guiyang 550000, China
- Jiawei He
- Institute of Alpine Economic Plant, Yunnan Academy of Agricultural Sciences, Lijiang 674199, Yunnan, China
- Caiyun Wang
- National Key Laboratory for Germplasm Innovation & Utilization of Horticultural Crops, College of Horticulture & Forestry Sciences, Huazhong Agricultural University, Wuhan 430070, China
- Liangsheng Wang
- Key Laboratory of Plant Resources, Institute of Botany, Chinese Academy of Sciences, Beijing 100093, China
- China National Botanical Garden, Beijing 100093, China
- University of Chinese Academy of Sciences, Beijing 100049, China
23
Majidian S, Agustinho DP, Chin CS, Sedlazeck FJ, Mahmoud M. Genomic variant benchmark: if you cannot measure it, you cannot improve it. Genome Biol 2023; 24:221. [PMID: 37798733 PMCID: PMC10552390 DOI: 10.1186/s13059-023-03061-1] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Received: 11/28/2022] [Accepted: 09/18/2023] [Indexed: 10/07/2023]
Abstract
Genomic benchmark datasets are essential to driving the field of genomics and bioinformatics. They provide a snapshot of the performances of sequencing technologies and analytical methods and highlight future challenges. However, they depend on sequencing technology, reference genome, and available benchmarking methods. Thus, creating a genomic benchmark dataset is laborious and highly challenging, often involving multiple sequencing technologies, different variant calling tools, and laborious manual curation. In this review, we discuss the available benchmark datasets and their utility. Additionally, we focus on the most recent benchmark of genes with medical relevance and challenging genomic complexity.
Affiliation(s)
- Sina Majidian
- Department of Computational Biology, University of Lausanne, 1015, Lausanne, Switzerland
- SIB Swiss Institute of Bioinformatics, 1015, Lausanne, Switzerland
- Fritz J Sedlazeck
- Baylor College of Medicine, Human Genome Sequencing Center, Houston, TX, 77030, USA.
- Department of Computer Science, Rice University, 6100 Main Street, Houston, TX, 77005, USA.
- Medhat Mahmoud
- Baylor College of Medicine, Human Genome Sequencing Center, Houston, TX, 77030, USA.
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, USA.
24
Yang C, Zhou Y, Song Y, Wu D, Zeng Y, Nie L, Liu P, Zhang S, Chen G, Xu J, Zhou H, Zhou L, Qian X, Liu C, Tan S, Zhou C, Dai W, Xu M, Qi Y, Wang X, Guo L, Fan G, Wang A, Deng Y, Zhang Y, Jin J, He Y, Guo C, Guo G, Zhou Q, Xu X, Yang H, Wang J, Xu S, Mao Y, Jin X, Ruan J, Zhang G. The complete and fully-phased diploid genome of a male Han Chinese. Cell Res 2023; 33:745-761. [PMID: 37452091 PMCID: PMC10542383 DOI: 10.1038/s41422-023-00849-5] [Citation(s) in RCA: 6] [Impact Index Per Article: 6.0] [Received: 05/06/2023] [Accepted: 06/29/2023] [Indexed: 07/18/2023]
Abstract
Since the release of the complete human genome, the priority of human genomic study has now been shifting towards closing gaps in ethnic diversity. Here, we present a fully phased and well-annotated diploid human genome from a Han Chinese male individual (CN1), in which the assemblies of both haploids achieve the telomere-to-telomere (T2T) level. Comparison of this diploid genome with the CHM13 haploid T2T genome revealed significant variations in the centromere. Outside the centromere, we discovered 11,413 structural variations, including numerous novel ones. We also detected thousands of CN1 alleles that have accumulated high substitution rates and a few that have been under positive selection in the East Asian population. Further, we found that CN1 outperforms CHM13 as a reference genome in mapping and variant calling for the East Asian population owing to the distinct structural variants of the two references. Comparison of SNP calling for a large cohort of 8869 Chinese genomes using CN1 and CHM13 as reference respectively showed that the reference bias profoundly impacts rare SNP calling, with nearly 2 million rare SNPs miss-called with different reference genomes. Finally, applying the CN1 as a reference, we discovered 5.80 Mb and 4.21 Mb putative introgression sequences from Neanderthal and Denisovan, respectively, including many East Asian specific ones undetected using CHM13 as the reference. Our analyses reveal the advances of using CN1 as a reference for population genomic studies and paleo-genomic studies. This complete genome will serve as an alternative reference for future genomic studies on the East Asian population.
Affiliation(s)
- Chentao Yang
- Center for Genomic Research, International Institutes of Medicine, The Fourth Affiliated Hospital, Zhejiang University School of Medicine, Yiwu, Zhejiang, China
- Center for Evolutionary & Organismal Biology, & Women's Hospital, Zhejiang University School of Medicine, Hangzhou, Zhejiang, China
- BGI-Shenzhen, Shenzhen, Guangdong, China
| | - Yang Zhou
- BGI-Shenzhen, Shenzhen, Guangdong, China
- BGI Research-Wuhan, BGI, Wuhan, Hubei, China
| | - Yanni Song
- Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, Guangdong, China
| | - Dongya Wu
- Center for Genomic Research, International Institutes of Medicine, The Fourth Affiliated Hospital, Zhejiang University School of Medicine, Yiwu, Zhejiang, China
- Center for Evolutionary & Organismal Biology, & Women's Hospital, Zhejiang University School of Medicine, Hangzhou, Zhejiang, China
- Liangzhu Laboratory, Zhejiang University Medical Center, Hangzhou, Zhejiang, China
- Institute of Crop Science & Institute of Bioinformatics, Zhejiang University, Hangzhou, Zhejiang, China
| | - Yan Zeng
- BGI-Shenzhen, Shenzhen, Guangdong, China
| | - Lei Nie
- BGI-Shenzhen, Shenzhen, Guangdong, China
| | | | - Shilong Zhang
- Bio-X Institutes, Key Laboratory for the Genetics of Developmental and Neuropsychiatric Disorders, Ministry of Education, Shanghai Jiao Tong University, Shanghai, China
| | - Guangji Chen
- BGI-Shenzhen, Shenzhen, Guangdong, China
- College of Life Sciences, University of Chinese Academy of Sciences, Beijing, China
| | - Jinjin Xu
- BGI-Shenzhen, Shenzhen, Guangdong, China
| | - Hongling Zhou
- Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, Guangdong, China
| | - Long Zhou
- Center for Evolutionary & Organismal Biology, & Women's Hospital, Zhejiang University School of Medicine, Hangzhou, Zhejiang, China
- Liangzhu Laboratory, Zhejiang University Medical Center, Hangzhou, Zhejiang, China
- Innovation Center of Yangtze River Delta, Zhejiang University, Hangzhou, Zhejiang, China
| | - Xiaobo Qian
- BGI-Shenzhen, Shenzhen, Guangdong, China
- College of Life Sciences, University of Chinese Academy of Sciences, Beijing, China
| | - Chenlu Liu
- Life Sciences Institute, Zhejiang University, Hangzhou, Zhejiang, China
| | | | | | - Wei Dai
- BGI-Shenzhen, Shenzhen, Guangdong, China
| | - Mengyang Xu
- BGI-Shenzhen, Shenzhen, Guangdong, China
- BGI-Qingdao, BGI-Shenzhen, Qingdao, Shandong, China
| | - Yanwei Qi
- BGI-Qingdao, BGI-Shenzhen, Qingdao, Shandong, China
| | - Xiaobo Wang
- Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, Guangdong, China
| | - Lidong Guo
- College of Life Sciences, University of Chinese Academy of Sciences, Beijing, China
- BGI-Qingdao, BGI-Shenzhen, Qingdao, Shandong, China
| | - Guangyi Fan
- BGI-Qingdao, BGI-Shenzhen, Qingdao, Shandong, China
| | - Aijun Wang
- BGI-Qingdao, BGI-Shenzhen, Qingdao, Shandong, China
| | - Yuan Deng
- BGI-Shenzhen, Shenzhen, Guangdong, China
| | - Yong Zhang
- BGI-Shenzhen, Shenzhen, Guangdong, China
| | | | - Yunqiu He
- Center for Genomic Research, International Institutes of Medicine, The Fourth Affiliated Hospital, Zhejiang University School of Medicine, Yiwu, Zhejiang, China
- Center for Evolutionary & Organismal Biology, & Women's Hospital, Zhejiang University School of Medicine, Hangzhou, Zhejiang, China
| | - Chunxue Guo
- BGI-Shenzhen, Shenzhen, Guangdong, China
- BGI-Hangzhou, Hangzhou, Zhejiang, China
| | - Guoji Guo
- School of Medicine, Zhejiang University, Hangzhou, Zhejiang, China
| | - Qing Zhou
- Liangzhu Laboratory, Zhejiang University Medical Center, Hangzhou, Zhejiang, China
- Life Sciences Institute, Zhejiang University, Hangzhou, Zhejiang, China
| | - Xun Xu
- BGI-Shenzhen, Shenzhen, Guangdong, China
| | | | - Jian Wang
- BGI-Shenzhen, Shenzhen, Guangdong, China
| | - Shuhua Xu
- State Key Laboratory of Genetic Engineering, Center for Evolutionary Biology, Collaborative Innovation Center for Genetics and Development, School of Life Sciences, Fudan University, Shanghai, China
- Human Phenome Institute, Zhangjiang Fudan International Innovation Center, and Ministry of Education Key Laboratory of Contemporary Anthropology, Fudan University, Shanghai, China
- Jiangsu Key Laboratory of Phylogenomics & Comparative Genomics, International Joint Center of Genomics of Jiangsu Province School of Life Sciences, Jiangsu Normal University, Xuzhou, Jiangsu, China
- Department of Liver Surgery and Transplantation Liver Cancer Institute, Zhongshan Hospital, Fudan University, Shanghai, China
- Center for Excellence in Animal Evolution and Genetics, Chinese Academy of Sciences, Kunming, Yunnan, China
| | - Yafei Mao
- Bio-X Institutes, Key Laboratory for the Genetics of Developmental and Neuropsychiatric Disorders, Ministry of Education, Shanghai Jiao Tong University, Shanghai, China
| | - Xin Jin
- BGI-Shenzhen, Shenzhen, Guangdong, China
| | - Jue Ruan
- Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, Guangdong, China.
| | - Guojie Zhang
- Center for Genomic Research, International Institutes of Medicine, The Fourth Affiliated Hospital, Zhejiang University School of Medicine, Yiwu, Zhejiang, China.
- Center for Evolutionary & Organismal Biology, & Women's Hospital, Zhejiang University School of Medicine, Hangzhou, Zhejiang, China.
- Liangzhu Laboratory, Zhejiang University Medical Center, Hangzhou, Zhejiang, China.
- Innovation Center of Yangtze River Delta, Zhejiang University, Hangzhou, Zhejiang, China.
- State Key Laboratory of Genetic Resources and Evolution, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming, Yunnan, China.
|
25
|
Liu Y, Shen X, Gong Y, Liu Y, Song B, Zeng X. Sequence Alignment/Map format: a comprehensive review of approaches and applications. Brief Bioinform 2023; 24:bbad320. [PMID: 37668049 DOI: 10.1093/bib/bbad320] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/16/2023] [Revised: 08/16/2023] [Accepted: 08/18/2023] [Indexed: 09/06/2023] Open
Abstract
The Sequence Alignment/Map (SAM) format file is the text file used to record alignment information. Alignment is the core of sequencing analysis, and downstream tasks accept mapping results for further processing. Given the rapid development of the sequencing industry today, a comprehensive understanding of the SAM format and related tools is necessary to meet the challenges of data processing and analysis. This paper is devoted to retrieving knowledge in the broad field of SAM. First, the format of SAM is introduced to understand the overall process of the sequencing analysis. Then, existing work is systematically classified in accordance with generation, compression and application, and the involved SAM tools are specifically mined. Lastly, a summary and some thoughts on future directions are provided.
Affiliation(s)
- Yuansheng Liu
- College of Computer Science and Electronic Engineering, Hunan University, 410086, Changsha, China
| | - Xiangzhen Shen
- College of Computer Science and Electronic Engineering, Hunan University, 410086, Changsha, China
| | - Yongshun Gong
- School of Software, Shandong University, 250100, Jinan, China
| | - Yiping Liu
- College of Computer Science and Electronic Engineering, Hunan University, 410086, Changsha, China
| | - Bosheng Song
- College of Computer Science and Electronic Engineering, Hunan University, 410086, Changsha, China
| | - Xiangxiang Zeng
- College of Computer Science and Electronic Engineering, Hunan University, 410086, Changsha, China
|
26
|
Liao X, Zhu W, Zhou J, Li H, Xu X, Zhang B, Gao X. Repetitive DNA sequence detection and its role in the human genome. Commun Biol 2023; 6:954. [PMID: 37726397 PMCID: PMC10509279 DOI: 10.1038/s42003-023-05322-y] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 03/29/2023] [Accepted: 09/04/2023] [Indexed: 09/21/2023] Open
Abstract
Repetitive DNA sequences play critical roles in driving evolution, inducing variation, and regulating gene expression. In this review, we summarize the definition, arrangement, and structural characteristics of repeats. We also introduce the diverse biological functions of repeats and review existing methods for automatic repeat detection, classification, and masking. Finally, we analyze the type, structure, and regulation of repeats in the human genome and their role in the induction of complex diseases. We believe that this review will facilitate a comprehensive understanding of repeats and provide guidance for repeat annotation and in-depth exploration of their association with human diseases.
Affiliation(s)
- Xingyu Liao
- Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal, 23955, Saudi Arabia
| | - Wufei Zhu
- Department of Endocrinology, Yichang Central People's Hospital, The First College of Clinical Medical Science, China Three Gorges University, 443000, Yichang, P.R. China
| | - Juexiao Zhou
- Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal, 23955, Saudi Arabia
| | - Haoyang Li
- Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal, 23955, Saudi Arabia
| | - Xiaopeng Xu
- Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal, 23955, Saudi Arabia
| | - Bin Zhang
- Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal, 23955, Saudi Arabia
| | - Xin Gao
- Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal, 23955, Saudi Arabia.
|
27
|
Kille B, Garrison E, Treangen TJ, Phillippy AM. Minmers are a generalization of minimizers that enable unbiased local Jaccard estimation. Bioinformatics 2023; 39:btad512. [PMID: 37603771 PMCID: PMC10505501 DOI: 10.1093/bioinformatics/btad512] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/18/2023] [Revised: 07/19/2023] [Accepted: 08/18/2023] [Indexed: 08/23/2023] Open
Abstract
MOTIVATION The Jaccard similarity on k-mer sets has been shown to be a convenient proxy for sequence identity. By avoiding expensive base-level alignments and comparing reduced sequence representations, tools such as MashMap can scale to massive numbers of pairwise comparisons while still providing useful similarity estimates. However, due to their reliance on minimizer winnowing, previous versions of MashMap were shown to be biased and inconsistent estimators of Jaccard similarity. This directly impacts downstream tools that rely on the accuracy of these estimates. RESULTS To address this, we propose the minmer winnowing scheme, which generalizes the minimizer scheme by use of a rolling minhash with multiple sampled k-mers per window. We show both theoretically and empirically that minmers yield an unbiased estimator of local Jaccard similarity, and we implement this scheme in an updated version of MashMap. The minmer-based implementation is over 10 times faster than the minimizer-based version under the default ANI threshold, making it well-suited for large-scale comparative genomics applications. AVAILABILITY AND IMPLEMENTATION MashMap3 is available at https://github.com/marbl/MashMap.
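As a rough illustration of the quantities this abstract refers to, the following minimal Python sketch computes the Jaccard similarity of two k-mer sets and a plain minimizer sketch of a sequence. It is not the minmer scheme and not MashMap code: the function names, the BLAKE2b hash, and the parameters `k` and `w` are all assumptions chosen for illustration.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all k-mers (substrings of length k) in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B| of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def stable_hash(kmer):
    """Deterministic 64-bit hash of a k-mer (illustrative choice: BLAKE2b)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minimizers(seq, k, w):
    """Plain minimizer winnowing: keep the k-mer with the smallest hash
    in every window of w consecutive k-mers (hypothetical helper)."""
    ordered = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = set()
    for start in range(max(len(ordered) - w + 1, 0)):
        picked.add(min(ordered[start:start + w], key=stable_hash))
    return picked
```

Comparing `jaccard(kmers(s, k), kmers(t, k))` with `jaccard(minimizers(s, k, w), minimizers(t, k, w))` on a few sequence pairs gives a feel for how a sketch-based estimate can drift from the full k-mer Jaccard, which is the kind of discrepancy the abstract describes.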
Affiliation(s)
- Bryce Kille
- Department of Computer Science, Rice University, Houston, TX, United States
| | - Erik Garrison
- Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, United States
| | - Todd J Treangen
- Department of Computer Science, Rice University, Houston, TX, United States
| | - Adam M Phillippy
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, United States
|
28
|
Rhie A, Nurk S, Cechova M, Hoyt SJ, Taylor DJ, Altemose N, Hook PW, Koren S, Rautiainen M, Alexandrov IA, Allen J, Asri M, Bzikadze AV, Chen NC, Chin CS, Diekhans M, Flicek P, Formenti G, Fungtammasan A, Garcia Giron C, Garrison E, Gershman A, Gerton JL, Grady PGS, Guarracino A, Haggerty L, Halabian R, Hansen NF, Harris R, Hartley GA, Harvey WT, Haukness M, Heinz J, Hourlier T, Hubley RM, Hunt SE, Hwang S, Jain M, Kesharwani RK, Lewis AP, Li H, Logsdon GA, Lucas JK, Makalowski W, Markovic C, Martin FJ, Mc Cartney AM, McCoy RC, McDaniel J, McNulty BM, Medvedev P, Mikheenko A, Munson KM, Murphy TD, Olsen HE, Olson ND, Paulin LF, Porubsky D, Potapova T, Ryabov F, Salzberg SL, Sauria MEG, Sedlazeck FJ, Shafin K, Shepelev VA, Shumate A, Storer JM, Surapaneni L, Taravella Oill AM, Thibaud-Nissen F, Timp W, Tomaszkiewicz M, Vollger MR, Walenz BP, Watwood AC, Weissensteiner MH, Wenger AM, Wilson MA, Zarate S, Zhu Y, Zook JM, Eichler EE, O'Neill RJ, Schatz MC, Miga KH, Makova KD, Phillippy AM. The complete sequence of a human Y chromosome. Nature 2023; 621:344-354. [PMID: 37612512 PMCID: PMC10752217 DOI: 10.1038/s41586-023-06457-y] [Citation(s) in RCA: 41] [Impact Index Per Article: 41.0] [Received: 12/02/2022] [Accepted: 07/19/2023] [Indexed: 08/25/2023]
Abstract
The human Y chromosome has been notoriously difficult to sequence and assemble because of its complex repeat structure that includes long palindromes, tandem repeats and segmental duplications [1-3]. As a result, more than half of the Y chromosome is missing from the GRCh38 reference sequence and it remains the last human chromosome to be finished [4,5]. Here, the Telomere-to-Telomere (T2T) consortium presents the complete 62,460,029-base-pair sequence of a human Y chromosome from the HG002 genome (T2T-Y) that corrects multiple errors in GRCh38-Y and adds over 30 million base pairs of sequence to the reference, showing the complete ampliconic structures of gene families TSPY, DAZ and RBMY; 41 additional protein-coding genes, mostly from the TSPY family; and an alternating pattern of human satellite 1 and 3 blocks in the heterochromatic Yq12 region. We have combined T2T-Y with a previous assembly of the CHM13 genome [4] and mapped available population variation, clinical variants and functional genomics data to produce a complete and comprehensive reference sequence for all 24 human chromosomes.
Affiliation(s)
- Arang Rhie
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | - Sergey Nurk
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
- Oxford Nanopore Technologies Inc., Oxford, UK
| | - Monika Cechova
- Faculty of Informatics, Masaryk University, Brno, Czech Republic
- Department of Biomolecular Engineering, University of California Santa Cruz, Santa Cruz, CA, USA
| | - Savannah J Hoyt
- Department of Molecular and Cell Biology, University of Connecticut, Storrs, CT, USA
| | - Dylan J Taylor
- Department of Biology, Johns Hopkins University, Baltimore, MD, USA
| | - Nicolas Altemose
- Department of Molecular and Cell Biology, University of California, Berkeley, CA, USA
| | - Paul W Hook
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
| | - Sergey Koren
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | - Mikko Rautiainen
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | - Ivan A Alexandrov
- Federal Research Center of Biotechnology of the Russian Academy of Sciences, Moscow, Russia
- Center for Algorithmic Biotechnology, Saint Petersburg State University, St Petersburg, Russia
- Department of Anatomy and Anthropology and Department of Human Molecular Genetics and Biochemistry, Sackler Faculty of Medicine, Tel Aviv University, Tel Aviv-Yafo, Israel
| | - Jamie Allen
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
| | - Mobin Asri
- UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, CA, USA
| | - Andrey V Bzikadze
- Graduate Program in Bioinformatics and Systems Biology, University of California, San Diego, CA, USA
| | - Nae-Chyun Chen
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
| | - Chen-Shan Chin
- GeneDX Holdings Corp, Stamford, CT, USA
- Foundation of Biological Data Science, Belmont, CA, USA
| | - Mark Diekhans
- UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, CA, USA
| | - Paul Flicek
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
- Department of Genetics, University of Cambridge, Cambridge, UK
| | | | | | - Carlos Garcia Giron
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
| | - Erik Garrison
- Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
| | - Ariel Gershman
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
| | - Jennifer L Gerton
- Stowers Institute for Medical Research, Kansas City, MO, USA
- University of Kansas Medical Center, Kansas City, MO, USA
| | - Patrick G S Grady
- Department of Molecular and Cell Biology, University of Connecticut, Storrs, CT, USA
| | - Andrea Guarracino
- Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
- Genomics Research Centre, Human Technopole, Milan, Italy
| | - Leanne Haggerty
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
| | - Reza Halabian
- Institute of Bioinformatics, Faculty of Medicine, University of Münster, Münster, Germany
| | - Nancy F Hansen
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
- Cancer Genetics and Comparative Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | - Robert Harris
- Department of Biology, Pennsylvania State University, University Park, PA, USA
| | - Gabrielle A Hartley
- Department of Molecular and Cell Biology, University of Connecticut, Storrs, CT, USA
| | - William T Harvey
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
| | - Marina Haukness
- UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, CA, USA
| | - Jakob Heinz
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
| | - Thibaut Hourlier
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
| | | | - Sarah E Hunt
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
| | - Stephen Hwang
- XDBio Program, Johns Hopkins University, Baltimore, MD, USA
| | - Miten Jain
- Department of Bioengineering, Department of Physics, Northeastern University, Boston, MA, USA
| | - Rupesh K Kesharwani
- Human Genome Sequencing Center, Baylor College of Medicine, One Baylor Plaza, Houston, TX, USA
| | - Alexandra P Lewis
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
| | - Heng Li
- Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA, USA
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
| | - Glennis A Logsdon
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
| | - Julian K Lucas
- Department of Biomolecular Engineering, University of California Santa Cruz, Santa Cruz, CA, USA
- UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, CA, USA
| | - Wojciech Makalowski
- Institute of Bioinformatics, Faculty of Medicine, University of Münster, Münster, Germany
| | - Christopher Markovic
- Genome Technology Access Center at the McDonnell Genome Institute, Washington University, St. Louis, MO, USA
| | - Fergal J Martin
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
| | - Ann M Mc Cartney
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | - Rajiv C McCoy
- Department of Biology, Johns Hopkins University, Baltimore, MD, USA
| | - Jennifer McDaniel
- Biosystems and Biomaterials Division, National Institute of Standards and Technology, Gaithersburg, MD, USA
| | - Brandy M McNulty
- Department of Biomolecular Engineering, University of California Santa Cruz, Santa Cruz, CA, USA
- UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, CA, USA
| | - Paul Medvedev
- Department of Computer Science and Engineering, Pennsylvania State University, University Park, PA, USA
- Department of Biochemistry and Molecular Biology, Pennsylvania State University, University Park, PA, USA
- Center for Computational Biology and Bioinformatics, Pennsylvania State University, University Park, PA, USA
| | - Alla Mikheenko
- Center for Algorithmic Biotechnology, Saint Petersburg State University, St Petersburg, Russia
- UCL Queen Square Institute of Neurology, UCL, London, UK
| | - Katherine M Munson
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
| | - Terence D Murphy
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
| | - Hugh E Olsen
- Department of Biomolecular Engineering, University of California Santa Cruz, Santa Cruz, CA, USA
- UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, CA, USA
| | - Nathan D Olson
- Biosystems and Biomaterials Division, National Institute of Standards and Technology, Gaithersburg, MD, USA
| | - Luis F Paulin
- Human Genome Sequencing Center, Baylor College of Medicine, One Baylor Plaza, Houston, TX, USA
| | - David Porubsky
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
| | - Tamara Potapova
- Stowers Institute for Medical Research, Kansas City, MO, USA
| | - Fedor Ryabov
- Masters Program in National Research University Higher School of Economics, Moscow, Russia
| | - Steven L Salzberg
- Departments of Biomedical Engineering, Computer Science, and Biostatistics, Johns Hopkins University, Baltimore, MD, USA
| | | | - Fritz J Sedlazeck
- Human Genome Sequencing Center, Baylor College of Medicine, One Baylor Plaza, Houston, TX, USA
- Department of Computer Science, Rice University, Houston, TX, USA
| | | | | | - Alaina Shumate
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
| | | | - Likhitha Surapaneni
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
| | - Angela M Taravella Oill
- Center for Evolution and Medicine, School of Life Sciences, Arizona State University, Tempe, AZ, USA
| | - Françoise Thibaud-Nissen
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
| | - Winston Timp
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
| | - Marta Tomaszkiewicz
- Department of Biology, Pennsylvania State University, University Park, PA, USA
- Department of Biomedical Engineering, Pennsylvania State University, State College, PA, USA
| | - Mitchell R Vollger
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
| | - Brian P Walenz
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | - Allison C Watwood
- Department of Biology, Pennsylvania State University, University Park, PA, USA
| | | | | | - Melissa A Wilson
- Center for Evolution and Medicine, School of Life Sciences, Arizona State University, Tempe, AZ, USA
| | - Samantha Zarate
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
| | - Yiming Zhu
- Human Genome Sequencing Center, Baylor College of Medicine, One Baylor Plaza, Houston, TX, USA
| | - Justin M Zook
- Biosystems and Biomaterials Division, National Institute of Standards and Technology, Gaithersburg, MD, USA
| | - Evan E Eichler
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
- Investigator, Howard Hughes Medical Institute, University of Washington, Seattle, WA, USA
| | - Rachel J O'Neill
- Department of Molecular and Cell Biology, University of Connecticut, Storrs, CT, USA
- Institute for Systems Genomics, University of Connecticut, Storrs, CT, USA
- Department of Genetics and Genome Sciences, UConn Health, Farmington, CT, USA
| | - Michael C Schatz
- Department of Biology, Johns Hopkins University, Baltimore, MD, USA
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
| | - Karen H Miga
- Department of Biomolecular Engineering, University of California Santa Cruz, Santa Cruz, CA, USA
- UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, CA, USA
| | - Kateryna D Makova
- Department of Biology, Pennsylvania State University, University Park, PA, USA
| | - Adam M Phillippy
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA.
|
29
|
Daida K, Funayama M, Billingsley KJ, Malik L, Miano-Burkhardt A, Leonard HL, Makarious MB, Iwaki H, Ding J, Gibbs JR, Ishiguro M, Yoshino H, Ogaki K, Oyama G, Nishioka K, Nonaka R, Akamatsu W, Blauwendraat C, Hattori N. Long-read sequencing resolves a complex structural variant in PRKN Parkinson's disease. medRxiv 2023:2023.08.14.23293948. [PMID: 37790330 PMCID: PMC10543050 DOI: 10.1101/2023.08.14.23293948] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/05/2023]
Abstract
Background: PRKN mutations are the most common cause of young-onset and autosomal recessive Parkinson's disease (PD). PRKN is located in FRA6E, one of the common fragile sites in the human genome, making this region prone to structural variants. However, complex structural variants such as inversions of PRKN are seldom reported, suggesting that there are potentially unrevealed complex pathogenic PRKN structural variants. Objectives: To identify complex structural variants in PRKN using long-read sequencing. Methods: We investigated the genetic cause of monozygotic twins presenting with young-onset dystonia-parkinsonism using targeted sequencing, whole-exome sequencing, multiplex ligation-dependent probe amplification, and long-read sequencing. We assessed the presence and frequency of complex inversions overlapping PRKN using whole-genome sequencing data from the AMP-PD and UK-Biobank datasets. Results: Multiplex ligation-dependent probe amplification identified a heterozygous exon 3 deletion in PRKN, and long-read sequencing identified a large novel inversion spanning over 7 Mb, including a large part of the coding DNA sequence of PRKN. We could diagnose the affected subjects as compound heterozygous carriers of PRKN variants. We analyzed whole-genome sequencing data of 43,538 participants of the UK-Biobank and 4,941 participants of the AMP-PD datasets. Nine inversions in the UK-Biobank and two in AMP-PD were identified and were considered potentially damaging and likely to affect PRKN isoforms. Conclusions: This is the first report describing a large 7 Mb inversion involving breakpoints outside of PRKN. This study highlights the importance of using long-read whole-genome sequencing for structural variant analysis in unresolved young-onset PD cases.
Affiliation(s)
- Kensuke Daida
- Integrative Neurogenomics Unit, Laboratory of Neurogenetics, National Institute on Aging, National Institutes of Health, Bethesda, MD, USA
- Center for Alzheimer’s and Related Dementias (CARD), National Institute on Aging and National Institute of Neurological Disorders and Stroke, National Institutes of Health, Bethesda, MD, USA
- Department of Neurology, Juntendo University School of Medicine, Hongo, Tokyo, Japan
| | - Manabu Funayama
- Department of Neurology, Juntendo University School of Medicine, Hongo, Tokyo, Japan
- Research Institute for Diseases of Old Age, Graduate School of Medicine, Juntendo University, Tokyo, Japan
| | - Kimberley J Billingsley
- Molecular Genetics Section, Laboratory of Neurogenetics, National Institute on Aging, National Institutes of Health, Bethesda, MD, USA
| | - Laksh Malik
- Center for Alzheimer’s and Related Dementias (CARD), National Institute on Aging and National Institute of Neurological Disorders and Stroke, National Institutes of Health, Bethesda, MD, USA
| | - Abigail Miano-Burkhardt
- Molecular Genetics Section, Laboratory of Neurogenetics, National Institute on Aging, National Institutes of Health, Bethesda, MD, USA
| | - Hampton L. Leonard
- Center for Alzheimer’s and Related Dementias (CARD), National Institute on Aging and National Institute of Neurological Disorders and Stroke, National Institutes of Health, Bethesda, MD, USA
- Molecular Genetics Section, Laboratory of Neurogenetics, National Institute on Aging, National Institutes of Health, Bethesda, MD, USA
- Data Tecnica International LLC, Washington, DC, USA
- German Center for Neurodegenerative Diseases (DZNE), Tübingen, Germany
| | - Mary B. Makarious
- Molecular Genetics Section, Laboratory of Neurogenetics, National Institute on Aging, National Institutes of Health, Bethesda, MD, USA
- Department of Clinical and Movement Neurosciences, UCL Queen Square Institute of Neurology, London, UK, WC1N 3BG
- UCL Movement Disorders Centre, University College London, London, UK, WC1N 3BG
| | - Hirotaka Iwaki
- Center for Alzheimer’s and Related Dementias (CARD), National Institute on Aging and National Institute of Neurological Disorders and Stroke, National Institutes of Health, Bethesda, MD, USA
- Data Tecnica International LLC, Washington, DC, USA
| | - Jinhui Ding
- Biocomputational Group, Laboratory of Neurogenetics, National Institute on Aging, NIH, Porter Neuroscience Research Center, Bethesda, MD, USA
| | - J. Raphael Gibbs
- Biocomputational Group, Laboratory of Neurogenetics, National Institute on Aging, NIH, Porter Neuroscience Research Center, Bethesda, MD, USA
| | - Mayu Ishiguro
- Department of Neurology, Juntendo University School of Medicine, Hongo, Tokyo, Japan
| | - Hiroyo Yoshino
- Research Institute for Diseases of Old Age, Graduate School of Medicine, Juntendo University, Tokyo, Japan
| | - Kotaro Ogaki
- Department of Neurology, Juntendo University School of Medicine, Hongo, Tokyo, Japan
| | - Genko Oyama
- Department of Neurology, Juntendo University School of Medicine, Hongo, Tokyo, Japan
| | - Kenya Nishioka
- Department of Neurology, Juntendo University School of Medicine, Hongo, Tokyo, Japan
| | - Risa Nonaka
- Department of Neurology, Juntendo University School of Medicine, Hongo, Tokyo, Japan
- Department of Clinical Data of Parkinson’s Disease, Graduate School of Medicine, Juntendo University, Tokyo, Japan
- Center for Genomic and Regenerative Medicine, Graduate School of Medicine, Juntendo University, Tokyo, Japan
| | - Wado Akamatsu
- Center for Genomic and Regenerative Medicine, Graduate School of Medicine, Juntendo University, Tokyo, Japan
| | - Cornelis Blauwendraat
- Integrative Neurogenomics Unit, Laboratory of Neurogenetics, National Institute on Aging, National Institutes of Health, Bethesda, MD, USA
- Center for Alzheimer’s and Related Dementias (CARD), National Institute on Aging and National Institute of Neurological Disorders and Stroke, National Institutes of Health, Bethesda, MD, USA
| | - Nobutaka Hattori
- Department of Neurology, Juntendo University School of Medicine, Hongo, Tokyo, Japan
- Research Institute for Diseases of Old Age, Graduate School of Medicine, Juntendo University, Tokyo, Japan
- Neurodegenerative Disorders Collaborative Laboratory, RIKEN Center for Brain Science, Wako, Saitama, Japan
| |
|
30
|
Ayad LAK, Chikhi R, Pissis SP. Seedability: optimizing alignment parameters for sensitive sequence comparison. Bioinform Adv 2023; 3:vbad108. [PMID: 37621456 PMCID: PMC10444664 DOI: 10.1093/bioadv/vbad108] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/21/2023] [Revised: 08/02/2023] [Accepted: 08/10/2023] [Indexed: 08/26/2023]
Abstract
Motivation Most sequence alignment techniques make use of exact k-mer hits, called seeds, as anchors to optimize alignment speed. A large number of bioinformatics tools employing seed-based alignment techniques, such as Minimap2, use a single value of k per sequencing technology, without a strong guarantee that this is the best possible value. Given the ubiquity of sequence alignment, identifying values of k that lead to more sensitive alignments is thus an important task. To aid this, we present Seedability, a seed-based alignment framework designed for estimating an optimal seed k-mer length (as well as a minimal number of shared seeds) based on a given alignment identity threshold. In particular, we were motivated to make Minimap2 more sensitive in the pairwise alignment of short sequences. Results The experimental results herein show improved alignments of short and divergent sequences when using the parameter values determined by Seedability in comparison to the default values of Minimap2. We also show several cases of pairs of real divergent sequences, where the default parameter values of Minimap2 yield no output alignments, but the values output by Seedability produce plausible alignments. Availability and implementation https://github.com/lorrainea/Seedability (distributed under GPL v3.0).
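The trade-off this abstract describes (longer seeds are more specific but are destroyed more easily by errors) can be illustrated with a toy model. The sketch below is not Seedability's method: it assumes independent per-base errors, so that a k-mer survives with probability identity^k, and the function names and the `k_max` cap are hypothetical.

```python
def expected_seed_hits(length: int, identity: float, k: int) -> float:
    """Expected number of error-free k-mers in a sequence of the given length,
    assuming each base matches independently with probability `identity`."""
    if k > length:
        return 0.0
    return (length - k + 1) * identity ** k


def largest_k_with_hits(length: int, identity: float, min_hits: float, k_max: int = 31) -> int:
    """Largest k (up to the assumed cap k_max) whose expected number of
    exact seed hits still reaches min_hits; returns 0 if no k qualifies."""
    best = 0
    for k in range(1, k_max + 1):
        if expected_seed_hits(length, identity, k) >= min_hits:
            best = k
    return best
```

Under this simplified model, lowering the identity threshold pushes the usable k down quickly, which is one way to see why a single default k per sequencing technology may miss short, divergent pairs.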
Affiliation(s)
- Lorraine A K Ayad
- Department of Computer Science, Brunel University London, London UB8 3PH, UK
| | - Rayan Chikhi
- G5 Sequence Bioinformatics, Institut Pasteur, Université Paris Cité, 75015 Paris, France
| | - Solon P Pissis
- Networks & Optimization, CWI, 1098 XG Amsterdam, The Netherlands
- Department of Computer Science, Vrije Universiteit, 1081 HV Amsterdam, The Netherlands
| |
|
31
|
Hung KL, Jones MG, Wong ITL, Lange JT, Luebeck J, Scanu E, He BJ, Brückner L, Li R, González RC, Schmargon R, Dörr JR, Belk JA, Bafna V, Werner B, Huang W, Henssen AG, Mischel PS, Chang HY. Coordinated inheritance of extrachromosomal DNA species in human cancer cells. bioRxiv 2023:2023.07.18.549597. [PMID: 37503111 PMCID: PMC10371175 DOI: 10.1101/2023.07.18.549597] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/29/2023]
Abstract
The chromosomal theory of inheritance has dominated human genetics, including cancer genetics. Genes on the same chromosome segregate together while genes on different chromosomes assort independently, providing a fundamental tenet of Mendelian inheritance. Extrachromosomal DNA (ecDNA) is a frequent event in cancer that drives oncogene amplification, dysregulated gene expression and intratumoral heterogeneity, including through random segregation during cell division. Distinct ecDNA sequences, herein termed ecDNA species, can co-exist to facilitate intermolecular cooperation in cancer cells. However, how multiple ecDNA species within a tumor cell are assorted and maintained across somatic cell generations to drive cancer cell evolution is not known. Here we show that cooperative ecDNA species can be coordinately inherited through mitotic co-segregation. Imaging and single-cell analyses show that multiple ecDNAs encoding distinct oncogenes co-occur and are correlated in copy number in human cancer cells. EcDNA species are coordinately segregated asymmetrically during mitosis, resulting in daughter cells with simultaneous copy number gains in multiple ecDNA species prior to any selection. Computational modeling reveals the quantitative principles of ecDNA co-segregation and co-selection, predicting their observed distributions in cancer cells. Finally, we show that coordinated inheritance of ecDNAs enables co-amplification of specialized ecDNAs containing only enhancer elements and guides therapeutic strategies to jointly deplete cooperating ecDNA oncogenes. Coordinated inheritance of ecDNAs confers stability to oncogene cooperation and novel gene regulatory circuits, allowing winning combinations of epigenetic states to be transmitted across cell generations.
Affiliation(s)
- King L. Hung
- Center for Personal Dynamic Regulomes, Stanford University, Stanford, CA 94305, USA
| | - Matthew G. Jones
- Center for Personal Dynamic Regulomes, Stanford University, Stanford, CA 94305, USA
| | - Ivy Tsz-Lo Wong
- Sarafan ChEM-H, Stanford University, Stanford, CA, USA
- Department of Pathology, Stanford University, Stanford, CA, USA
| | - Joshua T. Lange
- Sarafan ChEM-H, Stanford University, Stanford, CA, USA
- Department of Pathology, Stanford University, Stanford, CA, USA
| | - Jens Luebeck
- Department of Computer Science and Engineering, University of California at San Diego, La Jolla, CA, 92093, USA
| | - Elisa Scanu
- Department of Mathematics, Queen Mary University of London, London, UK
| | - Britney Jiayu He
- Center for Personal Dynamic Regulomes, Stanford University, Stanford, CA 94305, USA
| | - Lotte Brückner
- Max-Delbrück-Centrum für Molekulare Medizin (BIMSB/BIH), Berlin, Germany
- Experimental and Clinical Research Center (ECRC), Max Delbrück Center for Molecular Medicine and Charité—Universitätsmedizin Berlin, Lindenberger Weg 80, 13125, Berlin, Germany
| | - Rui Li
- Center for Personal Dynamic Regulomes, Stanford University, Stanford, CA 94305, USA
| | - Rocío Chamorro González
- Experimental and Clinical Research Center (ECRC), Max Delbrück Center for Molecular Medicine and Charité—Universitätsmedizin Berlin, Lindenberger Weg 80, 13125, Berlin, Germany
- Department of Pediatric Oncology/Hematology, Charité—Universitätsmedizin Berlin, Augustenburger Platz 1, 13353, Berlin, Germany
| | - Rachel Schmargon
- Experimental and Clinical Research Center (ECRC), Max Delbrück Center for Molecular Medicine and Charité—Universitätsmedizin Berlin, Lindenberger Weg 80, 13125, Berlin, Germany
- Department of Pediatric Oncology/Hematology, Charité—Universitätsmedizin Berlin, Augustenburger Platz 1, 13353, Berlin, Germany
| | - Jan R. Dörr
- Experimental and Clinical Research Center (ECRC), Max Delbrück Center for Molecular Medicine and Charité—Universitätsmedizin Berlin, Lindenberger Weg 80, 13125, Berlin, Germany
- Department of Pediatric Oncology/Hematology, Charité—Universitätsmedizin Berlin, Augustenburger Platz 1, 13353, Berlin, Germany
| | - Julia A. Belk
- Center for Personal Dynamic Regulomes, Stanford University, Stanford, CA 94305, USA
| | - Vineet Bafna
- Department of Computer Science and Engineering, University of California at San Diego, La Jolla, CA, 92093, USA
| | - Benjamin Werner
- Evolutionary Dynamics Group, Centre for Cancer Genomics and Computational Biology, Barts Cancer Institute, Queen Mary University of London, London, UK
| | - Weini Huang
- Department of Mathematics, Queen Mary University of London, London, UK
- Group of Theoretical Biology, The State Key Laboratory of Biocontrol, School of Life Science, Sun Yat-sen University, Guangzhou, China
| | - Anton G. Henssen
- Experimental and Clinical Research Center (ECRC), Max Delbrück Center for Molecular Medicine and Charité—Universitätsmedizin Berlin, Lindenberger Weg 80, 13125, Berlin, Germany
- Department of Pediatric Oncology/Hematology, Charité—Universitätsmedizin Berlin, Augustenburger Platz 1, 13353, Berlin, Germany
- German Cancer Consortium (DKTK), partner site Berlin, and German Cancer Research Center DKFZ, Im Neuenheimer Feld 280, 69120, Heidelberg, Germany
- Berlin Institute of Health, Anna-Louisa-Karsch-Str. 2, 10178, Berlin, Germany
| | - Paul S. Mischel
- Sarafan ChEM-H, Stanford University, Stanford, CA, USA
- Department of Pathology, Stanford University, Stanford, CA, USA
| | - Howard Y. Chang
- Center for Personal Dynamic Regulomes, Stanford University, Stanford, CA 94305, USA
- Department of Genetics, Stanford University, Stanford, CA, USA
- Howard Hughes Medical Institute, Stanford University School of Medicine, Stanford, CA 94305, USA
| |
|
32
|
Ekim B, Sahlin K, Medvedev P, Berger B, Chikhi R. Efficient mapping of accurate long reads in minimizer space with mapquik. Genome Res 2023; 33:1188-1197. [PMID: 37399256 PMCID: PMC10538364 DOI: 10.1101/gr.277679.123] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/08/2023] [Accepted: 06/26/2023] [Indexed: 07/05/2023]
Abstract
DNA sequencing data continue to progress toward longer reads with increasingly lower sequencing error rates. We focus on the critical problem of mapping, or aligning, low-divergence sequences from long reads (e.g., Pacific Biosciences [PacBio] HiFi) to a reference genome, which poses challenges in terms of accuracy and computational resources when using cutting-edge read mapping approaches that are designed for all types of alignments. A natural idea would be to optimize efficiency with longer seeds to reduce the probability of extraneous matches; however, contiguous exact seeds quickly reach a sensitivity limit. We introduce mapquik, a novel strategy that creates accurate longer seeds by anchoring alignments through matches of k consecutively sampled minimizers (k-min-mers) and only indexing k-min-mers that occur once in the reference genome, thereby unlocking ultrafast mapping while retaining high sensitivity. We show that mapquik significantly accelerates the seeding and chaining steps (fundamental bottlenecks to read mapping) for both the human and maize genomes with [Formula: see text] sensitivity and near-perfect specificity. On the human genome, for both real and simulated reads, mapquik achieves a [Formula: see text] speedup over the state-of-the-art tool minimap2, and on the maize genome, mapquik achieves a [Formula: see text] speedup over minimap2, making mapquik the fastest mapper to date. These accelerations are enabled not only by minimizer-space seeding but also by a novel heuristic [Formula: see text] pseudochaining algorithm, which improves upon the long-standing [Formula: see text] bound. Minimizer-space computation builds the foundation for achieving real-time analysis of long-read sequencing data.
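The k-min-mer idea in this abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not mapquik's implementation: the hash-threshold sampling scheme, the use of CRC32 as the hash, and all function names are hypothetical choices made for the example.

```python
import zlib
from collections import Counter


def sample_minimizers(seq, l, density):
    """Return (position, l-mer) pairs whose hash falls below a density threshold.
    Assumption: minimizers are sampled by hash value alone; CRC32 is a stand-in hash."""
    threshold = int(density * 2**32)
    out = []
    for i in range(len(seq) - l + 1):
        lmer = seq[i:i + l]
        if zlib.crc32(lmer.encode()) < threshold:
            out.append((i, lmer))
    return out


def k_min_mers(minimizers, k):
    """Tuples of k consecutive sampled minimizers, tagged with the start of the first one."""
    return [
        (minimizers[i][0], tuple(m for _, m in minimizers[i:i + k]))
        for i in range(len(minimizers) - k + 1)
    ]


def unique_index(kmms):
    """Index only the k-min-mers that occur exactly once, mapping each to its position."""
    counts = Counter(key for _, key in kmms)
    return {key: pos for pos, key in kmms if counts[key] == 1}
```

With `density=1.0` every l-mer is kept, which makes the behaviour easy to check by hand: in `"ACGTACGTAC"` with `l=3` and `k=2`, every pair of consecutive 3-mers repeats except `("TAC", "ACG")`, so that is the only entry left in the index.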
Affiliation(s)
- Bariş Ekim
- Computer Science and Artificial Intelligence Laboratory (CSAIL), Massachusetts Institute of Technology (MIT), Cambridge, Massachusetts 02139, USA
- Department of Mathematics, Massachusetts Institute of Technology (MIT), Cambridge, Massachusetts 02139, USA
| | - Kristoffer Sahlin
- Department of Mathematics, Science for Life Laboratory, Stockholm University, SE-106 91 Stockholm, Sweden
| | - Paul Medvedev
- Department of Computer Science and Engineering, The Pennsylvania State University, University Park, Pennsylvania 16802, USA
- Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, Pennsylvania 16802, USA
- Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, Pennsylvania 16802, USA
| | - Bonnie Berger
- Computer Science and Artificial Intelligence Laboratory (CSAIL), Massachusetts Institute of Technology (MIT), Cambridge, Massachusetts 02139, USA
- Department of Mathematics, Massachusetts Institute of Technology (MIT), Cambridge, Massachusetts 02139, USA
| | - Rayan Chikhi
- Department of Computational Biology, Institut Pasteur, 75015 Paris, France
| |
|
33
|
Maier BD, Sahlin K. Entropy predicts sensitivity of pseudorandom seeds. Genome Res 2023; 33:1162-1174. [PMID: 37217253 PMCID: PMC10538493 DOI: 10.1101/gr.277645.123] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/02/2023] [Accepted: 05/04/2023] [Indexed: 05/24/2023]
Abstract
Seed design is important for sequence similarity search applications such as read mapping and average nucleotide identity (ANI) estimation. Although k-mers and spaced k-mers are likely the most well-known and used seeds, their sensitivity suffers at high error rates, particularly when indels are present. Recently, we developed a pseudorandom seeding construct, strobemers, which was empirically shown to have high sensitivity even at high indel rates. However, that study did not explain why. In this study, we propose a model to estimate the entropy of a seed and find that seeds with high entropy, according to our model, in most cases have high match sensitivity. Our discovered seed randomness-sensitivity relationship explains why some seeds perform better than others, and the relationship provides a framework for designing even more sensitive seeds. We also present three new strobemer seed constructs: mixedstrobes, altstrobes, and multistrobes. We use both simulated and biological data to show that our new seed constructs improve sequence-matching sensitivity compared with other strobemers. We show that the three new seed constructs are useful for read mapping and ANI estimation. For read mapping, we implement strobemers into minimap2 and observe 30% faster alignment time and 0.2% higher accuracy than using k-mers when mapping reads at high error rates. As for ANI estimation, we find that higher entropy seeds have a higher rank correlation between estimated and true ANI.
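To make the notion of a pseudorandom seed concrete, the sketch below builds simplified order-2, randstrobe-like seeds: each seed joins the k-mer at position i with a second k-mer chosen pseudorandomly from a downstream window. It is an illustration only. The XOR link function, the CRC32 hash, the handling of sequence ends, and the function names are assumptions for this example and are not taken from the paper or from any strobemer library.

```python
import zlib


def _h(s):
    """Stand-in hash for a k-mer (assumption: any well-mixed hash would do)."""
    return zlib.crc32(s.encode())


def randstrobes(seq, k, w_min, w_max):
    """Return (i, j, seed) triples, where seed joins the k-mer at i with a
    linked k-mer at j chosen from the window [i + w_min, i + w_max]."""
    seeds = []
    n = len(seq) - k + 1  # number of k-mer start positions
    for i in range(n):
        lo, hi = i + w_min, min(i + w_max, n - 1)
        if lo > hi:
            break  # no room left for a second strobe
        h1 = _h(seq[i:i + k])
        # Pick the window position minimising the (assumed) XOR link function,
        # breaking ties by the leftmost position.
        j = min(range(lo, hi + 1), key=lambda p: (h1 ^ _h(seq[p:p + k]), p))
        seeds.append((i, j, seq[i:i + k] + seq[j:j + k]))
    return seeds
```

Because the second strobe is picked by hash rather than by a fixed offset, the gap between the two strobes varies from seed to seed, which is the kind of randomness in seed shape that the abstract's entropy model is concerned with.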
Affiliation(s)
| | - Kristoffer Sahlin
- Department of Mathematics, Stockholm University, 106 91 Stockholm, Sweden
| |
|
34
|
Ma B, Wang H, Liu J, Chen L, Xia X, Wei W, Yang Z, Yuan J, Luo Y, He N. The gap-free genome of mulberry elucidates the architecture and evolution of polycentric chromosomes. Hortic Res 2023; 10:uhad111. [PMID: 37786730 PMCID: PMC10541557 DOI: 10.1093/hr/uhad111] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/30/2023] [Accepted: 05/15/2023] [Indexed: 10/04/2023]
Abstract
Mulberry is a fundamental component of the global sericulture industry, and its positive impact on our health and the environment cannot be overstated. However, the mulberry reference genomes reported previously still contained unassembled or unplaced sequences. Here, we report the assembly and analysis of the telomere-to-telomere gap-free reference genome of the mulberry species Morus notabilis, which has emerged as an important reference in mulberry gene function research and genetic improvement. The mulberry gap-free reference genome produced here provides an unprecedented opportunity for us to study the structure and function of centromeres. Our results revealed that all mulberry centromeric regions share conserved centromeric satellite repeats with different copies. Strikingly, we found that M. notabilis is a species with polycentric chromosomes and the only polycentric chromosome species reported to date. We propose a compelling model that explains the formation mechanism of new centromeres and addresses the unsolved scientific question of the chromosome fusion-fission cycle in mulberry species. Our study sheds light on the functional genomics, chromosome evolution, and genetic improvement of mulberry species.
Affiliation(s)
- Bi Ma
- State Key Laboratory of Resource Insects, Southwest University, Chongqing, 400715, China
| | - Honghong Wang
- State Key Laboratory of Resource Insects, Southwest University, Chongqing, 400715, China
| | - Jingchun Liu
- State Key Laboratory of Resource Insects, Southwest University, Chongqing, 400715, China
| | - Lin Chen
- State Key Laboratory of Resource Insects, Southwest University, Chongqing, 400715, China
| | - Xiaoyu Xia
- State Key Laboratory of Resource Insects, Southwest University, Chongqing, 400715, China
| | - Wuqi Wei
- State Key Laboratory of Resource Insects, Southwest University, Chongqing, 400715, China
| | - Zhen Yang
- State Key Laboratory of Resource Insects, Southwest University, Chongqing, 400715, China
| | - Jianglian Yuan
- State Key Laboratory of Resource Insects, Southwest University, Chongqing, 400715, China
| | - Yiwei Luo
- State Key Laboratory of Resource Insects, Southwest University, Chongqing, 400715, China
| | - Ningjia He
- State Key Laboratory of Resource Insects, Southwest University, Chongqing, 400715, China
| |
|
35
|
Pazhenkova EA, Lukhtanov VA. Chromosomal conservatism vs chromosomal megaevolution: enigma of karyotypic evolution in Lepidoptera. Chromosome Res 2023; 31:16. [PMID: 37300756 DOI: 10.1007/s10577-023-09725-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/28/2022] [Revised: 05/21/2023] [Accepted: 05/23/2023] [Indexed: 06/12/2023]
Abstract
In the evolution of many organisms, periods of slow genome reorganization (= chromosomal conservatism) are interrupted by bursts of numerous chromosomal changes (= chromosomal megaevolution). Using comparative analysis of chromosome-level genome assemblies, we investigated these processes in blue butterflies (Lycaenidae). We demonstrate that the phase of chromosome number conservatism is characterized by the stability of most autosomes and dynamic evolution of the sex chromosome Z, resulting in multiple variants of NeoZ chromosomes due to autosome-sex chromosome fusions. In contrast, during the phase of rapid chromosomal evolution, the explosive increase in chromosome number occurs mainly through simple chromosomal fissions. We show that chromosomal megaevolution is a highly non-random canalized process, and in two phylogenetically independent Lysandra lineages, the drastic parallel increase in the number of fragmented chromosomes was achieved, at least partially, through reuse of the same ancestral chromosomal breakpoints. In species showing chromosome number doubling, we found no blocks of duplicated sequences or duplicated chromosomes, thus refuting the hypothesis of polyploidy. In the studied taxa, long blocks of interstitial telomere sequences (ITSs) consist of (TTAGG)n arrays interspersed with telomere-specific retrotransposons. ITSs are sporadically present in rapidly evolving Lysandra karyotypes, but not in the species with the ancestral chromosome number. Therefore, we hypothesize that the transposition of telomeric sequences may trigger the rapid chromosome number increase. Finally, we discuss the hypothetical genomic and population mechanisms of chromosomal megaevolution and argue that the disproportionally high evolutionary role of the Z sex chromosome can be additionally reinforced by sex chromosome-autosome fusions and Z-chromosome inversions.
Affiliation(s)
- Elena A Pazhenkova
- Department of Biology, Biotechnical Faculty, University of Ljubljana, Večna Pot 111, 1000, Ljubljana, Slovenia.
| | - Vladimir A Lukhtanov
- Department of Karyosystematics, Zoological Institute of Russian Academy of Sciences, Universitetskaya Nab. 1, 199034, St. Petersburg, Russia.
| |
|
36
|
Abstract
It has been over a decade since the first publication of a method dedicated entirely to mapping long reads. The distinctive characteristics of long reads led methods to move from the seed-and-extend framework used for short reads to a seed-and-chain framework, owing to the abundance of seeds in each read. The main novelties are based on alternative seed constructs or chaining formulations. Dozens of tools now exist, whose heuristics have evolved considerably. We provide an overview of the methods used in long-read mappers. Since they are driven by implementation-specific parameters, we develop an original visualization tool to understand the parameter settings (http://bcazaux.polytech-lille.net/Minimap2/).
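The seed-and-chain step mentioned in this abstract can be illustrated with a textbook quadratic-time dynamic program over anchors. This is a generic sketch, not the formulation of any specific mapper: the scoring (anchor length minus the difference between reference and read gaps), the `max_gap` cutoff, and the function name are assumptions chosen for brevity.

```python
def chain_anchors(anchors, max_gap=1000):
    """Chain collinear anchors given as (ref_pos, read_pos, length) tuples.
    Returns (best_score, chain), where chain lists the anchors in order."""
    if not anchors:
        return 0, []
    anchors = sorted(anchors)
    score = [a[2] for a in anchors]   # best chain score ending at each anchor
    prev = [-1] * len(anchors)        # back-pointers for traceback
    for i, (ri, qi, li) in enumerate(anchors):
        for j in range(i):
            rj, qj, _ = anchors[j]
            dr, dq = ri - rj, qi - qj
            # Predecessor must lie strictly before anchor i in both sequences.
            if dr <= 0 or dq <= 0 or max(dr, dq) > max_gap:
                continue
            cand = score[j] + li - abs(dr - dq)  # penalise diagonal drift
            if cand > score[i]:
                score[i], prev[i] = cand, j
    best = max(range(len(anchors)), key=score.__getitem__)
    chain = []
    while best != -1:
        chain.append(anchors[best])
        best = prev[best]
    return max(score), chain[::-1]
```

For example, three anchors lying near one diagonal chain together, while an anchor far off that diagonal is left out because its gap penalty outweighs its length.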
Affiliation(s)
- Kristoffer Sahlin
- Department of Mathematics, Science for Life Laboratory, Stockholm University, 106 91, Stockholm, Sweden.
| | - Thomas Baudeau
- Univ. Lille, CNRS, Centrale Lille, UMR 9189 CRIStAL, F-59000, Lille, France
| | - Bastien Cazaux
- Univ. Lille, CNRS, Centrale Lille, UMR 9189 CRIStAL, F-59000, Lille, France
| | - Camille Marchet
- Univ. Lille, CNRS, Centrale Lille, UMR 9189 CRIStAL, F-59000, Lille, France.
| |
|
37
|
Rizig M, Bandres-Ciga S, Makarious MB, Ojo O, Crea PW, Abiodun O, Levine KS, Abubakar S, Achoru C, Vitale D, Adeniji O, Agabi O, Koretsky MJ, Agulanna U, Hall DA, Akinyemi R, Xie T, Ali M, Shamim EA, Ani-Osheku I, Padmanaban M, Arigbodi O, Standaert DG, Bello A, Dean M, Erameh C, Elsayed I, Farombi T, Okunoye O, Fawale M, Billingsley KJ, Imarhiagbe F, Jerez PA, Iwuozo E, Baker B, Komolafe M, Malik L, Nwani P, Daida K, Nwazor E, Miano-Burkhardt A, Nyandaiti Y, Fang ZH, Obiabo Y, Kluss JH, Odeniyi O, Hernandez D, Odiase F, Tayebi N, Ojini F, Sidransky E, Onwuegbuzie G, D’Souza AM, Osaigbovo G, Berhe B, Osemwegie N, Reed X, Oshinaike O, Leonard H, Otubogun F, Alvarado CX, Oyakhire S, Ozomma S, Samuel S, Taiwo F, Wahab K, Zubair Y, Iwaki H, Kim JJ, Morris HR, Hardy J, Nalls M, Heilbron K, Norcliffe-Kaufmann L, Blauwendraat C, Houlden H, Singleton A, Okubadejo N. Genome-wide Association Identifies Novel Etiological Insights Associated with Parkinson's Disease in African and African Admixed Populations. medRxiv 2023:2023.05.05.23289529. [PMID: 37398408 PMCID: PMC10312852 DOI: 10.1101/2023.05.05.23289529] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Indexed: 07/04/2023]
Abstract
Background Understanding the genetic mechanisms underlying diseases in ancestrally diverse populations is a critical step towards the realization of the global application of precision medicine. The African and African admixed populations enable mapping of complex traits given their greater levels of genetic diversity, extensive population substructure, and distinct linkage disequilibrium patterns. Methods Here we perform a comprehensive genome-wide assessment of Parkinson's disease (PD) in 197,918 individuals (1,488 cases; 196,430 controls) of African and African admixed ancestry, characterizing population-specific risk, differential haplotype structure and admixture, coding and structural genetic variation and polygenic risk profiling. Findings We identified a novel common risk factor for PD and age at onset at the GBA1 locus (risk, rs3115534-G; OR = 1.58, 95% CI = 1.37-1.80, P = 2.397E-14; age at onset, BETA = -2.004, SE = 0.57, P = 0.0005), which was found to be rare in non-African/African admixed populations. Downstream short- and long-read whole genome sequencing analyses did not reveal any coding or structural variant underlying the GWAS signal. However, we identified that this signal mediates PD risk via expression quantitative trait locus (eQTL) mechanisms. While previously identified GBA1 associated disease risk variants are coding mutations, here we suggest a novel functional mechanism consistent with a trend in decreasing glucocerebrosidase activity levels. Given the high population frequency of the underlying signal and the phenotypic characteristics of the homozygous carriers, we hypothesize that this variant may not cause Gaucher disease. Additionally, the prevalence of Gaucher disease in Africa is low. Interpretation The present study identifies a novel African-ancestry genetic risk factor in GBA1 as a major mechanistic basis of PD in the African and African admixed populations.
This striking result contrasts with previous work in Northern European populations, both in terms of mechanism and attributable risk. This finding highlights the importance of understanding population-specific genetic risk in complex diseases, a particularly crucial point as the field moves toward precision medicine in PD clinical trials and while recognizing the need for equitable inclusion of ancestrally diverse groups in such trials. Given the distinctive genetics of these underrepresented populations, their inclusion represents a valuable step towards insights into novel genetic determinants underlying PD etiology. This opens new avenues towards RNA-based and other therapeutic strategies aimed at reducing lifetime risk. Research in Context Evidence Before this Study Our current understanding of Parkinson's disease (PD) is disproportionately based on studying populations of European ancestry, leading to a significant gap in our knowledge about the genetics, clinical characteristics, and pathophysiology in underrepresented populations. This is particularly notable in individuals of African and African admixed ancestries. Over the last two decades, we have witnessed a revolution in the research area of complex genetic diseases. In the PD field, large-scale genome-wide association studies in the European, Asian, and Latin American populations have identified multiple risk loci associated with disease. These include 78 loci and 90 independent signals associated with PD risk in the European population, nine replicated loci and two novel population-specific signals in the Asian population, and a total of 11 novel loci recently nominated through multi-ancestry GWAS efforts. Nevertheless, the African and African admixed populations remain completely unexplored in the context of PD genetics.
Added Value of this Study To address the lack of diversity in our research field, this study aimed to conduct the first genome-wide assessment of PD genetics in the African and African admixed populations. Here, we identified a genetic risk factor linked to PD etiology, dissected African-specific differences in risk and age at onset, characterized known genetic risk factors, and highlighted the utility of the African and African admixed risk haplotype substructure for future fine-mapping efforts. We identified a novel disease mechanism via expression changes consistent with decreased GBA1 activity levels. Future large scale single cell expression studies should investigate the neuronal populations in which expression differences are most prominent. This novel mechanism may hold promise for future efficient RNA-based therapeutic strategies such as antisense oligonucleotides or short interfering RNAs aimed at preventing and decreasing disease risk. We envisage that these data generated under the umbrella of the Global Parkinson's Genetics Program (GP2) will shed light on the molecular mechanisms involved in the disease process and might pave the way for future clinical trials and therapeutic interventions. This work represents a valuable resource in an underserved population, supporting pioneering research within GP2 and beyond. Deciphering causal and genetic risk factors in all these ancestries will help determine whether interventions, potential targets for disease modifying treatment, and prevention strategies that are being studied in the European populations are relevant to the African and African admixed populations. Implications of all the Available Evidence We nominate a novel signal impacting GBA1 as the major genetic risk factor for PD in the African and African admixed populations. The present study could inform future GBA1 clinical trials, improving patient stratification. 
In this regard, genetic testing can help to design trials likely to provide meaningful and actionable answers. It is our hope that these findings may ultimately have clinical utility for this underrepresented population.
Affiliation(s)
- Mie Rizig
  - Department of Neuromuscular Diseases, UCL Queen Square Institute of Neurology, London WC1N 3BG, UK
  - UCL Movement Disorders Centre, University College London, London, WC1N 3BG, UK
- Sara Bandres-Ciga
  - Center for Alzheimer’s and Related Dementias (CARD), National Institute on Aging and National Institute of Neurological Disorders and Stroke, National Institutes of Health, Bethesda, MD, USA, 20814
- Mary B Makarious
  - UCL Movement Disorders Centre, University College London, London, WC1N 3BG, UK
  - Laboratory of Neurogenetics, National Institute on Aging, National Institutes of Health, Bethesda, MD, USA
- Oluwadamilola Ojo
  - College of Medicine, University of Lagos, Idi Araba, Lagos State, Nigeria
- Peter Wild Crea
  - Center for Alzheimer’s and Related Dementias (CARD), National Institute on Aging and National Institute of Neurological Disorders and Stroke, National Institutes of Health, Bethesda, MD, USA, 20814
  - Laboratory of Neurogenetics, National Institute on Aging, National Institutes of Health, Bethesda, MD, USA
- Kristin S Levine
  - Center for Alzheimer’s and Related Dementias (CARD), National Institute on Aging and National Institute of Neurological Disorders and Stroke, National Institutes of Health, Bethesda, MD, USA, 20814
  - Data Tecnica International, Washington, DC, USA
- Sani Abubakar
  - Ahmadu Bello University, Zaria, Kaduna State, Nigeria
- Charles Achoru
  - Jos University Teaching Hospital, Jos, Plateau State, Nigeria
- Dan Vitale
  - Data Tecnica International, Washington, DC, USA
- Osigwe Agabi
  - College of Medicine, University of Lagos, Idi Araba, Lagos State, Nigeria
- Mathew J Koretsky
  - Center for Alzheimer’s and Related Dementias (CARD), National Institute on Aging and National Institute of Neurological Disorders and Stroke, National Institutes of Health, Bethesda, MD, USA, 20814
- Uchechi Agulanna
  - Lagos University Teaching Hospital, Idi Araba, Lagos State, Nigeria
- Deborah A. Hall
  - Department of Neurological Sciences, Rush University Medical Center, Chicago, IL, USA
- Rufus Akinyemi
  - Neuroscience and Ageing Research Unit, Institute for Advanced Medical Research and Training, College of Medicine, University of Ibadan, Ibadan, Oyo State, Nigeria
- Tao Xie
  - Department of Neurology, University of Chicago Medicine, Chicago, Illinois, USA
- Mohammed Ali
  - Federal Teaching Hospital Gombe, Gombe State, Nigeria
- Ejaz A. Shamim
  - Human Motor Control Section, National Institute of Neurological Disorders and Stroke, National Institutes of Health, Bethesda, Maryland, USA
  - Kaiser Permanente Mid-Atlantic States, Largo, Maryland, USA
  - MidAtlantic Permanente Research Institute, Rockville, Maryland, USA
- Mahesh Padmanaban
  - Department of Neurology, University of Chicago Medicine, Chicago, Illinois, USA
- David G Standaert
  - Department of Neurology, University of Alabama at Birmingham, Birmingham, AL, USA
- Abiodun Bello
  - University of Ilorin Teaching Hospital, Ilorin, Kwara State, Nigeria
- Marissa Dean
  - Department of Neurology, University of Alabama at Birmingham, Birmingham, AL, USA
- Cyril Erameh
  - Irrua Specialist Teaching Hospital, Irrua, Edo State, Nigeria
- Inas Elsayed
  - Faculty of Pharmacy, University of Gezira, Wadmadani, 20, Sudan
- Olaitan Okunoye
  - Department of Neuromuscular Diseases, UCL Queen Square Institute of Neurology, London WC1N 3BG, UK
- Michael Fawale
  - Obafemi Awolowo University, Ile-Ife, Osun State, Nigeria
- Kimberley J Billingsley
  - Center for Alzheimer’s and Related Dementias (CARD), National Institute on Aging and National Institute of Neurological Disorders and Stroke, National Institutes of Health, Bethesda, MD, USA, 20814
  - Laboratory of Neurogenetics, National Institute on Aging, National Institutes of Health, Bethesda, MD, USA
- Pilar Alvarez Jerez
  - Department of Neuromuscular Diseases, UCL Queen Square Institute of Neurology, London WC1N 3BG, UK
  - Center for Alzheimer’s and Related Dementias (CARD), National Institute on Aging and National Institute of Neurological Disorders and Stroke, National Institutes of Health, Bethesda, MD, USA, 20814
- Breeana Baker
  - Center for Alzheimer’s and Related Dementias (CARD), National Institute on Aging and National Institute of Neurological Disorders and Stroke, National Institutes of Health, Bethesda, MD, USA, 20814
- Laksh Malik
  - Center for Alzheimer’s and Related Dementias (CARD), National Institute on Aging and National Institute of Neurological Disorders and Stroke, National Institutes of Health, Bethesda, MD, USA, 20814
- Paul Nwani
  - Nnamdi Azikiwe University Teaching Hospital, Nnewi, Anambra State, Nigeria
- Kensuke Daida
  - Center for Alzheimer’s and Related Dementias (CARD), National Institute on Aging and National Institute of Neurological Disorders and Stroke, National Institutes of Health, Bethesda, MD, USA, 20814
  - Laboratory of Neurogenetics, National Institute on Aging, National Institutes of Health, Bethesda, MD, USA
- Ernest Nwazor
  - Rivers State University Teaching Hospital, Port Harcourt, Rivers State, Nigeria
- Abigail Miano-Burkhardt
  - Center for Alzheimer’s and Related Dementias (CARD), National Institute on Aging and National Institute of Neurological Disorders and Stroke, National Institutes of Health, Bethesda, MD, USA, 20814
  - Laboratory of Neurogenetics, National Institute on Aging, National Institutes of Health, Bethesda, MD, USA
- Yakub Nyandaiti
  - University of Maiduguri Teaching Hospital, Maiduguri, Borno State, Nigeria
- Zih-Hua Fang
  - German Center for Neurodegenerative Diseases (DZNE), Tuebingen, Germany
- Yahaya Obiabo
  - Federal University of Health Sciences, Otukpo, Benue State, Nigeria
- Jillian H. Kluss
  - Laboratory of Neurogenetics, National Institute on Aging, National Institutes of Health, Bethesda, MD, USA
- Dena Hernandez
  - Laboratory of Neurogenetics, National Institute on Aging, National Institutes of Health, Bethesda, MD, USA
- Nahid Tayebi
  - Medical Genetics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland, USA
- Francis Ojini
  - College of Medicine, University of Lagos, Idi Araba, Lagos State, Nigeria
- Ellen Sidranksy
  - Medical Genetics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland, USA
- Andrea M. D’Souza
  - Medical Genetics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland, USA
- Bahafta Berhe
  - Medical Genetics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland, USA
- Xylena Reed
  - Center for Alzheimer’s and Related Dementias (CARD), National Institute on Aging and National Institute of Neurological Disorders and Stroke, National Institutes of Health, Bethesda, MD, USA, 20814
- Hampton Leonard
  - Center for Alzheimer’s and Related Dementias (CARD), National Institute on Aging and National Institute of Neurological Disorders and Stroke, National Institutes of Health, Bethesda, MD, USA, 20814
  - Data Tecnica International, Washington, DC, USA
- Chelsea X Alvarado
  - Center for Alzheimer’s and Related Dementias (CARD), National Institute on Aging and National Institute of Neurological Disorders and Stroke, National Institutes of Health, Bethesda, MD, USA, 20814
  - Data Tecnica International, Washington, DC, USA
- Simon Ozomma
  - University of Calabar Teaching Hospital, Calabar, Cross River State, Nigeria
- Sarah Samuel
  - University of Maiduguri Teaching Hospital, Maiduguri, Borno State, Nigeria
- Kolawole Wahab
  - University of Ilorin Teaching Hospital, Ilorin, Kwara State, Nigeria
  - University of Ilorin, Ilorin, Kwara State, Nigeria
- Yusuf Zubair
  - National Hospital, Abuja, Federal Capital Territory, Nigeria
- Hirotaka Iwaki
  - Center for Alzheimer’s and Related Dementias (CARD), National Institute on Aging and National Institute of Neurological Disorders and Stroke, National Institutes of Health, Bethesda, MD, USA, 20814
  - Data Tecnica International, Washington, DC, USA
- Jonggeol Jeffrey Kim
  - Center for Alzheimer’s and Related Dementias (CARD), National Institute on Aging and National Institute of Neurological Disorders and Stroke, National Institutes of Health, Bethesda, MD, USA, 20814
  - Laboratory of Neurogenetics, National Institute on Aging, National Institutes of Health, Bethesda, MD, USA
- Huw R Morris
  - Department of Neuromuscular Diseases, UCL Queen Square Institute of Neurology, London WC1N 3BG, UK
  - UCL Movement Disorders Centre, University College London, London, WC1N 3BG, UK
- John Hardy
  - Department of Neuromuscular Diseases, UCL Queen Square Institute of Neurology, London WC1N 3BG, UK
- Mike Nalls
  - Data Tecnica International, Washington, DC, USA
- Cornelis Blauwendraat
  - Center for Alzheimer’s and Related Dementias (CARD), National Institute on Aging and National Institute of Neurological Disorders and Stroke, National Institutes of Health, Bethesda, MD, USA, 20814
  - Laboratory of Neurogenetics, National Institute on Aging, National Institutes of Health, Bethesda, MD, USA
- Henry Houlden
  - Department of Neuromuscular Diseases, UCL Queen Square Institute of Neurology, London WC1N 3BG, UK
- Andrew Singleton
  - Center for Alzheimer’s and Related Dementias (CARD), National Institute on Aging and National Institute of Neurological Disorders and Stroke, National Institutes of Health, Bethesda, MD, USA, 20814
  - Laboratory of Neurogenetics, National Institute on Aging, National Institutes of Health, Bethesda, MD, USA
- Njideka Okubadejo
  - College of Medicine, University of Lagos, Idi Araba, Lagos State, Nigeria
38
Liao WW, Asri M, Ebler J, Doerr D, Haukness M, Hickey G, Lu S, Lucas JK, Monlong J, Abel HJ, Buonaiuto S, Chang XH, Cheng H, Chu J, Colonna V, Eizenga JM, Feng X, Fischer C, Fulton RS, Garg S, Groza C, Guarracino A, Harvey WT, Heumos S, Howe K, Jain M, Lu TY, Markello C, Martin FJ, Mitchell MW, Munson KM, Mwaniki MN, Novak AM, Olsen HE, Pesout T, Porubsky D, Prins P, Sibbesen JA, Sirén J, Tomlinson C, Villani F, Vollger MR, Antonacci-Fulton LL, Baid G, Baker CA, Belyaeva A, Billis K, Carroll A, Chang PC, Cody S, Cook DE, Cook-Deegan RM, Cornejo OE, Diekhans M, Ebert P, Fairley S, Fedrigo O, Felsenfeld AL, Formenti G, Frankish A, Gao Y, Garrison NA, Giron CG, Green RE, Haggerty L, Hoekzema K, Hourlier T, Ji HP, Kenny EE, Koenig BA, Kolesnikov A, Korbel JO, Kordosky J, Koren S, Lee H, Lewis AP, Magalhães H, Marco-Sola S, Marijon P, McCartney A, McDaniel J, Mountcastle J, Nattestad M, Nurk S, Olson ND, Popejoy AB, Puiu D, Rautiainen M, Regier AA, Rhie A, Sacco S, Sanders AD, Schneider VA, Schultz BI, Shafin K, Smith MW, Sofia HJ, Abou Tayoun AN, Thibaud-Nissen F, Tricomi FF, Wagner J, Walenz B, Wood JMD, Zimin AV, Bourque G, Chaisson MJP, Flicek P, Phillippy AM, Zook JM, Eichler EE, Haussler D, Wang T, Jarvis ED, Miga KH, Garrison E, Marschall T, Hall IM, Li H, Paten B. A draft human pangenome reference. Nature 2023; 617:312-324. [PMID: 37165242 PMCID: PMC10172123 DOI: 10.1038/s41586-023-05896-x] [Citation(s) in RCA: 170] [Impact Index Per Article: 170.0] [Received: 07/09/2022] [Accepted: 02/28/2023] [Indexed: 05/12/2023]
Abstract
Here the Human Pangenome Reference Consortium presents a first draft of the human pangenome reference. The pangenome contains 47 phased, diploid assemblies from a cohort of genetically diverse individuals [1]. These assemblies cover more than 99% of the expected sequence in each genome and are more than 99% accurate at the structural and base pair levels. Based on alignments of the assemblies, we generate a draft pangenome that captures known variants and haplotypes and reveals new alleles at structurally complex loci. We also add 119 million base pairs of euchromatic polymorphic sequences and 1,115 gene duplications relative to the existing reference GRCh38. Roughly 90 million of the additional base pairs are derived from structural variation. Using our draft pangenome to analyse short-read data reduced small variant discovery errors by 34% and increased the number of structural variants detected per haplotype by 104% compared with GRCh38-based workflows, which enabled the typing of the vast majority of structural variant alleles per sample.
Affiliation(s)
- Wen-Wei Liao
  - Department of Genetics, Yale University School of Medicine, New Haven, CT, USA
  - Center for Genomic Health, Yale University School of Medicine, New Haven, CT, USA
  - Division of Biology and Biomedical Sciences, Washington University School of Medicine, St. Louis, MO, USA
- Mobin Asri
  - Genomics Institute, University of California, Santa Cruz, CA, USA
- Jana Ebler
  - Institute for Medical Biometry and Bioinformatics, Medical Faculty, Heinrich Heine University, Düsseldorf, Germany
  - Center for Digital Medicine, Heinrich Heine University, Düsseldorf, Germany
- Daniel Doerr
  - Institute for Medical Biometry and Bioinformatics, Medical Faculty, Heinrich Heine University, Düsseldorf, Germany
  - Center for Digital Medicine, Heinrich Heine University, Düsseldorf, Germany
- Marina Haukness
  - Genomics Institute, University of California, Santa Cruz, CA, USA
- Glenn Hickey
  - Genomics Institute, University of California, Santa Cruz, CA, USA
- Shuangjia Lu
  - Department of Genetics, Yale University School of Medicine, New Haven, CT, USA
  - Center for Genomic Health, Yale University School of Medicine, New Haven, CT, USA
- Julian K Lucas
  - Genomics Institute, University of California, Santa Cruz, CA, USA
- Jean Monlong
  - Genomics Institute, University of California, Santa Cruz, CA, USA
- Haley J Abel
  - Division of Oncology, Department of Internal Medicine, Washington University School of Medicine, St. Louis, MO, USA
- Silvia Buonaiuto
  - Institute of Genetics and Biophysics, National Research Council, Naples, Italy
- Xian H Chang
  - Genomics Institute, University of California, Santa Cruz, CA, USA
- Haoyu Cheng
  - Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA, USA
  - Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
- Justin Chu
  - Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA, USA
- Vincenza Colonna
  - Institute of Genetics and Biophysics, National Research Council, Naples, Italy
  - Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
- Jordan M Eizenga
  - Genomics Institute, University of California, Santa Cruz, CA, USA
- Xiaowen Feng
  - Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA, USA
  - Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
- Christian Fischer
  - Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
- Robert S Fulton
  - McDonnell Genome Institute, Washington University School of Medicine, St. Louis, MO, USA
  - Department of Genetics, Washington University School of Medicine, St. Louis, MO, USA
- Shilpa Garg
  - Novo Nordisk Foundation Center for Biosustainability, Technical University of Denmark, Copenhagen, Denmark
- Cristian Groza
  - Quantitative Life Sciences, McGill University, Montréal, Québec, Canada
- Andrea Guarracino
  - Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
  - Genomics Research Centre, Human Technopole, Milan, Italy
- William T Harvey
  - Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
- Simon Heumos
  - Quantitative Biology Center (QBiC), University of Tübingen, Tübingen, Germany
  - Biomedical Data Science, Department of Computer Science, University of Tübingen, Tübingen, Germany
- Kerstin Howe
  - Tree of Life, Wellcome Sanger Institute, Hinxton, Cambridge, UK
- Miten Jain
  - Northeastern University, Boston, MA, USA
- Tsung-Yu Lu
  - Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA
- Charles Markello
  - Genomics Institute, University of California, Santa Cruz, CA, USA
- Fergal J Martin
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
- Katherine M Munson
  - Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
- Adam M Novak
  - Genomics Institute, University of California, Santa Cruz, CA, USA
- Hugh E Olsen
  - Genomics Institute, University of California, Santa Cruz, CA, USA
- Trevor Pesout
  - Genomics Institute, University of California, Santa Cruz, CA, USA
- David Porubsky
  - Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
- Pjotr Prins
  - Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
- Jonas A Sibbesen
  - Center for Health Data Science, University of Copenhagen, Copenhagen, Denmark
- Jouni Sirén
  - Genomics Institute, University of California, Santa Cruz, CA, USA
- Chad Tomlinson
  - McDonnell Genome Institute, Washington University School of Medicine, St. Louis, MO, USA
- Flavia Villani
  - Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
- Mitchell R Vollger
  - Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
  - Division of Medical Genetics, University of Washington School of Medicine, Seattle, WA, USA
- Carl A Baker
  - Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
- Konstantinos Billis
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
- Sarah Cody
  - McDonnell Genome Institute, Washington University School of Medicine, St. Louis, MO, USA
- Robert M Cook-Deegan
  - Barrett and O'Connor Washington Center, Arizona State University, Washington, DC, USA
- Omar E Cornejo
  - Department of Ecology and Evolutionary Biology, University of California, Santa Cruz, CA, USA
- Mark Diekhans
  - Genomics Institute, University of California, Santa Cruz, CA, USA
- Peter Ebert
  - Institute for Medical Biometry and Bioinformatics, Medical Faculty, Heinrich Heine University, Düsseldorf, Germany
  - Center for Digital Medicine, Heinrich Heine University, Düsseldorf, Germany
  - Core Unit Bioinformatics, Medical Faculty, Heinrich Heine University, Düsseldorf, Germany
- Susan Fairley
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
- Olivier Fedrigo
  - Vertebrate Genome Laboratory, The Rockefeller University, New York, NY, USA
- Adam L Felsenfeld
  - National Institutes of Health (NIH)-National Human Genome Research Institute, Bethesda, MD, USA
- Giulio Formenti
  - Vertebrate Genome Laboratory, The Rockefeller University, New York, NY, USA
- Adam Frankish
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
- Yan Gao
  - Center for Computational and Genomic Medicine, The Children's Hospital of Philadelphia, Philadelphia, PA, USA
- Nanibaa' A Garrison
  - Institute for Society and Genetics, College of Letters and Science, University of California, Los Angeles, CA, USA
  - Institute for Precision Health, David Geffen School of Medicine, University of California, Los Angeles, CA, USA
  - Division of General Internal Medicine and Health Services Research, David Geffen School of Medicine, University of California, Los Angeles, CA, USA
- Carlos Garcia Giron
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
- Richard E Green
  - Department of Biomolecular Engineering, University of California, Santa Cruz, CA, USA
  - Dovetail Genomics, Scotts Valley, CA, USA
- Leanne Haggerty
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
- Kendra Hoekzema
  - Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
- Thibaut Hourlier
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
- Hanlee P Ji
  - Division of Oncology, Department of Medicine, Stanford University School of Medicine, Stanford, CA, USA
- Eimear E Kenny
  - Institute for Genomic Health, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Barbara A Koenig
  - Program in Bioethics and Institute for Human Genetics, University of California, San Francisco, CA, USA
- Jan O Korbel
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
  - Genome Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany
- Jennifer Kordosky
  - Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
- Sergey Koren
  - Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
- HoJoon Lee
  - Division of Oncology, Department of Medicine, Stanford University School of Medicine, Stanford, CA, USA
- Alexandra P Lewis
  - Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
- Hugo Magalhães
  - Institute for Medical Biometry and Bioinformatics, Medical Faculty, Heinrich Heine University, Düsseldorf, Germany
  - Center for Digital Medicine, Heinrich Heine University, Düsseldorf, Germany
- Santiago Marco-Sola
  - Computer Sciences Department, Barcelona Supercomputing Center, Barcelona, Spain
  - Departament d'Arquitectura de Computadors i Sistemes Operatius, Universitat Autònoma de Barcelona, Barcelona, Spain
- Pierre Marijon
  - Institute for Medical Biometry and Bioinformatics, Medical Faculty, Heinrich Heine University, Düsseldorf, Germany
  - Center for Digital Medicine, Heinrich Heine University, Düsseldorf, Germany
- Ann McCartney
  - Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
- Jennifer McDaniel
  - Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA
- Sergey Nurk
  - Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
- Nathan D Olson
  - Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA
- Alice B Popejoy
  - Department of Public Health Sciences, University of California, Davis, CA, USA
- Daniela Puiu
  - Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
- Mikko Rautiainen
  - Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
- Allison A Regier
  - McDonnell Genome Institute, Washington University School of Medicine, St. Louis, MO, USA
- Arang Rhie
  - Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
- Samuel Sacco
  - Department of Ecology and Evolutionary Biology, University of California, Santa Cruz, CA, USA
- Ashley D Sanders
  - Berlin Institute for Medical Systems Biology, Max Delbrück Center for Molecular Medicine in the Helmholtz Association, Berlin, Germany
- Valerie A Schneider
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
- Baergen I Schultz
  - National Institutes of Health (NIH)-National Human Genome Research Institute, Bethesda, MD, USA
- Michael W Smith
  - National Institutes of Health (NIH)-National Human Genome Research Institute, Bethesda, MD, USA
- Heidi J Sofia
  - National Institutes of Health (NIH)-National Human Genome Research Institute, Bethesda, MD, USA
- Ahmad N Abou Tayoun
  - Al Jalila Genomics Center of Excellence, Al Jalila Children's Specialty Hospital, Dubai, UAE
  - Center for Genomic Discovery, Mohammed Bin Rashid University of Medicine and Health Sciences, Dubai, UAE
- Françoise Thibaud-Nissen
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
- Francesca Floriana Tricomi
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
- Justin Wagner
  - Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA
- Brian Walenz
  - Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
- Aleksey V Zimin
  - Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
  - Center for Computational Biology, Johns Hopkins University, Baltimore, MD, USA
- Guillaume Bourque
  - Department of Human Genetics, McGill University, Montréal, Québec, Canada
  - Canadian Center for Computational Genomics, McGill University, Montréal, Québec, Canada
  - Institute for the Advanced Study of Human Biology (WPI-ASHBi), Kyoto University, Kyoto, Japan
- Mark J P Chaisson
  - Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA
- Paul Flicek
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
- Adam M Phillippy
  - Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
- Justin M Zook
  - Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA
- Evan E Eichler
  - Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
  - Howard Hughes Medical Institute, Chevy Chase, MD, USA
- David Haussler
  - Genomics Institute, University of California, Santa Cruz, CA, USA
  - Howard Hughes Medical Institute, Chevy Chase, MD, USA
- Ting Wang
  - McDonnell Genome Institute, Washington University School of Medicine, St. Louis, MO, USA
  - Department of Genetics, Washington University School of Medicine, St. Louis, MO, USA
- Erich D Jarvis
  - Vertebrate Genome Laboratory, The Rockefeller University, New York, NY, USA
  - Howard Hughes Medical Institute, Chevy Chase, MD, USA
  - Laboratory of Neurogenetics of Language, The Rockefeller University, New York, NY, USA
- Karen H Miga
  - Genomics Institute, University of California, Santa Cruz, CA, USA
- Erik Garrison
  - Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
- Tobias Marschall
  - Institute for Medical Biometry and Bioinformatics, Medical Faculty, Heinrich Heine University, Düsseldorf, Germany
  - Center for Digital Medicine, Heinrich Heine University, Düsseldorf, Germany
- Ira M Hall
  - Department of Genetics, Yale University School of Medicine, New Haven, CT, USA
  - Center for Genomic Health, Yale University School of Medicine, New Haven, CT, USA
- Heng Li
  - Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA, USA
  - Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
- Benedict Paten
  - Genomics Institute, University of California, Santa Cruz, CA, USA
39
Yao G, Chen W, Sun J, Wang X, Wang H, Meng T, Zhang L, Guo L. Gapless genome assembly of Fusarium verticillioides, a filamentous fungus threatening plant and human health. Sci Data 2023; 10:229. [PMID: 37081064 PMCID: PMC10119152 DOI: 10.1038/s41597-023-02145-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/10/2023] [Accepted: 04/11/2023] [Indexed: 04/22/2023] Open
Abstract
Fusarium verticillioides is a filamentous fungus that causes plant diseases and harms human health through cancer-inducing mycotoxins and life-threatening fusariosis. Given its threat to agriculture and public health, a genome assembly of this fungus is critical to understanding its pathobiology and developing antifungal drugs. Here, we report a gap-free genome assembly of F. verticillioides using PacBio HiFi data and high-throughput chromosome conformation capture (Hi-C) sequencing data. The assembled 42.0 Mb sequence contains eleven gapless chromosomes capturing all centromeres and 19 of the 22 telomeres. This assembly represents a significant improvement over the previous version in contiguity (contig N50: 4.3 Mb), completeness (BUSCO score: 99.0%) and correctness (QV: 88.8). A total of 15,230 protein-coding genes were predicted, 6.2% of which are newly annotated. In addition, we identified three-dimensional chromatin structures such as TAD-like structures and chromatin loops based on ultra-high-coverage Hi-C data. This gap-free genome of F. verticillioides is an excellent resource for a more comprehensive understanding of the mechanisms of fungal genome evolution, mycotoxin production and pathogenesis in plant and human hosts.
Affiliation(s)
- Gang Yao
  - Peking University Institute of Advanced Agricultural Sciences, Shandong Laboratory of Advanced Agricultural Sciences in Weifang, Weifang, Shandong, 261325, China
- Weikai Chen
  - Peking University Institute of Advanced Agricultural Sciences, Shandong Laboratory of Advanced Agricultural Sciences in Weifang, Weifang, Shandong, 261325, China
- Jie Sun
  - Peking University Institute of Advanced Agricultural Sciences, Shandong Laboratory of Advanced Agricultural Sciences in Weifang, Weifang, Shandong, 261325, China
- Xiangfeng Wang
  - Peking University Institute of Advanced Agricultural Sciences, Shandong Laboratory of Advanced Agricultural Sciences in Weifang, Weifang, Shandong, 261325, China
- Huan Wang
  - Peking University Institute of Advanced Agricultural Sciences, Shandong Laboratory of Advanced Agricultural Sciences in Weifang, Weifang, Shandong, 261325, China
- Tan Meng
  - Peking University Institute of Advanced Agricultural Sciences, Shandong Laboratory of Advanced Agricultural Sciences in Weifang, Weifang, Shandong, 261325, China
  - China Agricultural University, College of Information and Electrical Engineering, Beijing, 100091, China
- Lili Zhang
  - Weifang Institute of Technology, College of Modern Agriculture and Environment, Weifang, Shandong, 262500, China
- Li Guo
  - Peking University Institute of Advanced Agricultural Sciences, Shandong Laboratory of Advanced Agricultural Sciences in Weifang, Weifang, Shandong, 261325, China
40
Olson ND, Wagner J, Dwarshuis N, Miga KH, Sedlazeck FJ, Salit M, Zook JM. Variant calling and benchmarking in an era of complete human genome sequences. Nat Rev Genet 2023:10.1038/s41576-023-00590-0. [PMID: 37059810 DOI: 10.1038/s41576-023-00590-0] [Citation(s) in RCA: 20] [Impact Index Per Article: 20.0] [Accepted: 02/22/2023] [Indexed: 04/16/2023]
Abstract
Genetic variant calling from DNA sequencing has enabled understanding of germline variation in hundreds of thousands of humans. Sequencing technologies and variant-calling methods have advanced rapidly, routinely providing reliable variant calls in most of the human genome. We describe how advances in long reads, deep learning, de novo assembly and pangenomes have expanded access to variant calls in increasingly challenging, repetitive genomic regions, including medically relevant regions, and how new benchmark sets and benchmarking methods illuminate their strengths and limitations. Finally, we explore the possible future of more complete characterization of human genome variation in light of the recent completion of a telomere-to-telomere human genome reference assembly and human pangenomes, and we consider the innovations needed to benchmark their newly accessible repetitive regions and complex variants.
Affiliation(s)
- Nathan D Olson
- Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA
- Justin Wagner
- Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA
- Nathan Dwarshuis
- Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA
- Karen H Miga
- UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, CA, USA
- Fritz J Sedlazeck
- Baylor College of Medicine, Human Genome Sequencing Center, Houston, TX, USA
- Justin M Zook
- Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA.
|
41
|
Kirsche M, Prabhu G, Sherman R, Ni B, Battle A, Aganezov S, Schatz MC. Jasmine and Iris: population-scale structural variant comparison and analysis. Nat Methods 2023; 20:408-417. [PMID: 36658279 PMCID: PMC10006329 DOI: 10.1038/s41592-022-01753-3] [Citation(s) in RCA: 26] [Impact Index Per Article: 26.0] [Received: 05/27/2021] [Accepted: 12/15/2022] [Indexed: 01/21/2023]
Abstract
The availability of long reads is revolutionizing studies of structural variants (SVs). However, because SVs vary across individuals and are discovered through imprecise read technologies and methods, they can be difficult to compare. Addressing this, we present Jasmine and Iris ( https://github.com/mkirsche/Jasmine/ ), for fast and accurate SV refinement, comparison and population analysis. Using an SV proximity graph, Jasmine outperforms six widely used comparison methods, including reducing the rate of Mendelian discordance in trio datasets by more than fivefold, and reveals a set of high-confidence de novo SVs confirmed by multiple technologies. We also present a unified callset of 122,813 SVs and 82,379 indels from 31 samples of diverse ancestry sequenced with long reads. We genotype these variants in 1,317 samples from the 1000 Genomes Project and the Genotype-Tissue Expression project with DNA and RNA-sequencing data and assess their widespread impact on gene expression, including within medically relevant genes.
Affiliation(s)
- Melanie Kirsche
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
- Gautam Prabhu
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
- Department of Biology, Johns Hopkins University, Baltimore, MD, USA
- Rachel Sherman
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
- Bohan Ni
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
- Alexis Battle
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
- Sergey Aganezov
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA.
- Michael C Schatz
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA.
- Department of Biology, Johns Hopkins University, Baltimore, MD, USA.
|
42
|
Firtina C, Park J, Alser M, Kim JS, Cali D, Shahroodi T, Ghiasi N, Singh G, Kanellopoulos K, Alkan C, Mutlu O. BLEND: a fast, memory-efficient and accurate mechanism to find fuzzy seed matches in genome analysis. NAR Genom Bioinform 2023; 5:lqad004. [PMID: 36685727 PMCID: PMC9853099 DOI: 10.1093/nargab/lqad004] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/29/2022] [Revised: 12/16/2022] [Accepted: 01/10/2023] [Indexed: 01/22/2023] Open
Abstract
Generating the hash values of short subsequences, called seeds, enables quickly identifying similarities between genomic sequences by matching seeds with a single lookup of their hash values. However, these hash values can be used only for finding exact-matching seeds as the conventional hashing methods assign distinct hash values for different seeds, including highly similar seeds. Finding only exact-matching seeds causes either (i) increasing the use of the costly sequence alignment or (ii) limited sensitivity. We introduce BLEND, the first efficient and accurate mechanism that can identify both exact-matching and highly similar seeds with a single lookup of their hash values, called fuzzy seed matches. BLEND (i) utilizes a technique called SimHash, that can generate the same hash value for similar sets, and (ii) provides the proper mechanisms for using seeds as sets with the SimHash technique to find fuzzy seed matches efficiently. We show the benefits of BLEND when used in read overlapping and read mapping. For read overlapping, BLEND is faster by 2.4×-83.9× (on average 19.3×), has a lower memory footprint by 0.9×-14.1× (on average 3.8×), and finds higher quality overlaps leading to accurate de novo assemblies than the state-of-the-art tool, minimap2. For read mapping, BLEND is faster by 0.8×-4.1× (on average 1.7×) than minimap2. Source code is available at https://github.com/CMU-SAFARI/BLEND.
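The SimHash idea mentioned in this abstract can be illustrated with a minimal sketch. This is not BLEND's implementation: the function names `simhash` and `hamming`, the use of BLAKE2b as the per-item hash, and the 32-bit width are all assumptions made for illustration. Each item of a set votes on every bit of the output, so sets that share most of their items tend to produce hash values that agree in most (or all) bits.

```python
import hashlib


def simhash(items, bits=32):
    """Illustrative SimHash of a collection of strings (hypothetical helper).

    Every item is hashed; each bit position receives +1 if the item's hash
    has that bit set and -1 otherwise. The output bit is 1 where the total
    vote is positive. `bits` must not exceed 64 (the digest size used here).
    """
    counts = [0] * bits
    for item in items:
        digest = hashlib.blake2b(item.encode(), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        for b in range(bits):
            counts[b] += 1 if (h >> b) & 1 else -1
    return sum(1 << b for b in range(bits) if counts[b] > 0)


def hamming(a, b):
    """Number of differing bits between two hash values."""
    return bin(a ^ b).count("1")


# Two seeds represented as sets of k-mers that differ in a single element.
seed_a = ["ACG", "CGT", "GTA", "TAC", "ACC"]
seed_b = ["ACG", "CGT", "GTA", "TAC", "ACT"]
print(hamming(simhash(seed_a), simhash(seed_b)))
```

The vote-based construction is order-independent, so the same set always yields the same value regardless of how its items are enumerated.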
Affiliation(s)
- Can Firtina
- To whom correspondence should be addressed. Tel: +41 44 632 64 29;
- Jisung Park
- ETH Zurich, Zurich 8092, Switzerland; POSTECH, Pohang 37673, Republic of Korea
- Can Alkan
- Bilkent University, Ankara 06800, Turkey
- Onur Mutlu
- Correspondence may also be addressed to Onur Mutlu. Tel: +41 44 632 64 29;
|
43
|
Alonge M, Lebeigle L, Kirsche M, Jenike K, Ou S, Aganezov S, Wang X, Lippman ZB, Schatz MC, Soyk S. Automated assembly scaffolding using RagTag elevates a new tomato system for high-throughput genome editing. Genome Biol 2022; 23:258. [PMID: 36522651 PMCID: PMC9753292 DOI: 10.1186/s13059-022-02823-7] [Citation(s) in RCA: 85] [Impact Index Per Article: 42.5] [Received: 12/08/2021] [Accepted: 11/28/2022] [Indexed: 12/23/2022] Open
Abstract
Advancing crop genomics requires efficient genetic systems enabled by high-quality personalized genome assemblies. Here, we introduce RagTag, a toolset for automating assembly scaffolding and patching, and we establish chromosome-scale reference genomes for the widely used tomato genotype M82 along with Sweet-100, a new rapid-cycling genotype that we developed to accelerate functional genomics and genome editing in tomato. This work outlines strategies to rapidly expand genetic systems and genomic resources in other plant species.
Affiliation(s)
- Michael Alonge
- Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218 USA
- Ludivine Lebeigle
- Center for Integrative Genomics, University of Lausanne, CH-1015 Lausanne, Switzerland
- Melanie Kirsche
- Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218 USA
- Katie Jenike
- Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218 USA
- Shujun Ou
- Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218 USA
- Sergey Aganezov
- Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218 USA
- Xingang Wang
- Cold Spring Harbor Laboratory, Cold Spring Harbor, NY 11724 USA
- Zachary B. Lippman
- Cold Spring Harbor Laboratory, Cold Spring Harbor, NY 11724 USA; Howard Hughes Medical Institute, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY 11724 USA
- Michael C. Schatz
- Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218 USA; Cold Spring Harbor Laboratory, Cold Spring Harbor, NY 11724 USA; Department of Biology, Johns Hopkins University, Baltimore, MD 21218 USA
- Sebastian Soyk
- Center for Integrative Genomics, University of Lausanne, CH-1015 Lausanne, Switzerland
|
44
|
Liu Y, Zhang M, Wang R, Li B, Jiang Y, Sun M, Chang Y, Wu J. Comparison of structural variants detected by PacBio-CLR and ONT sequencing in pear. BMC Genomics 2022; 23:830. [PMID: 36517766 PMCID: PMC9753399 DOI: 10.1186/s12864-022-09074-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/23/2022] [Accepted: 12/06/2022] [Indexed: 12/23/2022] Open
Abstract
BACKGROUND Structural variations (SVs) have recently become a topic of great interest in the area of genetic diversity and trait regulation. As genomic sequencing technologies have rapidly advanced, longer reads have been used to identify SVs at high resolution and with increased accuracy. It is important to choose a suitable sequencing platform and appropriate sequencing depth for SV detection in the pear genome. RESULTS In this study, two types of long reads from sequencing platforms, continuous long reads from Pacific Biosciences (PB-CLR) and long reads from Oxford Nanopore Technologies (ONT), were used to comprehensively analyze and compare SVs in the pear genome. The mapping rate of long reads was higher when the program Minimap2 rather than the other three mapping tools (NGMLR, LRA and Winnowmap2) was used. Three SV detection programs (Sniffles_v2, CuteSV, and Nanovar) were compared, and Nanovar had the highest sensitivity in detecting SVs at low sequencing depth (10-15×). A sequencing depth of 15× was suitable for SV detection in the pear genome using Nanovar. SVs detected by Sniffles_v2 and CuteSV with ONT reads had a high overlap with presence/absence variations (PAVs) in the pear cultivars 'Bartlett' and 'Dangshansuli', both of them with 38% of insertions and 55% of deletions overlapping with PAVs at a sequencing depth of 30×. For the ONT sequencing data, over 37,526 SVs spanning ~28 Mb were identified by all three software packages for the 'Bartlett' and 'Dangshansuli' genomes. Those SVs were annotated and combined with transcriptome profiles derived from 'Bartlett' and 'Dangshansuli' fruit flesh at 60 days after cross-pollination. Several genes related to levels of sugars, acid, stone cells, and aromatic compounds were identified among the SVs. Transcription factors were then predicted among those genes, and results included bHLH, ERF, and MYB genes. CONCLUSION SV detection is of great significance in exploring phenotypic differences between pear varieties. Our study provides a framework for assessment of different SV software packages and sequencing platforms that can be applied in other plant genome studies. Based on these analyses, ONT sequencing data was determined to be more suitable than PB-CLR for SV detection in the pear genome. This analysis model will facilitate screening of genes related to agronomic traits in other crops.
Affiliation(s)
- Yueyuan Liu
- State Key Laboratory of Crop Genetics and Germplasm Enhancement, College of Horticulture, Nanjing Agricultural University, Nanjing, 210095 Jiangsu China
- Mingyue Zhang
- College of Horticultural Science and Engineering, Shandong Agricultural University, Taian, 271018 Shandong China
- Runze Wang
- State Key Laboratory of Crop Genetics and Germplasm Enhancement, College of Horticulture, Nanjing Agricultural University, Nanjing, 210095 Jiangsu China
- Benping Li
- Novogene Bioinformatics Institute, Beijing, China
- Yafei Jiang
- Novogene Bioinformatics Institute, Beijing, China
- Manyi Sun
- State Key Laboratory of Crop Genetics and Germplasm Enhancement, College of Horticulture, Nanjing Agricultural University, Nanjing, 210095 Jiangsu China
- Yaojun Chang
- State Key Laboratory of Crop Genetics and Germplasm Enhancement, College of Horticulture, Nanjing Agricultural University, Nanjing, 210095 Jiangsu China
- Jun Wu
- State Key Laboratory of Crop Genetics and Germplasm Enhancement, College of Horticulture, Nanjing Agricultural University, Nanjing, 210095 Jiangsu China
|
45
|
Abstract
Minimizers are widely used to sample representative k-mers from biological sequences in many applications, such as read mapping and taxonomy prediction. In most scenarios, having the minimizer scheme select as few k-mer positions as possible (i.e., having a low density) is desirable to reduce computation and memory cost. Despite the growing interest in minimizers, learning an effective scheme with optimal density is still an open question, as it requires solving an apparently challenging discrete optimization problem on the permutation space of k-mer orderings. Most existing schemes are designed to work well in expectation over random sequences, which have limited applicability to many practical tools. On the other hand, several methods have been proposed to construct minimizer schemes for a specific target sequence. These methods, however, only approximate the original objective with likewise discrete surrogate tasks that are not able to significantly improve the density performance. This article introduces the first continuous relaxation of the density minimizing objective, DeepMinimizer, which employs a novel Deep Learning twin architecture to simultaneously ensure both validity and performance of the minimizer scheme. Our surrogate objective is fully differentiable and, therefore, amenable to efficient gradient-based optimization using GPU computing. Finally, we demonstrate that DeepMinimizer discovers minimizer schemes that significantly outperform state-of-the-art constructions on human genomic sequences.
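To make the notion of density in this abstract concrete, the following is a minimal sketch of a window minimizer scheme under a plain lexicographic k-mer ordering. It is not DeepMinimizer: the function names `minimizer_positions` and `density` and the leftmost tie-breaking rule are assumptions chosen for illustration. Density here is the fraction of k-mer positions that the scheme selects.

```python
def minimizer_positions(seq, k, w, order=None):
    """Start positions of k-mers selected by a (w, k) window minimizer scheme.

    In every window of w consecutive k-mers, the smallest k-mer under
    `order` is selected; ties go to the leftmost position. The default
    ordering is lexicographic (an assumption for this sketch).
    """
    if order is None:
        order = lambda kmer: kmer
    n_kmers = len(seq) - k + 1
    selected = set()
    for start in range(n_kmers - w + 1):
        best = min(range(start, start + w),
                   key=lambda i: (order(seq[i:i + k]), i))
        selected.add(best)
    return sorted(selected)


def density(seq, k, w, order=None):
    """Fraction of k-mer positions selected by the scheme on `seq`."""
    n_kmers = len(seq) - k + 1
    if n_kmers <= 0:
        return 0.0
    return len(minimizer_positions(seq, k, w, order)) / n_kmers


print(minimizer_positions("ACGTACGT", k=3, w=3))  # [0, 1, 4]
print(density("ACGTACGT", k=3, w=3))              # 0.5
```

Passing a different `order` function changes which k-mers are selected, which is the degree of freedom that sequence-specific schemes try to optimize.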
Affiliation(s)
- Minh Hoang
- Computer Science Department, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA. Address correspondence to: Minh Hoang, Computer Science Department, Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, PA 15213, USA
- Hongyu Zheng
- Computational Biology Department, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA
|
46
|
Jain C, Gibney D, Thankachan SV. Algorithms for Colinear Chaining with Overlaps and Gap Costs. J Comput Biol 2022; 29:1237-1251. [PMID: 36351202 DOI: 10.1089/cmb.2022.0266] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/11/2022] Open
Abstract
Colinear chaining has proven to be a powerful heuristic for finding near-optimal alignments of long DNA sequences (e.g., long reads or a genome assembly) to a reference. It is used as an intermediate step in several alignment tools that employ a seed-chain-extend strategy. Despite this popularity, efficient subquadratic time algorithms for the general case where chains support anchor overlaps and gap costs are not currently known. We present algorithms to solve the colinear chaining problem with anchor overlaps and gap costs in Õ(n) time, where n denotes the count of anchors. The degree of the polylogarithmic factor depends on the type of anchors used (e.g., fixed-length anchors) and the type of precedence an optimal anchor chain is required to satisfy. We also establish the first theoretical connection between colinear chaining cost and edit distance. Specifically, we prove that for a fixed set of anchors under a carefully designed chaining cost function, the optimal "anchored" edit distance equals the optimal colinear chaining cost. The anchored edit distance for two sequences and a set of anchors is only a slight generalization of the standard edit distance. It adds an additional cost of one to an alignment of two matching symbols that are not supported by any anchor. Finally, we demonstrate experimentally that optimal colinear chaining cost under the proposed cost function can be computed orders of magnitude faster than edit distance, and achieves correlation coefficient >0.9 with edit distance for closely as well as distantly related sequences.
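For orientation, colinear chaining can be sketched with a simple quadratic-time dynamic program. This is deliberately not the Õ(n) algorithm of the paper, and it does not support anchor overlaps: the function name `chain_score`, the anchor representation `(query_start, target_start, length)`, and the gap cost (the absolute difference between the query and target offsets of consecutive anchors) are assumptions made for illustration.

```python
def chain_score(anchors, gap_weight=1.0):
    """Best colinear chain score over anchors (query_start, target_start, length).

    Anchor j may precede anchor i only if it ends at or before i starts in
    both sequences (no overlaps in this sketch). A chain scores the sum of
    its anchor lengths minus gap_weight times the diagonal shift between
    consecutive anchors. Runs in O(n^2) time for n anchors.
    """
    anchors = sorted(anchors)
    best = []  # best[i]: best score of a chain ending at anchor i
    for i, (qi, ti, li) in enumerate(anchors):
        score = li
        for j in range(i):
            qj, tj, lj = anchors[j]
            if qj + lj <= qi and tj + lj <= ti:
                gap = abs((qi - qj) - (ti - tj))
                score = max(score, best[j] + li - gap_weight * gap)
        best.append(score)
    return max(best, default=0)


# Three colinear anchors; the last one is shifted by 2 off the diagonal.
print(chain_score([(0, 0, 5), (5, 5, 5), (12, 10, 5)]))  # 13.0
```

Sorting by query start guarantees that every valid predecessor of an anchor appears earlier in the list, which is what lets the inner loop consider only indices below `i`.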
Affiliation(s)
- Chirag Jain
- Department of Computational and Data Sciences, Indian Institute of Science, Bengaluru, India
- Daniel Gibney
- School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, Georgia, USA
- Sharma V. Thankachan
- Department of Computer Science, University of Central Florida, Orlando, Florida, USA
|
47
|
Das A, Schatz MC. Sketching and sampling approaches for fast and accurate long read classification. BMC Bioinformatics 2022; 23:452. [PMID: 36316646 PMCID: PMC9624007 DOI: 10.1186/s12859-022-05014-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/01/2022] [Accepted: 10/27/2022] [Indexed: 11/05/2022] Open
Abstract
BACKGROUND In modern sequencing experiments, quickly and accurately identifying the sources of the reads is a crucial need. In metagenomics, where each read comes from one of potentially many members of a community, it can be important to identify the exact species the read is from. In other settings, it is important to distinguish which reads are from the targeted sample and which are from potential contaminants. In both cases, identification of the correct source of a read enables further investigation of relevant reads, while minimizing wasted work. This task is particularly challenging for long reads, which can have a substantial error rate that obscures the origins of each read. RESULTS Existing tools for the read classification problem are often alignment or index-based, but such methods can have large time and/or space overheads. In this work, we investigate the effectiveness of several sampling and sketching-based approaches for read classification. In these approaches, a chosen sampling or sketching algorithm is used to generate a reduced representation (a "screen") of potential source genomes for a query readset before reads are streamed in and compared against this screen. Using a query read's similarity to the elements of the screen, the methods predict the source of the read. Such an approach requires limited pre-processing, stores and works with only a subset of the input data, and is able to perform classification with a high degree of accuracy. CONCLUSIONS The sampling and sketching approaches investigated include uniform sampling, methods based on MinHash and its weighted and order variants, a minimizer-based technique, and a novel clustering-based sketching approach. We demonstrate the effectiveness of these techniques both in identifying the source microbial genomes for reads from a metagenomic long read sequencing experiment, and in distinguishing between long reads from organisms of interest and potential contaminant reads. We then compare these approaches to existing alignment, index and sketching-based tools for read classification, and demonstrate how such a method is a viable alternative for determining the source of query reads. Finally, we present a reference implementation of these approaches at https://github.com/arun96/sketching.
Affiliation(s)
- Arun Das
- Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218 USA
- Michael C. Schatz
- Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218 USA
|
48
|
Mc Cartney AM, Shafin K, Alonge M, Bzikadze AV, Formenti G, Fungtammasan A, Howe K, Jain C, Koren S, Logsdon GA, Miga KH, Mikheenko A, Paten B, Shumate A, Soto DC, Sović I, Wood JMD, Zook JM, Phillippy AM, Rhie A. Chasing perfection: validation and polishing strategies for telomere-to-telomere genome assemblies. Nat Methods 2022; 19:687-695. [PMID: 35361931 PMCID: PMC9812399 DOI: 10.1038/s41592-022-01440-3] [Citation(s) in RCA: 35] [Impact Index Per Article: 17.5] [Received: 07/13/2021] [Accepted: 03/04/2022] [Indexed: 01/07/2023]
Abstract
Advances in long-read sequencing technologies and genome assembly methods have enabled the recent completion of the first telomere-to-telomere human genome assembly, which resolves complex segmental duplications and large tandem repeats, including centromeric satellite arrays in a complete hydatidiform mole (CHM13). Although derived from highly accurate sequences, evaluation revealed evidence of small errors and structural misassemblies in the initial draft assembly. To correct these errors, we designed a new repeat-aware polishing strategy that made accurate assembly corrections in large repeats without overcorrection, ultimately fixing 51% of the existing errors and improving the assembly quality value from 70.2 to 73.9 measured from PacBio high-fidelity and Illumina k-mers. By comparing our results to standard automated polishing tools, we outline common polishing errors and offer practical suggestions for genome projects with limited resources. We also show how sequencing biases in both high-fidelity and Oxford Nanopore Technologies reads cause signature assembly errors that can be corrected with a diverse panel of sequencing technologies.
Affiliation(s)
- Ann M. Mc Cartney
- Genome Informatics Section, Computational and Statistical Genomics Branch, NHGRI, NIH
- Kishwar Shafin
- UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, CA, USA
- Michael Alonge
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
- Andrey V. Bzikadze
- Graduate Program in Bioinformatics and Systems Biology, University of California San Diego, La Jolla, CA, USA
- Giulio Formenti
- Laboratory of Neurogenetics of Language and The Vertebrate Genome Lab, The Rockefeller University, New York, NY, USA
- Chirag Jain
- Genome Informatics Section, Computational and Statistical Genomics Branch, NHGRI, NIH; Department of Computational and Data Sciences, Indian Institute of Science, Bangalore KA, India
- Sergey Koren
- Genome Informatics Section, Computational and Statistical Genomics Branch, NHGRI, NIH
- Glennis A. Logsdon
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
- Karen H. Miga
- UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, CA, USA
- Alla Mikheenko
- Center for Algorithmic Biotechnology, Institute of Translational Biomedicine, Saint Petersburg State University, Saint Petersburg, Russia
- Benedict Paten
- UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, CA, USA
- Alaina Shumate
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
- Daniela C. Soto
- Genome Center, MIND Institute, Department of Biochemistry and Molecular Medicine, University of California, Davis, CA, USA
- Ivan Sović
- Pacific Biosciences, Menlo Park, CA, USA; Digital BioLogic d.o.o., Ivanić-Grad, Croatia
- Justin M. Zook
- Biosystems and Biomaterials Division, National Institute of Standards and Technology, Gaithersburg, MD, USA
- Adam M. Phillippy
- Genome Informatics Section, Computational and Statistical Genomics Branch, NHGRI, NIH (corresponding author)
- Arang Rhie
- Genome Informatics Section, Computational and Statistical Genomics Branch, NHGRI, NIH (corresponding author)
|
49
|
Formenti G, Rhie A, Walenz BP, Thibaud-Nissen F, Shafin K, Koren S, Myers EW, Jarvis ED, Phillippy AM. Merfin: improved variant filtering, assembly evaluation and polishing via k-mer validation. Nat Methods 2022; 19:696-704. [PMID: 35361932 DOI: 10.1038/s41592-022-01445-y] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 07/13/2021] [Accepted: 03/07/2022] [Indexed: 12/15/2022]
Abstract
Variant calling has been widely used for genotyping and for improving the consensus accuracy of long-read assemblies. Variant calls are commonly hard-filtered with user-defined cutoffs. However, it is impossible to define a single set of optimal cutoffs, as the calls heavily depend on the quality of the reads, the variant caller of choice and the quality of the unpolished assembly. Here, we introduce Merfin, a k-mer based variant-filtering algorithm for improved accuracy in genotyping and genome assembly polishing. Merfin evaluates each variant based on the expected k-mer multiplicity in the reads, independently of the quality of the read alignment and variant caller's internal score. Merfin increased the precision of genotyped calls in several benchmarks, improved consensus accuracy and reduced frameshift errors when applied to human and nonhuman assemblies built from Pacific Biosciences HiFi and continuous long reads or Oxford Nanopore reads, including the first complete human genome. Moreover, we introduce assembly quality and completeness metrics that account for the expected genomic copy numbers.
|
50
|
Fang L, Wang K. Polishing high-quality genome assemblies. Nat Methods 2022. [PMID: 35610477 DOI: 10.1038/s41592-022-01515-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/09/2022]
|