1
|
Yoo D, Rhie A, Hebbar P, Antonacci F, Logsdon GA, Solar SJ, Antipov D, Pickett BD, Safonova Y, Montinaro F, Luo Y, Malukiewicz J, Storer JM, Lin J, Sequeira AN, Mangan RJ, Hickey G, Monfort Anez G, Balachandran P, Bankevich A, Beck CR, Biddanda A, Borchers M, Bouffard GG, Brannan E, Brooks SY, Carbone L, Carrel L, Chan AP, Crawford J, Diekhans M, Engelbrecht E, Feschotte C, Formenti G, Garcia GH, de Gennaro L, Gilbert D, Green RE, Guarracino A, Gupta I, Haddad D, Han J, Harris RS, Hartley GA, Harvey WT, Hiller M, Hoekzema K, Houck ML, Jeong H, Kamali K, Kellis M, Kille B, Lee C, Lee Y, Lees W, Lewis AP, Li Q, Loftus M, Loh YHE, Loucks H, Ma J, Mao Y, Martinez JFI, Masterson P, McCoy RC, McGrath B, McKinney S, Meyer BS, Miga KH, Mohanty SK, Munson KM, Pal K, Pennell M, Pevzner PA, Porubsky D, Potapova T, Ringeling FR, Rocha JL, Ryder OA, Sacco S, Saha S, Sasaki T, Schatz MC, Schork NJ, Shanks C, Smeds L, Son DR, Steiner C, Sweeten AP, Tassia MG, Thibaud-Nissen F, Torres-González E, Trivedi M, Wei W, Wertz J, Yang M, Zhang P, Zhang S, Zhang Y, Zhang Z, Zhao SA, Zhu Y, Jarvis ED, Gerton JL, Rivas-González I, Paten B, Szpiech ZA, Huber CD, Lenz TL, Konkel MK, Yi SV, Canzar S, Watson CT, Sudmant PH, Molloy E, Garrison E, Lowe CB, Ventura M, O'Neill RJ, Koren S, Makova KD, Phillippy AM, Eichler EE. Complete sequencing of ape genomes. Nature 2025; 641:401-418. [PMID: 40205052 PMCID: PMC12058530 DOI: 10.1038/s41586-025-08816-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/31/2024] [Accepted: 02/19/2025] [Indexed: 04/11/2025]
Abstract
The most dynamic and repetitive regions of great ape genomes have traditionally been excluded from comparative studies [1-3]. Consequently, our understanding of the evolution of our species is incomplete. Here we present haplotype-resolved reference genomes and comparative analyses of six ape species: chimpanzee, bonobo, gorilla, Bornean orangutan, Sumatran orangutan and siamang. We achieve chromosome-level contiguity with substantial sequence accuracy (<1 error in 2.7 megabases) and completely sequence 215 gapless chromosomes telomere-to-telomere. We resolve challenging regions, such as the major histocompatibility complex and immunoglobulin loci, to provide in-depth evolutionary insights. Comparative analyses enabled investigations of the evolution and diversity of regions previously uncharacterized or incompletely studied without bias from mapping to the human reference genome. Such regions include newly minted gene families in lineage-specific segmental duplications, centromeric DNA, acrocentric chromosomes and subterminal heterochromatin. This resource serves as a comprehensive baseline for future evolutionary studies of humans and our closest living ape relatives.
Affiliation(s)
- DongAhn Yoo
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
| | - Arang Rhie
- Genome Informatics Section, Center for Genomics and Data Science Research, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | - Prajna Hebbar
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA, USA
| | - Francesca Antonacci
- Department of Biosciences, Biotechnology and Environment, University of Bari, Bari, Italy
| | - Glennis A Logsdon
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
- Department of Genetics, Epigenetics Institute, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
| | - Steven J Solar
- Genome Informatics Section, Center for Genomics and Data Science Research, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | - Dmitry Antipov
- Genome Informatics Section, Center for Genomics and Data Science Research, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | - Brandon D Pickett
- Genome Informatics Section, Center for Genomics and Data Science Research, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | - Yana Safonova
- Computer Science and Engineering Department, Huck Institutes of Life Sciences, Pennsylvania State University, State College, PA, USA
| | - Francesco Montinaro
- Department of Biosciences, Biotechnology and Environment, University of Bari, Bari, Italy
- Institute of Genomics, University of Tartu, Tartu, Estonia
| | - Yanting Luo
- Department of Molecular Genetics and Microbiology, Duke University Medical Center, Durham, NC, USA
| | - Joanna Malukiewicz
- Research Unit for Evolutionary Immunogenomics, Department of Biology, University of Hamburg, Hamburg, Germany
- German Primate Center, Primate Genetics Laboratory, Goettingen, Germany
| | - Jessica M Storer
- Institute for Systems Genomics, University of Connecticut, Storrs, CT, USA
| | - Jiadong Lin
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
| | | | - Riley J Mangan
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, USA
- The Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Genetics Training Program, Harvard Medical School, Boston, MA, USA
| | - Glenn Hickey
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA, USA
| | | | | | - Anton Bankevich
- Computer Science and Engineering Department, Huck Institutes of Life Sciences, Pennsylvania State University, State College, PA, USA
| | - Christine R Beck
- Institute for Systems Genomics, University of Connecticut, Storrs, CT, USA
- The Jackson Laboratory for Genomic Medicine, Farmington, CT, USA
- Department of Genetics and Genome Sciences, University of Connecticut Health Center, Farmington, CT, USA
| | - Arjun Biddanda
- Department of Biology, Johns Hopkins University, Baltimore, MD, USA
| | | | - Gerard G Bouffard
- NIH Intramural Sequencing Center, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | - Emry Brannan
- Department of Molecular and Cell Biology, University of Connecticut, Storrs, CT, USA
| | - Shelise Y Brooks
- NIH Intramural Sequencing Center, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | - Lucia Carbone
- Department of Medicine, KCVI, Oregon Health Sciences University, Portland, OR, USA
- Division of Genetics, Oregon National Primate Research Center, Beaverton, OR, USA
| | - Laura Carrel
- PSU Medical School, Penn State University School of Medicine, Hershey, PA, USA
| | - Agnes P Chan
- The Translational Genomics Research Institute, City of Hope National Medical Center, Phoenix, AZ, USA
| | - Juyun Crawford
- NIH Intramural Sequencing Center, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | - Mark Diekhans
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA, USA
| | - Eric Engelbrecht
- Department of Biochemistry and Molecular Genetics, School of Medicine, University of Louisville, Louisville, KY, USA
| | - Cedric Feschotte
- Department of Molecular Biology and Genetics, Cornell University, Ithaca, NY, USA
| | - Giulio Formenti
- Vertebrate Genome Laboratory, The Rockefeller University, New York, NY, USA
| | - Gage H Garcia
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
| | - Luciana de Gennaro
- Department of Biosciences, Biotechnology and Environment, University of Bari, Bari, Italy
| | - David Gilbert
- San Diego Biomedical Research Institute, San Diego, CA, USA
| | - Richard E Green
- Department of Biomolecular Engineering, University of California, Santa Cruz, Santa Cruz, CA, USA
| | - Andrea Guarracino
- Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
| | - Ishaan Gupta
- Department of Computer Science and Engineering, University of California, San Diego, San Diego, CA, USA
| | - Diana Haddad
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
| | - Junmin Han
- Bio-X Institutes, Key Laboratory for the Genetics of Developmental and Neuropsychiatric Disorders, Ministry of Education, Shanghai Jiao Tong University, Shanghai, China
| | - Robert S Harris
- Department of Biology, Penn State University, University Park, PA, USA
| | | | - William T Harvey
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
| | - Michael Hiller
- LOEWE Centre for Translational Biodiversity Genomics, Frankfurt, Germany
- Senckenberg Research Institute, Frankfurt, Germany
- Institute of Cell Biology and Neuroscience, Faculty of Biosciences, Goethe University Frankfurt, Frankfurt, Germany
| | - Kendra Hoekzema
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
| | | | - Hyeonsoo Jeong
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
| | - Kaivan Kamali
- Department of Biology, Penn State University, University Park, PA, USA
| | - Manolis Kellis
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, USA
- The Broad Institute of MIT and Harvard, Cambridge, MA, USA
| | - Bryce Kille
- Department of Computer Science, Rice University, Houston, TX, USA
| | - Chul Lee
- Laboratory of Neurogenetics of Language, The Rockefeller University, New York, NY, USA
| | - Youngho Lee
- Laboratory of Bioinformatics and Population Genetics, Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, Republic of Korea
| | - William Lees
- Department of Biochemistry and Molecular Genetics, School of Medicine, University of Louisville, Louisville, KY, USA
- Bioengineering Program, Faculty of Engineering, Bar-Ilan University, Ramat Gan, Israel
| | - Alexandra P Lewis
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
| | - Qiuhui Li
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
| | - Mark Loftus
- Department of Genetics and Biochemistry, Clemson University, Clemson, SC, USA
- Center for Human Genetics, Clemson University, Greenwood, SC, USA
| | - Yong Hwee Eddie Loh
- Neuroscience Research Institute, University of California, Santa Barbara, Santa Barbara, CA, USA
| | - Hailey Loucks
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA, USA
| | - Jian Ma
- Ray and Stephanie Lane Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA
| | - Yafei Mao
- Bio-X Institutes, Key Laboratory for the Genetics of Developmental and Neuropsychiatric Disorders, Ministry of Education, Shanghai Jiao Tong University, Shanghai, China
- Center for Genomic Research, International Institutes of Medicine, Fourth Affiliated Hospital, Zhejiang University, Yiwu, China
- Shanghai Jiao Tong University Chongqing Research Institute, Chongqing, China
| | - Juan F I Martinez
- Computer Science and Engineering Department, Huck Institutes of Life Sciences, Pennsylvania State University, State College, PA, USA
| | - Patrick Masterson
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
| | - Rajiv C McCoy
- Department of Biology, Johns Hopkins University, Baltimore, MD, USA
| | - Barbara McGrath
- Department of Biology, Penn State University, University Park, PA, USA
| | - Sean McKinney
- Stowers Institute for Medical Research, Kansas City, MO, USA
| | - Britta S Meyer
- Research Unit for Evolutionary Immunogenomics, Department of Biology, University of Hamburg, Hamburg, Germany
| | - Karen H Miga
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA, USA
| | - Saswat K Mohanty
- Department of Biology, Penn State University, University Park, PA, USA
| | - Katherine M Munson
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
| | - Karol Pal
- Department of Biology, Penn State University, University Park, PA, USA
| | - Matt Pennell
- Department of Computational Biology, Cornell University, Ithaca, NY, USA
| | - Pavel A Pevzner
- Department of Computer Science and Engineering, University of California, San Diego, San Diego, CA, USA
| | - David Porubsky
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
| | - Tamara Potapova
- Stowers Institute for Medical Research, Kansas City, MO, USA
| | - Francisca R Ringeling
- Faculty of Informatics and Data Science, University of Regensburg, Regensburg, Germany
| | - Joana L Rocha
- Department of Integrative Biology, University of California, Berkeley, Berkeley, CA, USA
| | | | - Samuel Sacco
- Department of Biomolecular Engineering, University of California, Santa Cruz, Santa Cruz, CA, USA
| | - Swati Saha
- Department of Biochemistry and Molecular Genetics, School of Medicine, University of Louisville, Louisville, KY, USA
| | - Takayo Sasaki
- San Diego Biomedical Research Institute, San Diego, CA, USA
| | - Michael C Schatz
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
| | - Nicholas J Schork
- The Translational Genomics Research Institute, City of Hope National Medical Center, Phoenix, AZ, USA
| | - Cole Shanks
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA, USA
| | - Linnéa Smeds
- Department of Biology, Penn State University, University Park, PA, USA
| | - Dongmin R Son
- Department of Ecology, Evolution and Marine Biology, Neuroscience Research Institute, University of California, Santa Barbara, Santa Barbara, CA, USA
| | | | - Alexander P Sweeten
- Genome Informatics Section, Center for Genomics and Data Science Research, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | - Michael G Tassia
- Department of Biology, Johns Hopkins University, Baltimore, MD, USA
| | - Françoise Thibaud-Nissen
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
| | | | - Mihir Trivedi
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
| | - Wenjie Wei
- School of Life Sciences, Westlake University, Hangzhou, China
- National Laboratory of Crop Genetic Improvement, Huazhong Agricultural University, Wuhan, China
| | - Julie Wertz
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
| | - Muyu Yang
- Ray and Stephanie Lane Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA
| | - Panpan Zhang
- Department of Molecular Biology and Genetics, Cornell University, Ithaca, NY, USA
| | - Shilong Zhang
- Bio-X Institutes, Key Laboratory for the Genetics of Developmental and Neuropsychiatric Disorders, Ministry of Education, Shanghai Jiao Tong University, Shanghai, China
| | - Yang Zhang
- Ray and Stephanie Lane Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA
| | - Zhenmiao Zhang
- Department of Computer Science and Engineering, University of California, San Diego, San Diego, CA, USA
| | - Sarah A Zhao
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, USA
| | - Yixin Zhu
- Department of Computational Biology, Cornell University, Ithaca, NY, USA
| | - Erich D Jarvis
- Laboratory of Neurogenetics of Language, The Rockefeller University, New York, NY, USA
- Howard Hughes Medical Institute, Chevy Chase, MD, USA
| | | | - Iker Rivas-González
- Department of Primate Behavior and Evolution, Max Planck Institute for Evolutionary Anthropology, Leipzig, Germany
| | - Benedict Paten
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA, USA
| | - Zachary A Szpiech
- Department of Biology, Penn State University, University Park, PA, USA
| | - Christian D Huber
- Department of Biology, Penn State University, University Park, PA, USA
| | - Tobias L Lenz
- Research Unit for Evolutionary Immunogenomics, Department of Biology, University of Hamburg, Hamburg, Germany
| | - Miriam K Konkel
- Department of Genetics and Biochemistry, Clemson University, Clemson, SC, USA
- Center for Human Genetics, Clemson University, Greenwood, SC, USA
| | - Soojin V Yi
- Department of Ecology, Evolution and Marine Biology, Neuroscience Research Institute, University of California, Santa Barbara, Santa Barbara, CA, USA
- Department of Molecular, Cellular and Developmental Biology, Neuroscience Research Institute, University of California, Santa Barbara, Santa Barbara, CA, USA
| | - Stefan Canzar
- Faculty of Informatics and Data Science, University of Regensburg, Regensburg, Germany
| | - Corey T Watson
- Department of Biochemistry and Molecular Genetics, School of Medicine, University of Louisville, Louisville, KY, USA
| | - Peter H Sudmant
- Department of Integrative Biology, University of California, Berkeley, Berkeley, CA, USA
- Center for Computational Biology, University of California, Berkeley, Berkeley, CA, USA
| | - Erin Molloy
- Department of Computer Science, University of Maryland, College Park, MD, USA
| | - Erik Garrison
- Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
| | - Craig B Lowe
- Department of Molecular Genetics and Microbiology, Duke University Medical Center, Durham, NC, USA
| | - Mario Ventura
- Department of Biosciences, Biotechnology and Environment, University of Bari, Bari, Italy
| | - Rachel J O'Neill
- Institute for Systems Genomics, University of Connecticut, Storrs, CT, USA
- Department of Genetics and Genome Sciences, University of Connecticut Health Center, Farmington, CT, USA
- Department of Molecular and Cell Biology, University of Connecticut, Storrs, CT, USA
| | - Sergey Koren
- Genome Informatics Section, Center for Genomics and Data Science Research, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | - Kateryna D Makova
- Department of Biology, Penn State University, University Park, PA, USA
| | - Adam M Phillippy
- Genome Informatics Section, Center for Genomics and Data Science Research, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | - Evan E Eichler
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
- Howard Hughes Medical Institute, Chevy Chase, MD, USA
|
2
|
Groot Koerkamp R, Liu D, Pibiri GE. The open-closed mod-minimizer algorithm. Algorithms Mol Biol 2025; 20:4. [PMID: 40098006 PMCID: PMC11912762 DOI: 10.1186/s13015-025-00270-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/31/2024] [Accepted: 01/28/2025] [Indexed: 03/19/2025] [Open Access]
Abstract
Sampling algorithms that deterministically select a subset of k-mers are an important building block in bioinformatics applications. For example, they are used to index large textual collections, like DNA, and to compare sequences quickly. In such applications, a sampling algorithm is required to select one k-mer out of every window of w consecutive k-mers. The folklore and most used scheme is the random minimizer, which selects the smallest k-mer in the window according to some random order. This scheme is remarkably simple and versatile, and has a density (expected fraction of selected k-mers) of 2/(w+1). In practice, lower density leads to faster methods and smaller indexes, and it turns out that the random minimizer is not the best one can do. Indeed, some schemes are known to approach the optimal density 1/w when k → ∞, like the recently introduced mod-minimizer (Groot Koerkamp and Pibiri, WABI 2024). In this work, we study methods that achieve low density when k ≤ w. In this small-k regime, a practical method with provably better density than the random minimizer is the miniception (Zheng et al., Bioinformatics 2021). This method can be elegantly described as sampling the smallest closed syncmer (Edgar, PeerJ 2021) in the window according to some random order. We show that extending the miniception to prefer sampling open syncmers yields much better density. This new method, the open-closed minimizer, offers improved density for small k ≤ w while being as fast to compute as the random minimizer. Compared to methods based on decycling sets, which achieve very low density in the small-k regime, our method has comparable density while being computationally simpler and intuitive. Furthermore, we extend the mod-minimizer to improve density of any scheme that works well for small k to also work well when k > w is large. We hence obtain the open-closed mod-minimizer, a practical method that improves over the mod-minimizer for all k.
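To make the sampling idea concrete, the following is a minimal Python sketch, not the authors' implementation. It samples one k-mer per window with a random minimizer and with a scheme that prefers open syncmers, then closed syncmers, then any k-mer, as the abstract describes. Several details are assumptions made for illustration: the hash-based `order` function stands in for a random order, the parameter `s` (the s-mer length) is hypothetical, an open syncmer is taken to be a k-mer whose smallest s-mer sits at the middle offset, and a closed syncmer is one whose smallest s-mer sits at either end.

```python
import hashlib
import random


def order(text: str) -> int:
    """Reproducible pseudo-random order on strings (stand-in for a random order)."""
    return int.from_bytes(hashlib.blake2b(text.encode(), digest_size=8).digest(), "big")


def syncmer_class(kmer: str, s: int) -> int:
    """Return 0 for an open syncmer, 1 for a closed syncmer, 2 otherwise.

    Assumed definitions: open = smallest s-mer at the middle offset,
    closed = smallest s-mer at the first or last offset.
    """
    last = len(kmer) - s
    smer_orders = [order(kmer[i:i + s]) for i in range(last + 1)]
    pos = smer_orders.index(min(smer_orders))  # leftmost smallest s-mer
    if pos == last // 2:
        return 0
    if pos in (0, last):
        return 1
    return 2


def sample(seq: str, k: int, w: int, rank) -> set:
    """Sample the leftmost smallest k-mer (by `rank`) in every window of w k-mers."""
    n_kmers = len(seq) - k + 1
    ranks = [rank(seq[i:i + k]) for i in range(n_kmers)]
    sampled = set()
    for start in range(n_kmers - w + 1):
        # min() returns the first minimal offset, so ties go to the leftmost k-mer.
        offset = min(range(w), key=lambda j: ranks[start + j])
        sampled.add(start + offset)
    return sampled


def random_minimizer(seq: str, k: int, w: int) -> set:
    return sample(seq, k, w, rank=order)


def open_closed_minimizer(seq: str, k: int, w: int, s: int) -> set:
    # Tuples compare by syncmer class first, then by the pseudo-random order.
    return sample(seq, k, w, rank=lambda kmer: (syncmer_class(kmer, s), order(kmer)))


def density(sampled: set, seq: str, k: int) -> float:
    """Fraction of k-mer positions that were sampled."""
    return len(sampled) / (len(seq) - k + 1)


if __name__ == "__main__":
    rng = random.Random(0)
    seq = "".join(rng.choice("ACGT") for _ in range(5_000))
    k, w, s = 8, 12, 4  # hypothetical parameters in the k <= w regime
    print("2/(w+1)              :", 2 / (w + 1))
    print("random minimizer     :", density(random_minimizer(seq, k, w), seq, k))
    print("open-closed minimizer:", density(open_closed_minimizer(seq, k, w, s), seq, k))
```

On a long random sequence the measured random-minimizer density should land close to 2/(w+1); the open-closed figure depends on the assumed syncmer definitions and on the choice of `s`, so it is best read as an experiment rather than a reproduction of the paper's numbers.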
Affiliation(s)
| | - Daniel Liu
- University of California, Los Angeles, California, USA
|
3
|
Santoro DF, Marconi G, Capomaccio S, Bocchini M, Anderson AW, Finotti A, Confalonieri M, Albertini E, Rosellini D. Polyploidization-driven transcriptomic dynamics in Medicago sativa neotetraploids: mRNA, smRNA and allele-specific gene expression. BMC Plant Biology 2025; 25:108. [PMID: 39856624 PMCID: PMC11763150 DOI: 10.1186/s12870-025-06090-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/05/2024] [Accepted: 01/09/2025] [Indexed: 01/27/2025]
Abstract
Whole genome duplication (WGD) is a powerful evolutionary mechanism in plants. Autopolyploids have been comparatively less studied than allopolyploids, with sexual autopolyploidization receiving even less attention. In this work, we studied the transcriptomes of neotetraploids (2n = 4x = 32) obtained by crossing two diploid (2n = 2x = 16) plants of Medicago sativa that produce a significant percentage of either 2n eggs or pollen. Diploid progeny from the same cross allowed us to separate the transcriptional outcomes of hybridization from those of WGD. This material can help to elucidate events at the base of the domestication of cultivated 4x alfalfa, the world's most important leguminous forage. Three 2x and three 4x progeny plants and 2x parental plants were used for this study. The RNA-seq data revealed that WGD did not dramatically affect the transcription of leaf protein-coding genes. The two parental genotypes did not contribute equally to the progeny transcriptomes, and genome-wide expression level dominance of the male parent was observed. A large majority of the genes whose expression level changed due to WGD presented increased expression, indicating that the 4x state requires the upregulation of approximately 2.66% of the protein-coding genes. Overall, we estimated that 3.63% of the protein-coding genes were transcriptionally affected by WGD and may contribute to the phenotypic novelty of the neotetraploid plants. Pathway analysis suggested that WGD could affect secondary metabolite biosynthesis, which in turn may influence forage quality. We found four times as many transcription factor genes among the polyploidization-affected genes as among those affected only by hybridization. Several of these belong to classes involved in stress response. Small RNA-seq revealed that very few miRNAs were significantly associated with WGD, but they target several hundred genes, and their role in the WGD response may be relevant. Integrated network analysis led to the identification of putative miRNA:mRNA interactions potentially involved in transcriptome reprogramming. Allele-specific expression analysis indicated that parent-of-origin bias was not a significant outcome of WGD, but we found that parentally biased RNA editing may be a significant source of variation in neopolyploids.
Affiliation(s)
- D F Santoro
- Department of Agricultural, Food and Environmental Sciences, University of Perugia, via Borgo XX giugno 74, Perugia, 06121, Italy
| | - G Marconi
- Department of Agricultural, Food and Environmental Sciences, University of Perugia, via Borgo XX giugno 74, Perugia, 06121, Italy
- Interuniversity Consortium for Biotechnology (CIB), Area Science Park, Padriciano 99, Trieste, 34149, Italy
| | - S Capomaccio
- Interuniversity Consortium for Biotechnology (CIB), Area Science Park, Padriciano 99, Trieste, 34149, Italy
- Department of Veterinary Medicine, University of Perugia, via S. Costanzo 4, Perugia, 06126, Italy
| | - M Bocchini
- Department of Agricultural, Food and Environmental Sciences, University of Perugia, via Borgo XX giugno 74, Perugia, 06121, Italy
| | - A W Anderson
- Department of Agricultural, Food and Environmental Sciences, University of Perugia, via Borgo XX giugno 74, Perugia, 06121, Italy
| | - A Finotti
- Interuniversity Consortium for Biotechnology (CIB), Area Science Park, Padriciano 99, Trieste, 34149, Italy
- Department of Life Sciences and Biotechnology, Section of Biochemistry and Molecular Biology, University of Ferrara, via Fossato di Mortara 74, Ferrara, 44121, Italy
| | - M Confalonieri
- CREA Research Centre for Animal Production and Aquaculture (CREA-ZA), Viale Piacenza 29, Lodi, 26900, Italy
| | - E Albertini
- Department of Agricultural, Food and Environmental Sciences, University of Perugia, via Borgo XX giugno 74, Perugia, 06121, Italy
- Interuniversity Consortium for Biotechnology (CIB), Area Science Park, Padriciano 99, Trieste, 34149, Italy
| | - D Rosellini
- Department of Agricultural, Food and Environmental Sciences, University of Perugia, via Borgo XX giugno 74, Perugia, 06121, Italy
- Interuniversity Consortium for Biotechnology (CIB), Area Science Park, Padriciano 99, Trieste, 34149, Italy
|
4
|
Janssen A, Gibson P, Bravo A, de Bakker V, Slager J, Veening JW. PneumoBrowse 2: an integrated visual platform for curated genome annotation and multiomics data analysis of Streptococcus pneumoniae. Nucleic Acids Res 2025; 53:D839-D851. [PMID: 39436044 PMCID: PMC11701578 DOI: 10.1093/nar/gkae923] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/06/2024] [Revised: 09/30/2024] [Accepted: 10/04/2024] [Indexed: 10/23/2024] [Open Access]
Abstract
Streptococcus pneumoniae is an opportunistic human pathogen responsible for high morbidity and mortality rates. Extensive genome sequencing revealed its large pangenome, serotype diversity, and provided insight into genome dynamics. However, functional genome analysis has lagged behind, as that requires detailed and time-consuming manual curation of genome annotations and integration of genomic and phenotypic data. To remedy this, PneumoBrowse was presented in 2018, a user-friendly interactive online platform, which provided the detailed annotation of the S. pneumoniae D39V genome, alongside transcriptomic data. Since 2018, many new studies on S. pneumoniae genome biology and protein functioning have been performed. Here, we present PneumoBrowse 2 (https://veeninglab.com/pneumobrowse), fully rebuilt in JBrowse 2. We updated annotations for transcribed and transcriptional regulatory features in the D39V genome. We added genome-wide data tracks for high-resolution chromosome conformation capture (Hi-C) data, chromatin immunoprecipitation coupled to high-throughput sequencing (ChIP-Seq), ribosome profiling, CRISPRi-seq gene essentiality data and more. Additionally, we included 18 phylogenetically diverse S. pneumoniae genomes and their annotations. By providing easy access to diverse high-quality genome annotations and links to other databases (including UniProt and AlphaFold), PneumoBrowse 2 will further accelerate research and development into preventive and treatment strategies, through increased understanding of the pneumococcal genome.
Affiliation(s)
- Axel B Janssen
- Department of Fundamental Microbiology, Faculty of Biology and Medicine, University of Lausanne, Biophore Building, 1015, Lausanne, Switzerland
| | - Paddy S Gibson
- Department of Fundamental Microbiology, Faculty of Biology and Medicine, University of Lausanne, Biophore Building, 1015, Lausanne, Switzerland
| | - Afonso M Bravo
- Department of Fundamental Microbiology, Faculty of Biology and Medicine, University of Lausanne, Biophore Building, 1015, Lausanne, Switzerland
| | - Vincent de Bakker
- Department of Fundamental Microbiology, Faculty of Biology and Medicine, University of Lausanne, Biophore Building, 1015, Lausanne, Switzerland
| | - Jelle Slager
- Department of Genetics, University of Groningen, University Medical Center Groningen, 9713 GZ, Groningen, the Netherlands
| | - Jan-Willem Veening
- Department of Fundamental Microbiology, Faculty of Biology and Medicine, University of Lausanne, Biophore Building, 1015, Lausanne, Switzerland
|
5
|
Kille B, Groot Koerkamp R, McAdams D, Liu A, Treangen TJ. A near-tight lower bound on the density of forward sampling schemes. Bioinformatics 2024; 41:btae736. [PMID: 39666942 PMCID: PMC11676336 DOI: 10.1093/bioinformatics/btae736] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/09/2024] [Revised: 11/16/2024] [Accepted: 12/10/2024] [Indexed: 12/14/2024] [Open Access]
Abstract
Motivation: Sampling k-mers is a ubiquitous task in sequence analysis algorithms. Sampling schemes such as the often-used random minimizer scheme are particularly appealing as they guarantee at least one k-mer is selected out of every w consecutive k-mers. Sampling fewer k-mers often leads to an increase in efficiency of downstream methods. Thus, developing schemes that have low density, i.e. have a small proportion of sampled k-mers, is an active area of research. After over a decade of consistent efforts in both decreasing the density of practical schemes and increasing the lower bound on the best possible density, there is still a large gap between the two. Results: We prove a near-tight lower bound on the density of forward sampling schemes, a class of schemes that generalizes minimizer schemes. For small w and k, we observe that our bound is tight when k ≡ 1 (mod w). For large w and k, the bound can be approximated by $\frac{1}{w+k}\left\lceil\frac{w+k}{w}\right\rceil$. Importantly, our lower bound implies that existing schemes are much closer to achieving optimal density than previously known. For example, with the current default minimap2 HiFi settings w = 19 and k = 19, we show that the best known scheme for these parameters, the double decycling-set-based minimizer of Pellow et al., is at most 3% denser than optimal, compared to the previous gap of at most 50%. Furthermore, when k ≡ 1 (mod w) and the alphabet size σ goes to ∞, we show that mod-minimizers introduced by Groot Koerkamp and Pibiri achieve optimal density matching our lower bound. Availability and implementation: Minimizer implementations: github.com/RagnarGrootKoerkamp/minimizers. ILP and analysis: github.com/treangenlab/sampling-scheme-analysis.
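As a small worked example, the approximation quoted in the abstract for large w and k, ⌈(w+k)/w⌉ / (w+k), can be evaluated with exact fractions and set beside the random-minimizer density 2/(w+1) and the trivial bound 1/w. This is only a sketch of that one approximate expression; the function names are hypothetical, and it does not reproduce the exact bound proved in the paper.

```python
from fractions import Fraction


def approx_lower_bound(w: int, k: int) -> Fraction:
    """Approximation from the abstract: ceil((w + k) / w) / (w + k)."""
    ceil_ratio = (w + k + w - 1) // w  # integer ceiling of (w + k) / w
    return Fraction(ceil_ratio, w + k)


def random_minimizer_density(w: int) -> Fraction:
    """Density of the random minimizer for sufficiently large k: 2 / (w + 1)."""
    return Fraction(2, w + 1)


if __name__ == "__main__":
    for w, k in [(19, 19), (19, 20), (10, 31)]:
        print(
            f"w={w} k={k}",
            "approx bound =", approx_lower_bound(w, k),
            "1/w =", Fraction(1, w),
            "random minimizer =", random_minimizer_density(w),
        )
```

For w = 19 and k = 19 the expression evaluates to 2/38 = 1/19, while for k = 20 (where k ≡ 1 mod w) it rises to 3/39 = 1/13, which illustrates how the ceiling term makes the approximation jump at that residue.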
Affiliation(s)
- Bryce Kille
  - Department of Computer Science, Rice University, Houston, TX 77005, United States
- Drake McAdams
  - Department of Computer Science, Rice University, Houston, TX 77005, United States
- Alan Liu
  - Department of Computer Science, Rice University, Houston, TX 77005, United States
- Todd J Treangen
  - Department of Computer Science, Rice University, Houston, TX 77005, United States
  - Ken Kennedy Institute, Rice University, Houston, TX 77005, United States
6. Kille B, Koerkamp RG, McAdams D, Liu A, Treangen TJ. A near-tight lower bound on the density of forward sampling schemes. bioRxiv: The Preprint Server for Biology 2024:2024.09.06.611668. [PMID: 39605515 PMCID: PMC11601301 DOI: 10.1101/2024.09.06.611668] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/29/2024]
Abstract
Motivation Sampling k-mers is a ubiquitous task in sequence analysis algorithms. Sampling schemes such as the often-used random minimizer scheme are particularly appealing as they guarantee at least one k-mer is selected out of every w consecutive k-mers. Sampling fewer k-mers often leads to an increase in efficiency of downstream methods. Thus, developing schemes that have low density, i.e., have a small proportion of sampled k-mers, is an active area of research. After over a decade of consistent efforts in both decreasing the density of practical schemes and increasing the lower bound on the best possible density, there is still a large gap between the two. Results We prove a near-tight lower bound on the density of forward sampling schemes, a class of schemes that generalizes minimizer schemes. For small w and k, we observe that our bound is tight when k ≡ 1 (mod w). For large w and k, the bound can be approximated by $\frac{1}{w+k}\left\lceil\frac{w+k}{w}\right\rceil$. Importantly, our lower bound implies that existing schemes are much closer to achieving optimal density than previously known. For example, with the current default minimap2 HiFi settings w = 19 and k = 19, we show that the best known scheme for these parameters, the double decycling-set-based minimizer of Pellow et al., is at most 3% denser than optimal, compared to the previous gap of at most 50%. Furthermore, when k ≡ 1 (mod w) and the alphabet size σ goes to ∞, we show that mod-minimizers introduced by Groot Koerkamp and Pibiri achieve optimal density matching our lower bound.
Affiliation(s)
- Bryce Kille
  - Department of Computer Science, Rice University, Houston, TX, USA
- Drake McAdams
  - Department of Computer Science, Rice University, Houston, TX, USA
- Alan Liu
  - Department of Computer Science, Rice University, Houston, TX, USA
- Todd J. Treangen
  - Department of Computer Science, Rice University, Houston, TX, USA
  - Ken Kennedy Institute, Rice University, Houston, TX, USA
7. Marçais G, Elder CS, Kingsford C. k-nonical space: sketching with reverse complements. Bioinformatics 2024; 40:btae629. [PMID: 39432565 PMCID: PMC11549021 DOI: 10.1093/bioinformatics/btae629] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/21/2024] [Revised: 10/01/2024] [Accepted: 10/17/2024] [Indexed: 10/23/2024]
Abstract
MOTIVATION Sequences equivalent to their reverse complements (i.e. double-stranded DNA) have no analogue in text analysis and non-biological string algorithms. Despite this striking difference, algorithms designed for computational biology (e.g. sketching algorithms) are designed and tested in the same way as classical string algorithms. Then, as a post-processing step, these algorithms are adapted to work with genomic sequences by folding a k-mer and its reverse complement into a single sequence: The canonical representation (k-nonical space). RESULTS The effect of using the canonical representation with sketching methods is understudied and not understood. As a first step, we use context-free sketching methods to illustrate the potentially detrimental effects of using canonical k-mers with string algorithms not designed to accommodate for them. In particular, we show that large stretches of the genome ("sketching deserts") are undersampled or entirely skipped by context-free sketching methods, effectively making these genomic regions invisible to subsequent algorithms using these sketches. We provide empirical data showing these effects and develop a theoretical framework explaining the appearance of sketching deserts. Finally, we propose two schemes to accommodate for these effects: (i) a new procedure that adapts existing sketching methods to k-nonical space and (ii) an optimization procedure to directly design new sketching methods for k-nonical space. AVAILABILITY AND IMPLEMENTATION The code used in this analysis is available under a permissive license at https://github.com/Kingsford-Group/mdsscope.
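The "folding" step described in this abstract, where a k-mer and its reverse complement are mapped to one canonical representative, can be sketched as follows. This is a minimal illustration of the usual lexicographic-minimum convention, not code from the mdsscope repository. The function names are hypothetical and the input is assumed to contain only A, C, G and T.

```python
# Translation table for complementing DNA bases (assumes upper-case ACGT only).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical(kmer: str) -> str:
    """Fold a k-mer and its reverse complement into one representative.

    Convention assumed here: the lexicographically smaller of the two.
    """
    return min(kmer, reverse_complement(kmer))


def canonical_kmers(seq: str, k: int) -> set[str]:
    """Set of canonical k-mers occurring in seq."""
    return {canonical(seq[i:i + k]) for i in range(len(seq) - k + 1)}


if __name__ == "__main__":
    read = "ACGTTGCAAGGCTA"
    # Both strands of the same molecule produce the same canonical k-mer set.
    print(canonical_kmers(read, 5) == canonical_kmers(reverse_complement(read), 5))
```

Because both strands collapse to the same set, any selection rule applied afterwards effectively operates on this folded space, which is the setting whose side effects the abstract examines.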
Affiliation(s)
- Guillaume Marçais
  - Ray and Stephanie Lane Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA 15213, United States
- C S Elder
  - Ray and Stephanie Lane Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA 15213, United States
- Carl Kingsford
  - Ray and Stephanie Lane Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA 15213, United States
8. Ndiaye M, Prieto-Baños S, Fitzgerald LM, Yazdizadeh Kharrazi A, Oreshkov S, Dessimoz C, Sedlazeck FJ, Glover N, Majidian S. When less is more: sketching with minimizers in genomics. Genome Biol 2024; 25:270. [PMID: 39402664 PMCID: PMC11472564 DOI: 10.1186/s13059-024-03414-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/15/2023] [Accepted: 10/01/2024] [Indexed: 10/19/2024]
Abstract
The exponential increase in sequencing data calls for conceptual and computational advances to extract useful biological insights. One such advance, minimizers, allows for reducing the quantity of data handled while maintaining some of its key properties. We provide a basic introduction to minimizers, cover recent methodological developments, and review the diverse applications of minimizers to analyze genomic data, including de novo genome assembly, metagenomics, read alignment, read correction, and pangenomes. We also touch on alternative data sketching techniques including universal hitting sets, syncmers, or strobemers. Minimizers and their alternatives have rapidly become indispensable tools for handling vast amounts of data.
Affiliation(s)
- Malick Ndiaye
  - Department of Fundamental Microbiology, UNIL, Lausanne, Switzerland
- Silvia Prieto-Baños
  - Department of Computational Biology, UNIL, Lausanne, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Sergey Oreshkov
  - Department of Endocrinology, Diabetology, Metabolism, CHUV, Lausanne, Switzerland
- Christophe Dessimoz
  - Department of Computational Biology, UNIL, Lausanne, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Natasha Glover
  - Department of Computational Biology, UNIL, Lausanne, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Sina Majidian
  - Department of Computational Biology, UNIL, Lausanne, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
9. Yoo D, Rhie A, Hebbar P, Antonacci F, Logsdon GA, Solar SJ, Antipov D, Pickett BD, Safonova Y, Montinaro F, Luo Y, Malukiewicz J, Storer JM, Lin J, Sequeira AN, Mangan RJ, Hickey G, Anez GM, Balachandran P, Bankevich A, Beck CR, Biddanda A, Borchers M, Bouffard GG, Brannan E, Brooks SY, Carbone L, Carrel L, Chan AP, Crawford J, Diekhans M, Engelbrecht E, Feschotte C, Formenti G, Garcia GH, de Gennaro L, Gilbert D, Green RE, Guarracino A, Gupta I, Haddad D, Han J, Harris RS, Hartley GA, Harvey WT, Hiller M, Hoekzema K, Houck ML, Jeong H, Kamali K, Kellis M, Kille B, Lee C, Lee Y, Lees W, Lewis AP, Li Q, Loftus M, Loh YHE, Loucks H, Ma J, Mao Y, Martinez JFI, Masterson P, McCoy RC, McGrath B, McKinney S, Meyer BS, Miga KH, Mohanty SK, Munson KM, Pal K, Pennell M, Pevzner PA, Porubsky D, Potapova T, Ringeling FR, Roha JL, Ryder OA, Sacco S, Saha S, Sasaki T, Schatz MC, Schork NJ, Shanks C, Smeds L, Son DR, Steiner C, Sweeten AP, Tassia MG, Thibaud-Nissen F, Torres-González E, Trivedi M, Wei W, Wertz J, Yang M, Zhang P, Zhang S, Zhang Y, Zhang Z, Zhao SA, Zhu Y, Jarvis ED, Gerton JL, Rivas-González I, Paten B, Szpiech ZA, Huber CD, Lenz TL, Konkel MK, Yi SV, Canzar S, Watson CT, Sudmant PH, Molloy E, Garrison E, Lowe CB, Ventura M, O’Neill RJ, Koren S, Makova KD, Phillippy AM, Eichler EE. Complete sequencing of ape genomes. bioRxiv: The Preprint Server for Biology 2024:2024.07.31.605654. [PMID: 39131277 PMCID: PMC11312596 DOI: 10.1101/2024.07.31.605654] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/13/2024]
Abstract
We present haplotype-resolved reference genomes and comparative analyses of six ape species, namely: chimpanzee, bonobo, gorilla, Bornean orangutan, Sumatran orangutan, and siamang. We achieve chromosome-level contiguity with unparalleled sequence accuracy (<1 error in 500,000 base pairs), completely sequencing 215 gapless chromosomes telomere-to-telomere. We resolve challenging regions, such as the major histocompatibility complex and immunoglobulin loci, providing more in-depth evolutionary insights. Comparative analyses, including human, allow us to investigate the evolution and diversity of regions previously uncharacterized or incompletely studied without bias from mapping to the human reference. This includes newly minted gene families within lineage-specific segmental duplications, centromeric DNA, acrocentric chromosomes, and subterminal heterochromatin. This resource should serve as a definitive baseline for all future evolutionary studies of humans and our closest living ape relatives.
Affiliation(s)
- DongAhn Yoo
  - Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
- Arang Rhie
  - Genome Informatics Section, Center for Genomics and Data Science Research, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD 20892, USA
- Prajna Hebbar
  - UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, CA 95060, USA
- Francesca Antonacci
  - Department of Biosciences, Biotechnology and Environment, University of Bari, Bari, 70124, Italy
- Glennis A. Logsdon
  - Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
  - Department of Genetics, Epigenetics Institute, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19103, USA
- Steven J. Solar
  - Genome Informatics Section, Center for Genomics and Data Science Research, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD 20892, USA
- Dmitry Antipov
  - Genome Informatics Section, Center for Genomics and Data Science Research, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD 20892, USA
- Brandon D. Pickett
  - Genome Informatics Section, Center for Genomics and Data Science Research, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD 20892, USA
- Yana Safonova
  - Computer Science and Engineering Department, Huck Institutes of Life Sciences, Pennsylvania State University, State College, PA 16801, USA
- Francesco Montinaro
  - Department of Biosciences, Biotechnology and Environment, University of Bari, Bari, 70124, Italy
  - Institute of Genomics, University of Tartu, Tartu, Estonia
- Yanting Luo
  - Department of Molecular Genetics and Microbiology, Duke University Medical Center, Durham, NC 27710, USA
- Joanna Malukiewicz
  - Research Unit for Evolutionary Immunogenomics, Department of Biology, University of Hamburg, 20146 Hamburg, Germany
- Jessica M. Storer
  - Institute for Systems Genomics, University of Connecticut, Storrs, CT 06269, USA
- Jiadong Lin
  - Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
- Abigail N. Sequeira
  - Department of Biology, Penn State University, University Park, PA 16802, USA
- Riley J. Mangan
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
  - The Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
  - Genetics Training Program, Harvard Medical School, Boston, MA 02115, USA
- Glenn Hickey
  - UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, CA 95060, USA
- Anton Bankevich
  - Computer Science and Engineering Department, Huck Institutes of Life Sciences, Pennsylvania State University, State College, PA 16801, USA
- Christine R. Beck
  - Institute for Systems Genomics, University of Connecticut, Storrs, CT 06269, USA
  - The Jackson Laboratory for Genomic Medicine, Farmington, CT, USA
  - Department of Genetics and Genome Sciences, University of Connecticut Health Center, Farmington, CT, USA
- Arjun Biddanda
  - Department of Biology, Johns Hopkins University, Baltimore, MD 21218, USA
- Matthew Borchers
  - Stowers Institute for Medical Research, Kansas City, MO 64110, USA
- Gerard G. Bouffard
  - NIH Intramural Sequencing Center, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD 20892, USA
- Emry Brannan
  - Department of Molecular and Cell Biology, University of Connecticut, Storrs, CT, USA
- Shelise Y. Brooks
  - NIH Intramural Sequencing Center, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD 20892, USA
- Lucia Carbone
  - Department of Medicine, KCVI, Oregon Health Sciences University, Portland, OR, USA
  - Division of Genetics, Oregon National Primate Research Center, Beaverton, OR, USA
- Laura Carrel
  - PSU Medical School, Penn State University School of Medicine, Hershey, PA, USA
- Agnes P. Chan
  - The Translational Genomics Research Institute, a part of the City of Hope National Medical Center, Phoenix, AZ, USA
- Juyun Crawford
  - NIH Intramural Sequencing Center, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD 20892, USA
- Mark Diekhans
  - UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, CA 95060, USA
- Eric Engelbrecht
  - Department of Biochemistry and Molecular Genetics, School of Medicine, University of Louisville, Louisville, KY, USA
- Cedric Feschotte
  - Department of Molecular Biology and Genetics, Cornell University, Ithaca, NY 14853, USA
- Giulio Formenti
  - Vertebrate Genome Laboratory, The Rockefeller University, New York, NY 10021, USA
- Gage H. Garcia
  - Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
- Luciana de Gennaro
  - Department of Biosciences, Biotechnology and Environment, University of Bari, Bari, 70124, Italy
- David Gilbert
  - San Diego Biomedical Research Institute, San Diego, CA, USA
- Andrea Guarracino
  - Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN 38163, USA
- Ishaan Gupta
  - Department of Computer Science and Engineering, University of California San Diego, CA, USA
- Diana Haddad
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
- Junmin Han
  - Bio-X Institutes, Key Laboratory for the Genetics of Developmental and Neuropsychiatric Disorders, Ministry of Education, Shanghai Jiao Tong University, Shanghai, China
- Robert S. Harris
  - Department of Biology, Penn State University, University Park, PA 16802, USA
- Gabrielle A. Hartley
  - Institute for Systems Genomics, University of Connecticut, Storrs, CT 06269, USA
- William T. Harvey
  - Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
- Michael Hiller
  - LOEWE Centre for Translational Biodiversity Genomics, Senckenberg Research Institute, Goethe University, Frankfurt, Germany
- Kendra Hoekzema
  - Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
- Marlys L. Houck
  - San Diego Zoo Wildlife Alliance, Escondido, CA, 92027-7000, USA
- Hyeonsoo Jeong
  - Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
  - Howard Hughes Medical Institute, University of Washington, Seattle, WA, USA
- Kaivan Kamali
  - Department of Biology, Penn State University, University Park, PA 16802, USA
- Manolis Kellis
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
  - The Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
- Bryce Kille
  - Department of Computer Science, Rice University, Houston, TX 77005, USA
- Chul Lee
  - Laboratory of Neurogenetics of Language, The Rockefeller University, New York, NY, USA
- Youngho Lee
  - Laboratory of bioinformatics and population genetics, Interdisciplinary program in bioinformatics, Seoul National University, Republic of Korea
- William Lees
  - Department of Biochemistry and Molecular Genetics, School of Medicine, University of Louisville, Louisville, KY, USA
  - Bioengineering Program, Faculty of Engineering, Bar-Ilan University, Ramat Gan, Israel
- Alexandra P. Lewis
  - Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
- Qiuhui Li
  - Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218, USA
- Mark Loftus
  - Department of Genetics & Biochemistry, Clemson University, Clemson, SC, USA
  - Center for Human Genetics, Clemson University, Greenwood, SC, USA
- Yong Hwee Eddie Loh
  - Neuroscience Research Institute, University of California, Santa Barbara, CA, USA
- Hailey Loucks
  - UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, CA 95060, USA
- Jian Ma
  - Ray and Stephanie Lane Computational Biology Department, School of Computer Science, Carnegie Mellon University, PA, USA
- Yafei Mao
  - Bio-X Institutes, Key Laboratory for the Genetics of Developmental and Neuropsychiatric Disorders, Ministry of Education, Shanghai Jiao Tong University, Shanghai, China
  - Center for Genomic Research, International Institutes of Medicine, Fourth Affiliated Hospital, Zhejiang University, Yiwu, Zhejiang, China
  - Shanghai Jiao Tong University Chongqing Research Institute, Chongqing, China
- Juan F. I. Martinez
  - Computer Science and Engineering Department, Huck Institutes of Life Sciences, Pennsylvania State University, State College, PA 16801, USA
- Patrick Masterson
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
- Rajiv C. McCoy
  - Department of Biology, Johns Hopkins University, Baltimore, MD 21218, USA
- Barbara McGrath
  - Department of Biology, Penn State University, University Park, PA 16802, USA
- Sean McKinney
  - Stowers Institute for Medical Research, Kansas City, MO 64110, USA
- Britta S. Meyer
  - Research Unit for Evolutionary Immunogenomics, Department of Biology, University of Hamburg, 20146 Hamburg, Germany
- Karen H. Miga
  - UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, CA 95060, USA
- Saswat K. Mohanty
  - Department of Biology, Penn State University, University Park, PA 16802, USA
- Katherine M. Munson
  - Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
- Karol Pal
  - Department of Biology, Penn State University, University Park, PA 16802, USA
- Matt Pennell
  - Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA
- Pavel A. Pevzner
  - Department of Computer Science and Engineering, University of California San Diego, CA, USA
- David Porubsky
  - Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
- Tamara Potapova
  - Stowers Institute for Medical Research, Kansas City, MO 64110, USA
- Francisca R. Ringeling
  - Faculty of Informatics and Data Science, University of Regensburg, 93053 Regensburg, Germany
- Joana L. Roha
  - Department of Integrative Biology, University of California, Berkeley, Berkeley, USA
- Oliver A. Ryder
  - San Diego Zoo Wildlife Alliance, Escondido, CA, 92027-7000, USA
- Samuel Sacco
  - University of California Santa Cruz, Santa Cruz, CA, USA
- Swati Saha
  - Department of Biochemistry and Molecular Genetics, School of Medicine, University of Louisville, Louisville, KY, USA
- Takayo Sasaki
  - San Diego Biomedical Research Institute, San Diego, CA, USA
- Michael C. Schatz
  - Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218, USA
- Nicholas J. Schork
  - The Translational Genomics Research Institute, a part of the City of Hope National Medical Center, Phoenix, AZ, USA
- Cole Shanks
  - UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, CA 95060, USA
- Linnéa Smeds
  - Department of Biology, Penn State University, University Park, PA 16802, USA
- Dongmin R. Son
  - Department of Ecology, Evolution and Marine Biology, Neuroscience Research Institute, University of California, Santa Barbara, CA, USA
- Cynthia Steiner
  - San Diego Zoo Wildlife Alliance, Escondido, CA, 92027-7000, USA
- Alexander P. Sweeten
  - Genome Informatics Section, Center for Genomics and Data Science Research, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD 20892, USA
- Michael G. Tassia
  - Department of Biology, Johns Hopkins University, Baltimore, MD 21218, USA
- Françoise Thibaud-Nissen
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
- Mihir Trivedi
  - Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
  - Howard Hughes Medical Institute, University of Washington, Seattle, WA, USA
- Wenjie Wei
  - School of Life Sciences, Westlake University, Hangzhou 310024, China
  - National Laboratory of Crop Genetic Improvement, Huazhong Agricultural University, 430070, Wuhan, China
- Julie Wertz
  - Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
- Muyu Yang
  - Ray and Stephanie Lane Computational Biology Department, School of Computer Science, Carnegie Mellon University, PA, USA
- Panpan Zhang
  - Department of Molecular Biology and Genetics, Cornell University, Ithaca, NY 14853, USA
- Shilong Zhang
  - Bio-X Institutes, Key Laboratory for the Genetics of Developmental and Neuropsychiatric Disorders, Ministry of Education, Shanghai Jiao Tong University, Shanghai, China
- Yang Zhang
  - Ray and Stephanie Lane Computational Biology Department, School of Computer Science, Carnegie Mellon University, PA, USA
- Zhenmiao Zhang
  - Department of Computer Science and Engineering, University of California San Diego, CA, USA
- Sarah A. Zhao
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
- Yixin Zhu
  - Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA
- Erich D. Jarvis
  - Laboratory of Neurogenetics of Language, The Rockefeller University, New York, NY, USA
  - Howard Hughes Medical Institute, Chevy Chase, MD, USA
- Iker Rivas-González
  - Department of Primate Behavior and Evolution, Max Planck Institute for Evolutionary Anthropology, Leipzig, Germany
- Benedict Paten
  - UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, CA 95060, USA
- Zachary A. Szpiech
  - Department of Biology, Penn State University, University Park, PA 16802, USA
- Christian D. Huber
  - Department of Biology, Penn State University, University Park, PA 16802, USA
- Tobias L. Lenz
  - Research Unit for Evolutionary Immunogenomics, Department of Biology, University of Hamburg, 20146 Hamburg, Germany
- Miriam K. Konkel
  - Department of Genetics & Biochemistry, Clemson University, Clemson, SC, USA
  - Center for Human Genetics, Clemson University, Greenwood, SC, USA
- Soojin V. Yi
  - Department of Ecology, Evolution and Marine Biology, Department of Molecular, Cellular and Developmental Biology, Neuroscience Research Institute, University of California, Santa Barbara, CA, USA
- Stefan Canzar
  - Faculty of Informatics and Data Science, University of Regensburg, 93053 Regensburg, Germany
- Corey T. Watson
  - Department of Biochemistry and Molecular Genetics, School of Medicine, University of Louisville, Louisville, KY, USA
- Peter H. Sudmant
  - Department of Integrative Biology, University of California, Berkeley, Berkeley, USA
  - Center for Computational Biology, University of California, Berkeley, Berkeley, USA
- Erin Molloy
  - Department of Computer Science, University of Maryland, College Park, MD 20742, USA
- Erik Garrison
  - Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN 38163, USA
- Craig B. Lowe
  - Department of Molecular Genetics and Microbiology, Duke University Medical Center, Durham, NC 27710, USA
- Mario Ventura
  - Department of Biosciences, Biotechnology and Environment, University of Bari, Bari, 70124, Italy
- Rachel J. O’Neill
  - Institute for Systems Genomics, University of Connecticut, Storrs, CT 06269, USA
  - Department of Genetics and Genome Sciences, University of Connecticut Health Center, Farmington, CT, USA
  - Departments of Molecular and Cell Biology, UConn Storrs, CT, USA
- Sergey Koren
  - Genome Informatics Section, Center for Genomics and Data Science Research, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD 20892, USA
- Kateryna D. Makova
  - Department of Biology, Penn State University, University Park, PA 16802, USA
- Adam M. Phillippy
  - Genome Informatics Section, Center for Genomics and Data Science Research, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD 20892, USA
- Evan E. Eichler
  - Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
  - Howard Hughes Medical Institute, University of Washington, Seattle, WA, USA
10.
Abstract
Genome sequences largely determine the biology and encode the history of an organism, and de novo assembly - the process of reconstructing the genome sequence of an organism from sequencing reads - has been a central problem in bioinformatics for four decades. Until recently, genomes were typically assembled into fragments of a few megabases at best, but now technological advances in long-read sequencing enable the near-complete assembly of each chromosome - also known as telomere-to-telomere assembly - for many organisms. Here, we review recent progress on assembly algorithms and protocols, with a focus on how to derive near-telomere-to-telomere assemblies. We also discuss the additional developments that will be required to resolve remaining assembly gaps and to assemble non-diploid genomes.
Affiliation(s)
- Heng Li
  - Department of Data Science, Dana-Farber Cancer Institute, Boston, MA, USA
  - Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
- Richard Durbin
  - Department of Genetics, Cambridge University, Cambridge, UK
11. Sweeten AP, Schatz MC, Phillippy AM. ModDotPlot-rapid and interactive visualization of tandem repeats. Bioinformatics 2024; 40:btae493. [PMID: 39110522 PMCID: PMC11321072 DOI: 10.1093/bioinformatics/btae493] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/15/2024] [Revised: 07/02/2024] [Accepted: 08/05/2024] [Indexed: 08/15/2024]
Abstract
MOTIVATION A common method for analyzing genomic repeats is to produce a sequence similarity matrix visualized via a dot plot. Innovative approaches such as StainedGlass have improved upon this classic visualization by rendering dot plots as a heatmap of sequence identity, enabling researchers to better visualize multi-megabase tandem repeat arrays within centromeres and other heterochromatic regions of the genome. However, computing the similarity estimates for heatmaps requires high computational overhead and can suffer from decreasing accuracy. RESULTS In this work, we introduce ModDotPlot, an interactive and alignment-free dot plot viewer. By approximating average nucleotide identity via a k-mer-based containment index, ModDotPlot produces accurate plots orders of magnitude faster than StainedGlass. We accomplish this through the use of a hierarchical modimizer scheme that can visualize the full 128 Mb genome of Arabidopsis thaliana in under 5 min on a laptop. ModDotPlot is bundled with a graphical user interface supporting real-time interactive navigation of entire chromosomes. AVAILABILITY AND IMPLEMENTATION ModDotPlot is available at https://github.com/marbl/ModDotPlot.
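The idea of estimating identity from a k-mer containment index over a modimizer sketch can be illustrated with a short Python sketch. This is not ModDotPlot's implementation. The hash function, the function names, and the conversion `identity ≈ containment ** (1 / k)` (a common approximation under an independent-substitution model) are assumptions made for illustration, and the hierarchical aspect of the scheme is not modelled.

```python
import hashlib


def kmer_hash(kmer: str) -> int:
    """Deterministic 64-bit hash of a k-mer (illustrative choice of hash)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def modimizers(seq: str, k: int, s: int) -> set[str]:
    """Keep only k-mers whose hash is 0 modulo s (roughly a 1/s subsample)."""
    return {
        seq[i:i + k]
        for i in range(len(seq) - k + 1)
        if kmer_hash(seq[i:i + k]) % s == 0
    }


def containment(a: set[str], b: set[str]) -> float:
    """Fraction of sketch a that is also present in sketch b."""
    return len(a & b) / len(a) if a else 0.0


def identity_estimate(a: set[str], b: set[str], k: int) -> float:
    """Assumed conversion from containment to per-base identity."""
    return containment(a, b) ** (1.0 / k)


if __name__ == "__main__":
    x = "ACGTACGTTAGCCGATAGGCTTAACGGATCCATGCAAGT" * 20
    y = x[:400] + "T" + x[401:]  # same sequence with position 400 set to T
    sx, sy = modimizers(x, k=11, s=2), modimizers(y, k=11, s=2)
    print(f"estimated identity: {identity_estimate(sx, sy, 11):.3f}")
```

Subsampling by hash value means that the same k-mer is either kept in every sequence or dropped in every sequence, so sketches of different windows remain directly comparable.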
Affiliation(s)
- Alexander P Sweeten
  - Genome Informatics Section, Center for Genomics and Data Science Research, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD 20892, United States
  - Department of Computer Science, Johns Hopkins University, Baltimore, MD 21211, United States
- Michael C Schatz
  - Department of Computer Science, Johns Hopkins University, Baltimore, MD 21211, United States
- Adam M Phillippy
  - Genome Informatics Section, Center for Genomics and Data Science Research, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD 20892, United States
12. Marçais G, DeBlasio D, Kingsford C. Sketching Methods with Small Window Guarantee Using Minimum Decycling Sets. J Comput Biol 2024; 31:597-615. [PMID: 38980804 PMCID: PMC11304339 DOI: 10.1089/cmb.2024.0544] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/11/2024]
Abstract
Most sequence sketching methods work by selecting specific k-mers from sequences so that the similarity between two sequences can be estimated using only the sketches. Because estimating sequence similarity is much faster using sketches than using sequence alignment, sketching methods are used to reduce the computational requirements of computational biology software. Applications using sketches often rely on properties of the k-mer selection procedure to ensure that using a sketch does not degrade the quality of the results compared with using sequence alignment. Two important examples of such properties are locality and window guarantees, the latter of which ensures that no long region of the sequence goes unrepresented in the sketch. A sketching method with a window guarantee, implicitly or explicitly, corresponds to a decycling set of the de Bruijn graph, which is a set of unavoidable k-mers. Any long enough sequence, by definition, must contain a k-mer from any decycling set (hence, the unavoidable property). Conversely, a decycling set also defines a sketching method by choosing the k-mers from the set as representatives. Although current methods use one of a small number of sketching method families, the space of decycling sets is much larger and largely unexplored. Finding decycling sets with desirable characteristics (e.g., small remaining path length) is a promising approach to discovering new sketching methods with improved performance (e.g., with small window guarantee). The Minimum Decycling Sets (MDSs) are of particular interest because of their minimum size. Only two algorithms, by Mykkeltveit and Champarnaud, are previously known to generate two particular MDSs, although there are typically a vast number of alternative MDSs. We provide a simple method to enumerate MDSs. This method allows one to explore the space of MDSs and to find MDSs optimized for desirable properties. 
We give evidence that the Mykkeltveit sets are close to optimal regarding one particular property, the remaining path length. A number of conjectures and computational and theoretical evidence to support them are presented. Code available at https://github.com/Kingsford-Group/mdsscope.
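The decycling-set property described in the abstract above can be checked by brute force for small k. The following is a minimal Python sketch, not the authors' mdsscope code: the function name is hypothetical, the alphabet defaults to binary, and the remaining path length is counted in k-mers, which is an assumed convention. It deletes a candidate set from the order-k de Bruijn graph, uses Kahn's topological sort to detect any surviving cycle, and otherwise reports the longest remaining path.

```python
from itertools import product


def remaining_path_length(removed, k, alphabet="01"):
    """Longest path (in k-mers) left after deleting `removed` from the
    order-k de Bruijn graph, or None if a cycle survives, i.e. `removed`
    is not a decycling set."""
    nodes = {"".join(p) for p in product(alphabet, repeat=k)} - set(removed)
    # Edge u -> v when v is u shifted left by one symbol with a new last symbol.
    succ = {u: [u[1:] + c for c in alphabet if u[1:] + c in nodes] for u in nodes}
    indeg = dict.fromkeys(nodes, 0)
    for u in nodes:
        for v in succ[u]:
            indeg[v] += 1

    ready = [u for u in nodes if indeg[u] == 0]
    longest = dict.fromkeys(nodes, 1)  # longest path ending at each k-mer
    processed = 0
    while ready:
        u = ready.pop()
        processed += 1
        for v in succ[u]:
            longest[v] = max(longest[v], longest[u] + 1)
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)

    if processed < len(nodes):
        return None  # some k-mers were never freed: they lie on or behind a cycle
    return max(longest.values(), default=0)
```

For binary 2-mers, removing `{"00", "11"}` leaves the cycle 01 -> 10 -> 01, so the function returns `None`, whereas removing `{"00", "11", "01"}` leaves only the single k-mer `10`. Enumerating many candidate sets this way is feasible only for very small k, since the graph has |alphabet|^k nodes.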
Affiliation(s)
- Guillaume Marçais
  - Ray and Stephanie Lane Computational Biology Department, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA
- Dan DeBlasio
  - Ray and Stephanie Lane Computational Biology Department, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA
- Carl Kingsford
  - Ray and Stephanie Lane Computational Biology Department, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA
|
13
|
Sweeten AP, Schatz MC, Phillippy AM. ModDotPlot-Rapid and interactive visualization of complex repeats. bioRxiv: the preprint server for biology 2024:2024.04.15.589623. [PMID: 38712106 PMCID: PMC11071298 DOI: 10.1101/2024.04.15.589623] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/08/2024]
Abstract
Motivation: A common method for analyzing genomic repeats is to produce a sequence similarity matrix visualized via a dot plot. Innovative approaches such as StainedGlass have improved upon this classic visualization by rendering dot plots as a heatmap of sequence identity, enabling researchers to better visualize multi-megabase tandem repeat arrays within centromeres and other heterochromatic regions of the genome. However, computing the similarity estimates for heatmaps requires high computational overhead and can suffer from decreasing accuracy. Results: In this work we introduce ModDotPlot, an interactive and alignment-free dot plot viewer. By approximating average nucleotide identity via a k-mer-based containment index, ModDotPlot produces accurate plots orders of magnitude faster than StainedGlass. We accomplish this through the use of a hierarchical modimizer scheme that can visualize the full 128 Mbp genome of Arabidopsis thaliana in under 5 minutes on a laptop. ModDotPlot is bundled with a graphical user interface supporting real-time interactive navigation of entire chromosomes. Availability and implementation: ModDotPlot is available at https://github.com/marbl/ModDotPlot.
Affiliation(s)
- Alexander P Sweeten
  - Department of Computer Science, Johns Hopkins University, Baltimore, MD 21211, USA
  - Genome Informatics Section, Center for Genomics and Data Science Research, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD 20892, USA
- Michael C Schatz
  - Department of Computer Science, Johns Hopkins University, Baltimore, MD 21211, USA
- Adam M Phillippy
  - Genome Informatics Section, Center for Genomics and Data Science Research, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD 20892, USA
|