1
Neale DB, Zimin AV, Meltzer A, Bhattarai A, Amee M, Figueroa Corona L, Allen BJ, Puiu D, Wright J, De La Torre AR, McGuire PE, Timp W, Salzberg SL, Wegrzyn JL. A genome sequence for the threatened whitebark pine. G3 (Bethesda) 2024; 14:jkae061. [PMID: 38526344 DOI: 10.1093/g3journal/jkae061] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/11/2023] [Revised: 02/29/2024] [Accepted: 03/12/2024] [Indexed: 03/26/2024]
Abstract
Whitebark pine (WBP, Pinus albicaulis) is a white pine of subalpine regions in the Western contiguous United States and Canada. WBP has become critically threatened throughout a significant part of its natural range due to mortality from the introduced fungal pathogen white pine blister rust (WPBR, Cronartium ribicola) and additional threats from mountain pine beetle (Dendroctonus ponderosae), wildfire, and maladaptation due to changing climate. Vast acreages of WBP have suffered nearly complete mortality. Genomic technologies can contribute to a faster, more cost-effective approach to the traditional practices of identifying disease-resistant, climate-adapted seed sources for restoration. With deep-coverage Illumina short reads of haploid megagametophyte tissue and Oxford Nanopore long reads of diploid needle tissue, followed by a hybrid, multistep assembly approach, we produced a final assembly containing 27.6 Gb of sequence in 92,740 contigs (N50 537,007 bp) and 34,716 scaffolds (N50 2.0 Gb). Approximately 87.2% (24.0 Gb) of total sequence was placed on the 12 WBP chromosomes. Annotation yielded 25,362 protein-coding genes, and over 77% of the genome was characterized as repeats. WBP has demonstrated the greatest variation in resistance to WPBR among the North American white pines. Candidate genes for quantitative resistance include disease resistance genes known as nucleotide-binding leucine-rich repeat receptors (NLRs). A combination of protein domain alignments and direct genome scanning was employed to fully describe the 3 subclasses of NLRs. Our high-quality reference sequence and annotation provide a marked improvement in NLR identification compared to previous assessments that leveraged de novo-assembled transcriptomes.
Affiliation(s)
- David B Neale
  - Department of Plant Sciences, University of California, Davis, CA 95616, USA
  - Whitebark Pine Ecosystem Foundation, Missoula, MT 59808, USA
- Aleksey V Zimin
  - Department of Biomedical Engineering and Center for Computational Biology, Johns Hopkins University, Baltimore, MD 21218, USA
- Amy Meltzer
  - Department of Biomedical Engineering and Center for Computational Biology, Johns Hopkins University, Baltimore, MD 21218, USA
- Akriti Bhattarai
  - Department of Ecology and Evolutionary Biology, University of Connecticut, Storrs, CT 06269, USA
- Maurice Amee
  - Department of Ecology and Evolutionary Biology, University of Connecticut, Storrs, CT 06269, USA
- Brian J Allen
  - Department of Plant Sciences, University of California, Davis, CA 95616, USA
  - University of California Cooperative Extension, Central Sierra, Jackson, CA 95642, USA
- Daniela Puiu
  - Department of Biomedical Engineering and Center for Computational Biology, Johns Hopkins University, Baltimore, MD 21218, USA
- Jessica Wright
  - USDA Forest Service, Pacific Southwest Research Station, Davis, CA 95618, USA
- Patrick E McGuire
  - Department of Plant Sciences, University of California, Davis, CA 95616, USA
- Winston Timp
  - Department of Biomedical Engineering and Center for Computational Biology, Johns Hopkins University, Baltimore, MD 21218, USA
- Steven L Salzberg
  - Department of Biomedical Engineering and Center for Computational Biology, Johns Hopkins University, Baltimore, MD 21218, USA
  - Departments of Computer Science and Biostatistics, Johns Hopkins University, Baltimore, MD 21218, USA
- Jill L Wegrzyn
  - Department of Ecology and Evolutionary Biology, University of Connecticut, Storrs, CT 06269, USA
  - Institute for Systems Genomics, University of Connecticut, Storrs, CT 06269, USA
2
Hickey G, Monlong J, Ebler J, Novak AM, Eizenga JM, Gao Y, Marschall T, Li H, Paten B, Abel HJ, Antonacci-Fulton LL, Asri M, Baid G, Baker CA, Belyaeva A, Billis K, Bourque G, Buonaiuto S, Carroll A, Chaisson MJP, Chang PC, Chang XH, Cheng H, Chu J, Cody S, Colonna V, Cook DE, Cook-Deegan RM, Cornejo OE, Diekhans M, Doerr D, Ebert P, Ebler J, Eichler EE, Eizenga JM, Fairley S, Fedrigo O, Felsenfeld AL, Feng X, Fischer C, Flicek P, Formenti G, Frankish A, Fulton RS, Gao Y, Garg S, Garrison E, Garrison NA, Giron CG, Green RE, Groza C, Guarracino A, Haggerty L, Hall IM, Harvey WT, Haukness M, Haussler D, Heumos S, Hickey G, Hoekzema K, Hourlier T, Howe K, Jain M, Jarvis ED, Ji HP, Kenny EE, Koenig BA, Kolesnikov A, Korbel JO, Kordosky J, Koren S, Lee H, Lewis AP, Li H, Liao WW, Lu S, Lu TY, Lucas JK, Magalhães H, Marco-Sola S, Marijon P, Markello C, Marschall T, Martin FJ, McCartney A, McDaniel J, Miga KH, Mitchell MW, Monlong J, Mountcastle J, Munson KM, Mwaniki MN, Nattestad M, Novak AM, Nurk S, Olsen HE, Olson ND, Paten B, Pesout T, Phillippy AM, Popejoy AB, Porubsky D, Prins P, Puiu D, Rautiainen M, Regier AA, Rhie A, Sacco S, Sanders AD, Schneider VA, Schultz BI, Shafin K, Sibbesen JA, Sirén J, Smith MW, Sofia HJ, Tayoun ANA, Thibaud-Nissen F, Tomlinson C, Tricomi FF, Villani F, Vollger MR, Wagner J, Walenz B, Wang T, Wood JMD, Zimin AV, Zook JM. Pangenome graph construction from genome alignments with Minigraph-Cactus. Nat Biotechnol 2024; 42:663-673. [PMID: 37165083 PMCID: PMC10638906 DOI: 10.1038/s41587-023-01793-w] [Citation(s) in RCA: 14] [Impact Index Per Article: 14.0] [Received: 10/06/2022] [Accepted: 04/18/2023] [Indexed: 05/12/2023]
Abstract
Pangenome references address biases of reference genomes by storing a representative set of diverse haplotypes and their alignment, usually as a graph. Alternate alleles determined by variant callers can be used to construct pangenome graphs, but advances in long-read sequencing are leading to widely available, high-quality phased assemblies. Constructing a pangenome graph directly from assemblies, as opposed to variant calls, leverages the graph's ability to represent variation at different scales. Here we present the Minigraph-Cactus pangenome pipeline, which creates pangenomes directly from whole-genome alignments, and demonstrate its ability to scale to 90 human haplotypes from the Human Pangenome Reference Consortium. The method builds graphs containing all forms of genetic variation while allowing use of current mapping and genotyping tools. We measure the effect of the quality and completeness of reference genomes used for analysis within the pangenomes and show that using the CHM13 reference from the Telomere-to-Telomere Consortium improves the accuracy of our methods. We also demonstrate construction of a Drosophila melanogaster pangenome.
Affiliation(s)
- Glenn Hickey
  - UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA, USA
  - These authors contributed equally: Glenn Hickey, Jean Monlong
- Jean Monlong
  - UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA, USA
  - These authors contributed equally: Glenn Hickey, Jean Monlong
- Jana Ebler
  - Institute for Medical Biometry and Bioinformatics, Medical Faculty, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
  - Center for Digital Medicine, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
- Adam M. Novak
  - UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA, USA
- Jordan M. Eizenga
  - UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA, USA
- Yan Gao
  - Center for Computational and Genomic Medicine, The Children’s Hospital of Philadelphia, Philadelphia, PA, USA
- Tobias Marschall
  - Institute for Medical Biometry and Bioinformatics, Medical Faculty, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
  - Center for Digital Medicine, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
- Heng Li
  - Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA, USA
  - Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
- Benedict Paten
  - UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA, USA
- Haley J. Abel
  - Division of Oncology, Department of Internal Medicine, Washington University School of Medicine, St. Louis, MO, USA
- Mobin Asri
  - UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA, USA
- Carl A. Baker
  - Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
- Konstantinos Billis
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
- Guillaume Bourque
  - Department of Human Genetics, McGill University, Montreal, QC, Canada
  - Canadian Center for Computational Genomics, McGill University, Montreal, QC, Canada
  - Institute for the Advanced Study of Human Biology (WPI-ASHBi), Kyoto University, Kyoto, Japan
- Silvia Buonaiuto
  - Institute of Genetics and Biophysics, National Research Council, Naples, Italy
- Mark J. P. Chaisson
  - Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA
- Xian H. Chang
  - UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA, USA
- Haoyu Cheng
  - Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA, USA
  - Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
- Justin Chu
  - Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA, USA
- Sarah Cody
  - McDonnell Genome Institute, Washington University School of Medicine, St. Louis, MO, USA
- Vincenza Colonna
  - Institute of Genetics and Biophysics, National Research Council, Naples, Italy
  - Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
- Robert M. Cook-Deegan
  - Arizona State University, Barrett and O’Connor Washington Center, Washington, DC, USA
- Omar E. Cornejo
  - Department of Ecology and Evolutionary Biology, University of California, Santa Cruz, Santa Cruz, CA, USA
- Mark Diekhans
  - UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA, USA
- Daniel Doerr
  - Center for Digital Medicine, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
  - Institute for Medical Biometry and Bioinformatics, Medical Faculty, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
- Peter Ebert
  - Center for Digital Medicine, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
  - Institute for Medical Biometry and Bioinformatics, Medical Faculty, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
  - Core Unit Bioinformatics, Medical Faculty, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
- Jana Ebler
  - Center for Digital Medicine, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
  - Institute for Medical Biometry and Bioinformatics, Medical Faculty, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
- Evan E. Eichler
  - Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
  - Howard Hughes Medical Institute, Chevy Chase, MD, USA
- Jordan M. Eizenga
  - UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA, USA
- Susan Fairley
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
- Olivier Fedrigo
  - Vertebrate Genome Laboratory, The Rockefeller University, New York, NY, USA
- Adam L. Felsenfeld
  - National Institutes of Health (NIH)–National Human Genome Research Institute, Bethesda, MD, USA
- Xiaowen Feng
  - Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA, USA
  - Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
- Christian Fischer
  - Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
- Paul Flicek
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
- Giulio Formenti
  - Vertebrate Genome Laboratory, The Rockefeller University, New York, NY, USA
- Adam Frankish
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
- Robert S. Fulton
  - McDonnell Genome Institute, Washington University School of Medicine, St. Louis, MO, USA
  - Department of Genetics, Washington University School of Medicine, St. Louis, MO, USA
- Yan Gao
  - Center for Computational and Genomic Medicine, The Children’s Hospital of Philadelphia, Philadelphia, PA, USA
- Shilpa Garg
  - Novo Nordisk Foundation Center for Biosustainability, Technical University of Denmark, Copenhagen, Denmark
- Erik Garrison
  - Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
- Nanibaa’ A. Garrison
  - Institute for Society and Genetics, College of Letters and Science, University of California, Los Angeles, Los Angeles, CA, USA
  - Institute for Precision Health, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, CA, USA
  - Division of General Internal Medicine and Health Services Research, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, CA, USA
- Carlos Garcia Giron
  - Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
- Richard E. Green
  - Department of Biomolecular Engineering, University of California, Santa Cruz, Santa Cruz, CA, USA
  - Dovetail Genomics, Scotts Valley, CA, USA
- Cristian Groza
  - Quantitative Life Sciences, McGill University, Montreal, QC, Canada
- Andrea Guarracino
  - Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
  - Genomics Research Centre, Human Technopole, Milan, Italy
- Leanne Haggerty
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
- Ira M. Hall
  - Department of Genetics, Yale University School of Medicine, New Haven, CT, USA
  - Center for Genomic Health, Yale University School of Medicine, New Haven, CT, USA
- William T. Harvey
  - Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
- Marina Haukness
  - UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA, USA
- David Haussler
  - UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA, USA
  - Howard Hughes Medical Institute, Chevy Chase, MD, USA
- Simon Heumos
  - Quantitative Biology Center (QBiC), University of Tübingen, Tübingen, Germany
  - Biomedical Data Science, Department of Computer Science, University of Tübingen, Tübingen, Germany
- Glenn Hickey
  - UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA, USA
  - These authors contributed equally: Glenn Hickey, Jean Monlong
- Kendra Hoekzema
  - Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
- Thibaut Hourlier
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
- Kerstin Howe
  - Tree of Life, Wellcome Sanger Institute, Hinxton, Cambridge, UK
- Miten Jain
  - Northeastern University, Boston, MA, USA
- Erich D. Jarvis
  - Howard Hughes Medical Institute, Chevy Chase, MD, USA
  - Vertebrate Genome Laboratory, The Rockefeller University, New York, NY, USA
  - Laboratory of Neurogenetics of Language, The Rockefeller University, New York, NY, USA
- Hanlee P. Ji
  - Division of Oncology, Department of Medicine, Stanford University School of Medicine, Stanford, CA, USA
- Eimear E. Kenny
  - Institute for Genomic Health, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Barbara A. Koenig
  - Program in Bioethics and Institute for Human Genetics, University of California, San Francisco, San Francisco, CA, USA
- Jan O. Korbel
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
  - European Molecular Biology Laboratory, Genome Biology Unit, Heidelberg, Germany
- Jennifer Kordosky
  - Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
- Sergey Koren
  - Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
- HoJoon Lee
  - Division of Oncology, Department of Medicine, Stanford University School of Medicine, Stanford, CA, USA
- Alexandra P. Lewis
  - Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
- Heng Li
  - Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA, USA
  - Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
- Wen-Wei Liao
  - Department of Genetics, Yale University School of Medicine, New Haven, CT, USA
  - Center for Genomic Health, Yale University School of Medicine, New Haven, CT, USA
  - Division of Biology and Biomedical Sciences, Washington University School of Medicine, St. Louis, MO, USA
- Shuangjia Lu
  - Department of Genetics, Yale University School of Medicine, New Haven, CT, USA
- Tsung-Yu Lu
  - Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA
- Julian K. Lucas
  - UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA, USA
- Hugo Magalhães
  - Center for Digital Medicine, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
  - Institute for Medical Biometry and Bioinformatics, Medical Faculty, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
- Santiago Marco-Sola
  - Computer Sciences Department, Barcelona Supercomputing Center, Barcelona, Spain
  - Departament d’Arquitectura de Computadors i Sistemes Operatius, Universitat Autònoma de Barcelona, Barcelona, Spain
- Pierre Marijon
  - Center for Digital Medicine, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
  - Institute for Medical Biometry and Bioinformatics, Medical Faculty, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
- Charles Markello
  - UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA, USA
- Tobias Marschall
  - Center for Digital Medicine, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
  - Institute for Medical Biometry and Bioinformatics, Medical Faculty, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
- Fergal J. Martin
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
- Ann McCartney
  - Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
- Jennifer McDaniel
  - Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA
- Karen H. Miga
  - UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA, USA
- Jean Monlong
  - UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA, USA
  - These authors contributed equally: Glenn Hickey, Jean Monlong
- Katherine M. Munson
  - Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
- Adam M. Novak
  - UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA, USA
- Sergey Nurk
  - Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
- Hugh E. Olsen
  - UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA, USA
- Nathan D. Olson
  - Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA
- Benedict Paten
  - UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA, USA
- Trevor Pesout
  - UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA, USA
- Adam M. Phillippy
  - Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
- Alice B. Popejoy
  - Department of Public Health Sciences, University of California, Davis, Davis, CA, USA
- David Porubsky
  - Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
- Pjotr Prins
  - Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
- Daniela Puiu
  - Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
- Mikko Rautiainen
  - Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
- Allison A. Regier
  - McDonnell Genome Institute, Washington University School of Medicine, St. Louis, MO, USA
- Arang Rhie
  - Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
- Samuel Sacco
  - Department of Ecology and Evolutionary Biology, University of California, Santa Cruz, Santa Cruz, CA, USA
- Ashley D. Sanders
  - Berlin Institute for Medical Systems Biology, Max Delbrück Center for Molecular Medicine in the Helmholtz Association, Berlin, Germany
- Valerie A. Schneider
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
- Baergen I. Schultz
  - National Institutes of Health (NIH)–National Human Genome Research Institute, Bethesda, MD, USA
- Jonas A. Sibbesen
  - Center for Health Data Science, University of Copenhagen, Copenhagen, Denmark
- Jouni Sirén
  - UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA, USA
- Michael W. Smith
  - National Institutes of Health (NIH)–National Human Genome Research Institute, Bethesda, MD, USA
- Heidi J. Sofia
  - National Institutes of Health (NIH)–National Human Genome Research Institute, Bethesda, MD, USA
- Ahmad N. Abou Tayoun
  - Al Jalila Genomics Center of Excellence, Al Jalila Children’s Specialty Hospital, Dubai, UAE
  - Center for Genomic Discovery, Mohammed Bin Rashid University of Medicine and Health Sciences, Dubai, UAE
- Françoise Thibaud-Nissen
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
- Chad Tomlinson
  - McDonnell Genome Institute, Washington University School of Medicine, St. Louis, MO, USA
- Francesca Floriana Tricomi
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
- Flavia Villani
  - Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
- Mitchell R. Vollger
  - Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
  - Division of Medical Genetics, University of Washington School of Medicine, Seattle, WA, USA
- Justin Wagner
  - Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA
- Brian Walenz
  - Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
- Ting Wang
  - McDonnell Genome Institute, Washington University School of Medicine, St. Louis, MO, USA
  - Department of Genetics, Washington University School of Medicine, St. Louis, MO, USA
- Aleksey V. Zimin
  - Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
  - Center for Computational Biology, Johns Hopkins University, Baltimore, MD, USA
- Justin M. Zook
  - Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA
3
Neale DB, Zimin AV, Meltzer A, Bhattarai A, Amee M, Corona LF, Allen BJ, Puiu D, Wright J, Torre ARDL, McGuire PE, Timp W, Salzberg SL, Wegrzyn JL. A Genome Sequence for the Threatened Whitebark Pine. bioRxiv 2023:2023.11.16.567420. [PMID: 38014212 PMCID: PMC10680812 DOI: 10.1101/2023.11.16.567420] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/29/2023]
Abstract
Whitebark pine (WBP, Pinus albicaulis) is a white pine of subalpine regions in the western contiguous US and Canada. WBP has become critically threatened throughout a significant part of its natural range due to mortality from the introduced fungal pathogen white pine blister rust (WPBR, Cronartium ribicola) and additional threats from mountain pine beetle (Dendroctonus ponderosae), wildfire, and maladaptation due to changing climate. Vast acreages of WBP have suffered nearly complete mortality. Genomic technologies can contribute to a faster, more cost-effective approach to the traditional practices of identifying disease-resistant, climate-adapted seed sources for restoration. With deep-coverage Illumina short reads of haploid megagametophyte tissue and Oxford Nanopore long reads of diploid needle tissue, followed by a hybrid, multistep assembly approach, we produced a final assembly containing 27.6 Gbp of sequence in 92,740 contigs (N50 537,007 bp) and 34,716 scaffolds (N50 2.0 Gbp). Approximately 87.2% (24.0 Gbp) of total sequence was placed on the twelve WBP chromosomes. Annotation yielded 25,362 protein-coding genes, and over 77% of the genome was characterized as repeats. WBP has demonstrated the greatest variation in resistance to WPBR among the North American white pines. Candidate genes for quantitative resistance include disease resistance genes known as nucleotide-binding leucine-rich-repeat receptors (NLRs). A combination of protein domain alignments and direct genome scanning was employed to fully describe the three subclasses of NLRs (TNL, CNL, RNL). Our high-quality reference sequence and annotation provide a marked improvement in NLR identification compared to previous assessments that leveraged de novo assembled transcriptomes.
4
Reinhardt JA, Baker RH, Zimin AV, Ladias C, Paczolt KA, Werren JH, Hayashi CY, Wilkinson GS. Impacts of Sex Ratio Meiotic Drive on Genome Structure and Function in a Stalk-Eyed Fly. Genome Biol Evol 2023; 15:evad118. [PMID: 37364298 PMCID: PMC10319772 DOI: 10.1093/gbe/evad118] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 07/13/2022] [Revised: 06/02/2023] [Accepted: 06/15/2023] [Indexed: 06/28/2023]
Abstract
Stalk-eyed flies in the genus Teleopsis carry selfish genetic elements that induce sex ratio (SR) meiotic drive and impact the fitness of male and female carriers. Here, we assemble and describe a chromosome-level genome assembly of the stalk-eyed fly, Teleopsis dalmanni, to elucidate patterns of divergence associated with SR. The genome contains tens of thousands of transposable element (TE) insertions and hundreds of transcriptionally and insertionally active TE families. By resequencing pools of SR and ST males using short and long reads, we find widespread differentiation and divergence between XSR and XST associated with multiple nested inversions involving most of the SR haplotype. Examination of genomic coverage and gene expression data revealed seven X-linked genes with elevated expression and coverage in SR males. The most extreme and likely drive candidate involves an XSR-specific expansion of an array of partial copies of JASPer, a gene necessary for maintenance of euchromatin and associated with regulation of TE expression. In addition, we find evidence for rapid protein evolution between XSR and XST for testis expressed and novel genes, that is, either recent duplicates or lacking a Dipteran ortholog, including an X-linked duplicate of maelstrom, which is also involved in TE silencing. Overall, the evidence suggests that this ancient XSR polymorphism has had a variety of impacts on repetitive DNA and its regulation in this species.
Affiliation(s)
- Richard H Baker
  - Sackler Institute for Comparative Genomics, American Museum of Natural History, New York, New York, USA
- Aleksey V Zimin
  - Department of Biomedical Engineering, Johns Hopkins University, Baltimore, Maryland, USA
- Chloe Ladias
  - Biology Department, State University of New York at Geneseo, Geneseo, New York, USA
- Kimberly A Paczolt
  - Department of Biology, University of Maryland, College Park, Maryland, USA
- John H Werren
  - Department of Biology, University of Rochester, Rochester, New York, USA
- Cheryl Y Hayashi
  - Sackler Institute for Comparative Genomics, American Museum of Natural History, New York, New York, USA
- Gerald S Wilkinson
  - Department of Biology, University of Maryland, College Park, Maryland, USA
5
Liao WW, Asri M, Ebler J, Doerr D, Haukness M, Hickey G, Lu S, Lucas JK, Monlong J, Abel HJ, Buonaiuto S, Chang XH, Cheng H, Chu J, Colonna V, Eizenga JM, Feng X, Fischer C, Fulton RS, Garg S, Groza C, Guarracino A, Harvey WT, Heumos S, Howe K, Jain M, Lu TY, Markello C, Martin FJ, Mitchell MW, Munson KM, Mwaniki MN, Novak AM, Olsen HE, Pesout T, Porubsky D, Prins P, Sibbesen JA, Sirén J, Tomlinson C, Villani F, Vollger MR, Antonacci-Fulton LL, Baid G, Baker CA, Belyaeva A, Billis K, Carroll A, Chang PC, Cody S, Cook DE, Cook-Deegan RM, Cornejo OE, Diekhans M, Ebert P, Fairley S, Fedrigo O, Felsenfeld AL, Formenti G, Frankish A, Gao Y, Garrison NA, Giron CG, Green RE, Haggerty L, Hoekzema K, Hourlier T, Ji HP, Kenny EE, Koenig BA, Kolesnikov A, Korbel JO, Kordosky J, Koren S, Lee H, Lewis AP, Magalhães H, Marco-Sola S, Marijon P, McCartney A, McDaniel J, Mountcastle J, Nattestad M, Nurk S, Olson ND, Popejoy AB, Puiu D, Rautiainen M, Regier AA, Rhie A, Sacco S, Sanders AD, Schneider VA, Schultz BI, Shafin K, Smith MW, Sofia HJ, Abou Tayoun AN, Thibaud-Nissen F, Tricomi FF, Wagner J, Walenz B, Wood JMD, Zimin AV, Bourque G, Chaisson MJP, Flicek P, Phillippy AM, Zook JM, Eichler EE, Haussler D, Wang T, Jarvis ED, Miga KH, Garrison E, Marschall T, Hall IM, Li H, Paten B. A draft human pangenome reference. Nature 2023; 617:312-324. [PMID: 37165242 PMCID: PMC10172123 DOI: 10.1038/s41586-023-05896-x] [Citation(s) in RCA: 170] [Impact Index Per Article: 170.0] [Received: 07/09/2022] [Accepted: 02/28/2023] [Indexed: 05/12/2023]
Abstract
Here the Human Pangenome Reference Consortium presents a first draft of the human pangenome reference. The pangenome contains 47 phased, diploid assemblies from a cohort of genetically diverse individuals [1]. These assemblies cover more than 99% of the expected sequence in each genome and are more than 99% accurate at the structural and base pair levels. Based on alignments of the assemblies, we generate a draft pangenome that captures known variants and haplotypes and reveals new alleles at structurally complex loci. We also add 119 million base pairs of euchromatic polymorphic sequences and 1,115 gene duplications relative to the existing reference GRCh38. Roughly 90 million of the additional base pairs are derived from structural variation. Using our draft pangenome to analyse short-read data reduced small variant discovery errors by 34% and increased the number of structural variants detected per haplotype by 104% compared with GRCh38-based workflows, which enabled the typing of the vast majority of structural variant alleles per sample.
Affiliation(s)
- Wen-Wei Liao
- Department of Genetics, Yale University School of Medicine, New Haven, CT, USA
- Center for Genomic Health, Yale University School of Medicine, New Haven, CT, USA
- Division of Biology and Biomedical Sciences, Washington University School of Medicine, St. Louis, MO, USA
| | - Mobin Asri
- Genomics Institute, University of California, Santa Cruz, CA, USA
| | - Jana Ebler
- Institute for Medical Biometry and Bioinformatics, Medical Faculty, Heinrich Heine University, Düsseldorf, Germany
- Center for Digital Medicine, Heinrich Heine University, Düsseldorf, Germany
| | - Daniel Doerr
- Institute for Medical Biometry and Bioinformatics, Medical Faculty, Heinrich Heine University, Düsseldorf, Germany
- Center for Digital Medicine, Heinrich Heine University, Düsseldorf, Germany
| | - Marina Haukness
- Genomics Institute, University of California, Santa Cruz, CA, USA
| | - Glenn Hickey
- Genomics Institute, University of California, Santa Cruz, CA, USA
| | - Shuangjia Lu
- Department of Genetics, Yale University School of Medicine, New Haven, CT, USA
- Center for Genomic Health, Yale University School of Medicine, New Haven, CT, USA
| | - Julian K Lucas
- Genomics Institute, University of California, Santa Cruz, CA, USA
| | - Jean Monlong
- Genomics Institute, University of California, Santa Cruz, CA, USA
| | - Haley J Abel
- Division of Oncology, Department of Internal Medicine, Washington University School of Medicine, St. Louis, MO, USA
| | - Silvia Buonaiuto
- Institute of Genetics and Biophysics, National Research Council, Naples, Italy
| | - Xian H Chang
- Genomics Institute, University of California, Santa Cruz, CA, USA
| | - Haoyu Cheng
- Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA, USA
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
| | - Justin Chu
- Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA, USA
| | - Vincenza Colonna
- Institute of Genetics and Biophysics, National Research Council, Naples, Italy
- Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
| | - Jordan M Eizenga
- Genomics Institute, University of California, Santa Cruz, CA, USA
| | - Xiaowen Feng
- Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA, USA
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
| | - Christian Fischer
- Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
| | - Robert S Fulton
- McDonnell Genome Institute, Washington University School of Medicine, St. Louis, MO, USA
- Department of Genetics, Washington University School of Medicine, St. Louis, MO, USA
| | - Shilpa Garg
- Novo Nordisk Foundation Center for Biosustainability, Technical University of Denmark, Copenhagen, Denmark
| | - Cristian Groza
- Quantitative Life Sciences, McGill University, Montréal, Québec, Canada
| | - Andrea Guarracino
- Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
- Genomics Research Centre, Human Technopole, Milan, Italy
| | - William T Harvey
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
| | - Simon Heumos
- Quantitative Biology Center (QBiC), University of Tübingen, Tübingen, Germany
- Biomedical Data Science, Department of Computer Science, University of Tübingen, Tübingen, Germany
| | - Kerstin Howe
- Tree of Life, Wellcome Sanger Institute, Hinxton, Cambridge, UK
| | - Miten Jain
- Northeastern University, Boston, MA, USA
| | - Tsung-Yu Lu
- Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA
| | - Charles Markello
- Genomics Institute, University of California, Santa Cruz, CA, USA
| | - Fergal J Martin
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
| | | | - Katherine M Munson
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
| | | | - Adam M Novak
- Genomics Institute, University of California, Santa Cruz, CA, USA
| | - Hugh E Olsen
- Genomics Institute, University of California, Santa Cruz, CA, USA
| | - Trevor Pesout
- Genomics Institute, University of California, Santa Cruz, CA, USA
| | - David Porubsky
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
| | - Pjotr Prins
- Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
| | - Jonas A Sibbesen
- Center for Health Data Science, University of Copenhagen, Copenhagen, Denmark
| | - Jouni Sirén
- Genomics Institute, University of California, Santa Cruz, CA, USA
| | - Chad Tomlinson
- McDonnell Genome Institute, Washington University School of Medicine, St. Louis, MO, USA
| | - Flavia Villani
- Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
| | - Mitchell R Vollger
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
- Division of Medical Genetics, University of Washington School of Medicine, Seattle, WA, USA
| | | | | | - Carl A Baker
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
| | | | - Konstantinos Billis
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
| | | | | | - Sarah Cody
- McDonnell Genome Institute, Washington University School of Medicine, St. Louis, MO, USA
| | | | - Robert M Cook-Deegan
- Barrett and O'Connor Washington Center, Arizona State University, Washington, DC, USA
| | - Omar E Cornejo
- Department of Ecology and Evolutionary Biology, University of California, Santa Cruz, CA, USA
| | - Mark Diekhans
- Genomics Institute, University of California, Santa Cruz, CA, USA
| | - Peter Ebert
- Institute for Medical Biometry and Bioinformatics, Medical Faculty, Heinrich Heine University, Düsseldorf, Germany
- Center for Digital Medicine, Heinrich Heine University, Düsseldorf, Germany
- Core Unit Bioinformatics, Medical Faculty, Heinrich Heine University, Düsseldorf, Germany
| | - Susan Fairley
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
| | - Olivier Fedrigo
- Vertebrate Genome Laboratory, The Rockefeller University, New York, NY, USA
| | - Adam L Felsenfeld
- National Institutes of Health (NIH)-National Human Genome Research Institute, Bethesda, MD, USA
| | - Giulio Formenti
- Vertebrate Genome Laboratory, The Rockefeller University, New York, NY, USA
| | - Adam Frankish
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
| | - Yan Gao
- Center for Computational and Genomic Medicine, The Children's Hospital of Philadelphia, Philadelphia, PA, USA
| | - Nanibaa' A Garrison
- Institute for Society and Genetics, College of Letters and Science, University of California, Los Angeles, CA, USA
- Institute for Precision Health, David Geffen School of Medicine, University of California, Los Angeles, CA, USA
- Division of General Internal Medicine and Health Services Research, David Geffen School of Medicine, University of California, Los Angeles, CA, USA
| | - Carlos Garcia Giron
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
| | - Richard E Green
- Department of Biomolecular Engineering, University of California, Santa Cruz, CA, USA
- Dovetail Genomics, Scotts Valley, CA, USA
| | - Leanne Haggerty
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
| | - Kendra Hoekzema
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
| | - Thibaut Hourlier
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
| | - Hanlee P Ji
- Division of Oncology, Department of Medicine, Stanford University School of Medicine, Stanford, CA, USA
| | - Eimear E Kenny
- Institute for Genomic Health, Icahn School of Medicine at Mount Sinai, New York, NY, USA
| | - Barbara A Koenig
- Program in Bioethics and Institute for Human Genetics, University of California, San Francisco, CA, USA
| | | | - Jan O Korbel
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
- Genome Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany
| | - Jennifer Kordosky
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
| | - Sergey Koren
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | - HoJoon Lee
- Division of Oncology, Department of Medicine, Stanford University School of Medicine, Stanford, CA, USA
| | - Alexandra P Lewis
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
| | - Hugo Magalhães
- Institute for Medical Biometry and Bioinformatics, Medical Faculty, Heinrich Heine University, Düsseldorf, Germany
- Center for Digital Medicine, Heinrich Heine University, Düsseldorf, Germany
| | - Santiago Marco-Sola
- Computer Sciences Department, Barcelona Supercomputing Center, Barcelona, Spain
- Departament d'Arquitectura de Computadors i Sistemes Operatius, Universitat Autònoma de Barcelona, Barcelona, Spain
| | - Pierre Marijon
- Institute for Medical Biometry and Bioinformatics, Medical Faculty, Heinrich Heine University, Düsseldorf, Germany
- Center for Digital Medicine, Heinrich Heine University, Düsseldorf, Germany
| | - Ann McCartney
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | - Jennifer McDaniel
- Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA
| | | | | | - Sergey Nurk
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | - Nathan D Olson
- Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA
| | - Alice B Popejoy
- Department of Public Health Sciences, University of California, Davis, CA, USA
| | - Daniela Puiu
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
| | - Mikko Rautiainen
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | - Allison A Regier
- McDonnell Genome Institute, Washington University School of Medicine, St. Louis, MO, USA
| | - Arang Rhie
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | - Samuel Sacco
- Department of Ecology and Evolutionary Biology, University of California, Santa Cruz, CA, USA
| | - Ashley D Sanders
- Berlin Institute for Medical Systems Biology, Max Delbrück Center for Molecular Medicine in the Helmholtz Association, Berlin, Germany
| | - Valerie A Schneider
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
| | - Baergen I Schultz
- National Institutes of Health (NIH)-National Human Genome Research Institute, Bethesda, MD, USA
| | | | - Michael W Smith
- National Institutes of Health (NIH)-National Human Genome Research Institute, Bethesda, MD, USA
| | - Heidi J Sofia
- National Institutes of Health (NIH)-National Human Genome Research Institute, Bethesda, MD, USA
| | - Ahmad N Abou Tayoun
- Al Jalila Genomics Center of Excellence, Al Jalila Children's Specialty Hospital, Dubai, UAE
- Center for Genomic Discovery, Mohammed Bin Rashid University of Medicine and Health Sciences, Dubai, UAE
| | - Françoise Thibaud-Nissen
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
| | - Francesca Floriana Tricomi
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
| | - Justin Wagner
- Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA
| | - Brian Walenz
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | | | - Aleksey V Zimin
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD, USA
| | - Guillaume Bourque
- Department of Human Genetics, McGill University, Montréal, Québec, Canada
- Canadian Center for Computational Genomics, McGill University, Montréal, Québec, Canada
- Institute for the Advanced Study of Human Biology (WPI-ASHBi), Kyoto University, Kyoto, Japan
| | - Mark J P Chaisson
- Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA
| | - Paul Flicek
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
| | - Adam M Phillippy
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | - Justin M Zook
- Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA
| | - Evan E Eichler
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
- Howard Hughes Medical Institute, Chevy Chase, MD, USA
| | - David Haussler
- Genomics Institute, University of California, Santa Cruz, CA, USA
- Howard Hughes Medical Institute, Chevy Chase, MD, USA
| | - Ting Wang
- McDonnell Genome Institute, Washington University School of Medicine, St. Louis, MO, USA
- Department of Genetics, Washington University School of Medicine, St. Louis, MO, USA
| | - Erich D Jarvis
- Vertebrate Genome Laboratory, The Rockefeller University, New York, NY, USA
- Howard Hughes Medical Institute, Chevy Chase, MD, USA
- Laboratory of Neurogenetics of Language, The Rockefeller University, New York, NY, USA
| | - Karen H Miga
- Genomics Institute, University of California, Santa Cruz, CA, USA
| | - Erik Garrison
- Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA.
| | - Tobias Marschall
- Institute for Medical Biometry and Bioinformatics, Medical Faculty, Heinrich Heine University, Düsseldorf, Germany.
- Center for Digital Medicine, Heinrich Heine University, Düsseldorf, Germany.
| | - Ira M Hall
- Department of Genetics, Yale University School of Medicine, New Haven, CT, USA.
- Center for Genomic Health, Yale University School of Medicine, New Haven, CT, USA.
| | - Heng Li
- Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA, USA.
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA.
| | - Benedict Paten
- Genomics Institute, University of California, Santa Cruz, CA, USA.
| |
|
6
|
Chao KH, Zimin AV, Pertea M, Salzberg SL. The first gapless, reference-quality, fully annotated genome from a Southern Han Chinese individual. G3 (Bethesda) 2023; 13:jkac321. [PMID: 36630290 PMCID: PMC9997556 DOI: 10.1093/g3journal/jkac321] [Citation(s) in RCA: 8] [Impact Index Per Article: 8.0] [Received: 08/08/2022] [Revised: 10/27/2022] [Accepted: 11/03/2022] [Indexed: 01/12/2023]
Abstract
We used long-read DNA sequencing to assemble the genome of a Southern Han Chinese male. We organized the sequence into chromosomes and filled in gaps using the recently completed T2T-CHM13 genome as a guide, yielding a gap-free genome, Han1, containing 3,099,707,698 bases. Using the T2T-CHM13 annotation as a reference, we mapped all genes onto the Han1 genome and identified additional gene copies, generating a total of 60,708 putative genes, of which 20,003 are protein-coding. A comprehensive comparison between the genes revealed that 235 protein-coding genes were substantially different between the individuals, with frameshifts or truncations affecting the protein-coding sequence. Most of these were heterozygous variants in which one gene copy was unaffected. This represents the first gene-level comparison between two finished, annotated individual human genomes.
Affiliation(s)
- Kuan-Hao Chao
- Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218, USA
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD 21218, USA
| | - Aleksey V Zimin
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD 21218, USA
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD 21218, USA
| | - Mihaela Pertea
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD 21218, USA
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD 21218, USA
| | - Steven L Salzberg
- Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218, USA
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD 21218, USA
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD 21218, USA
- Department of Biostatistics, Johns Hopkins University, Baltimore, MD 21211, USA
| |
|
7
|
Guo A, Salzberg SL, Zimin AV. JASPER: A fast genome polishing tool that improves accuracy of genome assemblies. PLoS Comput Biol 2023; 19:e1011032. [PMID: 37000853 PMCID: PMC10096238 DOI: 10.1371/journal.pcbi.1011032] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/22/2022] [Revised: 04/12/2023] [Accepted: 03/16/2023] [Indexed: 04/03/2023]
Abstract
Advances in long-read sequencing technologies have dramatically improved the contiguity and completeness of genome assemblies. Using the latest nanopore-based sequencers, we can generate enough data for the assembly of a human genome from a single flow cell. With the long-read data from these sequences, we can now routinely produce de novo genome assemblies in which half or more of a genome is contained in megabase-scale contigs. Assemblies produced from nanopore data alone, though, have relatively high error rates and can benefit from a process called polishing, in which more-accurate reads are used to correct errors in the consensus sequence. In this manuscript, we present a novel tool for genome polishing called JASPER (Jellyfish-based Assembly Sequence Polisher for Error Reduction). In contrast to many other polishing methods, JASPER gains efficiency by avoiding the alignment of reads to the assembly. Instead, JASPER uses a database of k-mer counts that it creates from the reads to detect and correct errors in the consensus. Our experiments demonstrate that JASPER is faster than alignment-based polishers, and both faster and more accurate than other k-mer based polishing methods. We also introduce the idea of using a polishing tool to create population-specific reference genomes, and illustrate this idea using sequence data from multiple individuals from Tokyo, Japan.
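The alignment-free idea described in this abstract (count k-mers in accurate reads, then use the counts to find and fix errors in the consensus) can be illustrated with a toy sketch. This is not JASPER's implementation: the function names, the greedy single-base substitution strategy, the value of k, and the `min_count` threshold are all assumptions chosen for illustration, and a plain Python `Counter` stands in for a Jellyfish k-mer database.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer in the reads (toy stand-in for a k-mer count database)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def suspicious_positions(consensus, counts, k, min_count=2):
    """Start positions of consensus k-mers seen fewer than min_count times in the reads."""
    return [
        i
        for i in range(len(consensus) - k + 1)
        if counts[consensus[i:i + k]] < min_count
    ]


def polish(consensus, counts, k, min_count=2):
    """Greedily apply single-base substitutions that reduce the number of low-count k-mers.

    Hypothetical correction strategy for illustration only; it handles
    substitutions, not insertions or deletions.
    """
    best = consensus
    best_bad = len(suspicious_positions(best, counts, k, min_count))
    improved = True
    while improved and best_bad > 0:
        improved = False
        for pos in range(len(best)):
            for base in "ACGT":
                if base == best[pos]:
                    continue
                candidate = best[:pos] + base + best[pos + 1:]
                bad = len(suspicious_positions(candidate, counts, k, min_count))
                if bad < best_bad:
                    best, best_bad, improved = candidate, bad, True
    return best


if __name__ == "__main__":
    reads = ["ACGTACGGTCA"] * 3          # accurate reads covering the region
    draft = "ACGTACCGTCA"                # consensus with one substitution error
    kmer_counts = count_kmers(reads, k=4)
    print(suspicious_positions(draft, kmer_counts, k=4))
    print(polish(draft, kmer_counts, k=4))
```

A single wrong base makes every k-mer overlapping it rare or absent in the reads, which is why a cluster of low-count k-mers localizes the error without aligning any read to the assembly.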
Affiliation(s)
- Alina Guo
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, Maryland, United States of America
- Department of Applied Mathematics and Statistics, Johns Hopkins University, Baltimore, Maryland, United States of America
| | - Steven L. Salzberg
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, Maryland, United States of America
- Center for Computational Biology, Johns Hopkins University, Baltimore, Maryland, United States of America
- Department of Computer Science, Johns Hopkins University, Baltimore, Maryland, United States of America
- Department of Biostatistics, Johns Hopkins University, Baltimore, Maryland, United States of America
| | - Aleksey V. Zimin
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, Maryland, United States of America
- Center for Computational Biology, Johns Hopkins University, Baltimore, Maryland, United States of America
| |
|
8
|
Miller J, Zimin AV, Gordus A. Chromosome-level genome and the identification of sex chromosomes in Uloborus diversus. Gigascience 2022; 12:giad002. [PMID: 36762707 PMCID: PMC9912274 DOI: 10.1093/gigascience/giad002] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 06/28/2022] [Revised: 11/18/2022] [Accepted: 01/03/2023] [Indexed: 02/11/2023]
Abstract
The orb web is a remarkable example of animal architecture that is observed in families of spiders that diverged over 200 million years ago. While several genomes exist for araneid orb-weavers, none exist for other orb-weaving families, hampering efforts to investigate the genetic basis of this complex behavior. Here we present a chromosome-level genome assembly for the cribellate orb-weaving spider Uloborus diversus. The assembly reinforces evidence of an ancient arachnid genome duplication and identifies complete open reading frames for every class of spidroin gene, which encode the proteins that are the key structural components of spider silks. We identify the 2 X chromosomes of U. diversus and candidate sex-determining loci. This chromosome-level assembly will be a valuable resource for evolutionary research into the origins of orb-weaving, spidroin evolution, chromosomal rearrangement, and chromosomal sex determination in spiders.
Affiliation(s)
- Jeremiah Miller
- Department of Biology, Johns Hopkins University, Baltimore, MD 21218, USA
| | - Aleksey V Zimin
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD 21218, USA
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD 21218, USA
| | - Andrew Gordus
- Department of Biology, Johns Hopkins University, Baltimore, MD 21218, USA
- Solomon H. Snyder Department of Neuroscience, Johns Hopkins University, Baltimore, MD 21218, USA
| |
|
9
|
Abstract
Kraken and KrakenUniq are widely used tools for classifying metagenomics sequences. A key requirement for these systems is a database containing all k-mers from all genomes that the users want to be able to detect, where k = 31 by default. This database can be very large, easily exceeding 100 gigabytes (GB) and sometimes 400 GB. Previously, Kraken and KrakenUniq required loading the entire database into main memory (RAM), and if RAM was insufficient, they used memory mapping, which significantly increased the running time for large datasets. We have implemented a new algorithm in KrakenUniq that allows it to load and process the database in chunks, with only a modest increase in running time. This enhancement now makes it feasible to run KrakenUniq on very large datasets and huge databases on virtually any computer, even a laptop, while providing the same very high classification accuracy as the previous system.
Statement of need
The KrakenUniq software classifies reads from metagenomic samples to establish which organisms are present in the samples and estimate their abundance. The software is widely used by researchers and clinicians in medical diagnostics, microbiome and environmental studies. Typical databases used by KrakenUniq are tens to hundreds of gigabytes in size. The original KrakenUniq code required loading the entire database in RAM, which demanded expensive high-memory servers to run it efficiently. If a user did not have enough physical RAM to load the entire database, KrakenUniq resorted to memory-mapping the database, which significantly increased run times, frequently by a factor of more than 100. The new functionality described in this paper enables users who do not have access to high-memory servers to run KrakenUniq efficiently, with CPU time increasing by only 3- to 4-fold rather than by a factor of more than 100.
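The chunked-database idea in this abstract can be sketched in a few lines. This is a minimal illustration, not KrakenUniq's algorithm or file format: the `(kmer, taxon)` pair layout, the function name `classify_in_chunks`, and the `chunk_size` parameter are assumptions made for the example. The point it shows is that only one slice of the database is held in memory at a time, and every read's k-mers are checked against each slice in turn.

```python
from collections import Counter
from itertools import islice


def classify_in_chunks(db_items, read_kmers, chunk_size):
    """Tally taxon hits per read while holding at most chunk_size database entries in memory.

    db_items:   iterable of (kmer, taxon) pairs, e.g. streamed from disk (hypothetical layout)
    read_kmers: dict mapping read id -> list of k-mers extracted from that read
    """
    hits = {}
    stream = iter(db_items)
    while True:
        # Load the next slice of the database; earlier slices are discarded.
        chunk = dict(islice(stream, chunk_size))
        if not chunk:
            break
        # Every read is re-scanned against each chunk, trading CPU time for memory.
        for read_id, kmers in read_kmers.items():
            for kmer in kmers:
                taxon = chunk.get(kmer)
                if taxon is not None:
                    hits.setdefault(read_id, Counter())[taxon] += 1
    return hits


if __name__ == "__main__":
    database = [("AAA", "t1"), ("AAC", "t1"), ("CCC", "t2"), ("GGG", "t2"), ("TTT", "t3")]
    reads = {"r1": ["AAA", "AAC", "CCC"], "r2": ["TTT", "ACG"]}
    print(classify_in_chunks(database, reads, chunk_size=2))
```

Because the per-read tallies are accumulated across chunks, the result is the same as a lookup against the whole database at once; the cost is that the reads are scanned once per chunk, which is consistent with the modest increase in running time the abstract reports.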
Affiliation(s)
- Christopher Pockrandt
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD 21218, USA
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD 21218, USA
| | - Aleksey V Zimin
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD 21218, USA
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD 21218, USA
| | - Steven L Salzberg
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD 21218, USA
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD 21218, USA
- Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218, USA
- Department of Biostatistics, Johns Hopkins University, Baltimore, MD 21218, USA
| |
|
10
|
Jarvis ED, Formenti G, Rhie A, Guarracino A, Yang C, Wood J, Tracey A, Thibaud-Nissen F, Vollger MR, Porubsky D, Cheng H, Asri M, Logsdon GA, Carnevali P, Chaisson MJP, Chin CS, Cody S, Collins J, Ebert P, Escalona M, Fedrigo O, Fulton RS, Fulton LL, Garg S, Gerton JL, Ghurye J, Granat A, Green RE, Harvey W, Hasenfeld P, Hastie A, Haukness M, Jaeger EB, Jain M, Kirsche M, Kolmogorov M, Korbel JO, Koren S, Korlach J, Lee J, Li D, Lindsay T, Lucas J, Luo F, Marschall T, Mitchell MW, McDaniel J, Nie F, Olsen HE, Olson ND, Pesout T, Potapova T, Puiu D, Regier A, Ruan J, Salzberg SL, Sanders AD, Schatz MC, Schmitt A, Schneider VA, Selvaraj S, Shafin K, Shumate A, Stitziel NO, Stober C, Torrance J, Wagner J, Wang J, Wenger A, Xiao C, Zimin AV, Zhang G, Wang T, Li H, Garrison E, Haussler D, Hall I, Zook JM, Eichler EE, Phillippy AM, Paten B, Howe K, Miga KH. Semi-automated assembly of high-quality diploid human reference genomes. Nature 2022; 611:519-531. [PMID: 36261518 PMCID: PMC9668749 DOI: 10.1038/s41586-022-05325-5] [Citation(s) in RCA: 66] [Impact Index Per Article: 33.0] [Received: 12/29/2021] [Accepted: 09/06/2022] [Indexed: 01/01/2023]
Abstract
The current human reference genome, GRCh38, represents over 20 years of effort to generate a high-quality assembly, which has benefitted society [1,2]. However, it still has many gaps and errors, and does not represent a biological genome as it is a blend of multiple individuals [3,4]. Recently, a high-quality telomere-to-telomere reference, CHM13, was generated with the latest long-read technologies, but it was derived from a hydatidiform mole cell line with a nearly homozygous genome [5]. To address these limitations, the Human Pangenome Reference Consortium formed with the goal of creating high-quality, cost-effective, diploid genome assemblies for a pangenome reference that represents human genetic diversity [6]. Here, in our first scientific report, we determined which combination of current genome sequencing and assembly approaches yield the most complete and accurate diploid genome assembly with minimal manual curation. Approaches that used highly accurate long reads and parent-child data with graph-based haplotype phasing during assembly outperformed those that did not. Developing a combination of the top-performing methods, we generated our first high-quality diploid reference assembly, containing only approximately four gaps per chromosome on average, with most chromosomes within ±1% of the length of CHM13. Nearly 48% of protein-coding genes have non-synonymous amino acid changes between haplotypes, and centromeric regions showed the highest diversity. Our findings serve as a foundation for assembling near-complete diploid human genomes at scale for a pangenome reference to capture global genetic variation from single nucleotides to structural rearrangements.
Affiliation(s)
- Erich D. Jarvis
- Vertebrate Genome Laboratory, The Rockefeller University, New York, NY USA
- Howard Hughes Medical Institute, Chevy Chase, MD USA
| | - Giulio Formenti
- Vertebrate Genome Laboratory, The Rockefeller University, New York, NY USA
| | - Arang Rhie
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD USA
| | - Andrea Guarracino
- Genomics Research Centre, Human Technopole, Viale Rita Levi-Montalcini, Milan, Italy
| | - Chentao Yang
- BGI-Shenzhen, Shenzhen, China
| | - Jonathan Wood
- Tree of Life, Wellcome Sanger Institute, Cambridge, UK
| | - Alan Tracey
- Tree of Life, Wellcome Sanger Institute, Cambridge, UK
| | - Francoise Thibaud-Nissen
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD USA
| | - Mitchell R. Vollger
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA USA
| | - David Porubsky
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA USA
| | - Haoyu Cheng
- Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA USA
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA USA
| | - Mobin Asri
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, CA USA
| | - Glennis A. Logsdon
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA USA
| | - Paolo Carnevali
- Chan Zuckerberg Initiative, Redwood City, CA USA
| | - Mark J. P. Chaisson
- Quantitative and Computational Biology, University of Southern California, Los Angeles, CA USA
| | | | - Sarah Cody
- McDonnell Genome Institute, Washington University School of Medicine, St. Louis, MO USA
| | - Joanna Collins
- Tree of Life, Wellcome Sanger Institute, Cambridge, UK
| | - Peter Ebert
- Institute for Medical Biometry and Bioinformatics, Medical Faculty, Heinrich Heine University, Düsseldorf, Germany
| | - Merly Escalona
- Department of Biomolecular Engineering, University of California Santa Cruz, Santa Cruz, CA USA
| | - Olivier Fedrigo
- Vertebrate Genome Laboratory, The Rockefeller University, New York, NY USA
| | - Robert S. Fulton
- McDonnell Genome Institute, Washington University School of Medicine, St. Louis, MO USA
| | - Lucinda L. Fulton
- McDonnell Genome Institute, Washington University School of Medicine, St. Louis, MO USA
| | - Shilpa Garg
- Department of Biology, University of Copenhagen, Copenhagen, Denmark
| | - Jennifer L. Gerton
- Stowers Institute for Medical Research, Kansas City, MO USA
| | - Jay Ghurye
- Dovetail Genomics, Scotts Valley, CA USA
| | | | - Richard E. Green
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, CA USA
| | - William Harvey
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA USA
| | - Patrick Hasenfeld
- European Molecular Biology Laboratory, Genome Biology Unit, Heidelberg, Germany
| | - Alex Hastie
- Bionano Genomics, San Diego, CA USA
| | - Marina Haukness
- grid.205975.c0000 0001 0740 6917UC Santa Cruz Genomics Institute, University of California, Santa Cruz, CA USA
| | - Erich B. Jaeger
- grid.185669.50000 0004 0507 3954Illumina, Inc., San Diego, CA USA
| | - Miten Jain
- grid.205975.c0000 0001 0740 6917UC Santa Cruz Genomics Institute, University of California, Santa Cruz, CA USA
| | - Melanie Kirsche
- grid.21107.350000 0001 2171 9311Department of Computer Science, Johns Hopkins University, Baltimore, MD USA
| | - Mikhail Kolmogorov
- grid.266100.30000 0001 2107 4242Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA USA
| | - Jan O. Korbel
- grid.4709.a0000 0004 0495 846XEuropean Molecular Biology Laboratory, Genome Biology Unit, Heidelberg, Germany
| | - Sergey Koren
- grid.94365.3d0000 0001 2297 5165Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD USA
| | - Jonas Korlach
- grid.423340.20000 0004 0640 9878Pacific Biosciences, Menlo Park, CA USA
| | - Joyce Lee
- grid.470262.50000 0004 0473 1353Bionano Genomics, San Diego, CA USA
| | - Daofeng Li
- grid.4367.60000 0001 2355 7002Department of Genetics, Washington University School of Medicine, St. Louis, MO USA ,grid.4367.60000 0001 2355 7002The Edison Family Center for Genome Sciences and Systems Biology, Washington University School of Medicine, St. Louis, MO USA
| | - Tina Lindsay
- grid.4367.60000 0001 2355 7002McDonnell Genome Institute, Washington University School of Medicine, St. Louis, MO USA
| | - Julian Lucas
- grid.205975.c0000 0001 0740 6917UC Santa Cruz Genomics Institute, University of California, Santa Cruz, CA USA
| | - Feng Luo
- grid.26090.3d0000 0001 0665 0280School of Computing, Clemson University, Clemson, SC USA
| | - Tobias Marschall
- grid.411327.20000 0001 2176 9917Institute for Medical Biometry and Bioinformatics, Medical Faculty, Heinrich Heine University, Düsseldorf, Germany
| | - Matthew W. Mitchell
- grid.282012.b0000 0004 0627 5048Coriell Institute for Medical Research, Camden, NJ USA
| | - Jennifer McDaniel
- grid.94225.38000000012158463XMaterial Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD USA
| | - Fan Nie
- grid.216417.70000 0001 0379 7164Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha, China
| | - Hugh E. Olsen
- grid.205975.c0000 0001 0740 6917UC Santa Cruz Genomics Institute, University of California, Santa Cruz, CA USA
| | - Nathan D. Olson
- grid.94225.38000000012158463XMaterial Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD USA
| | - Trevor Pesout
- grid.205975.c0000 0001 0740 6917UC Santa Cruz Genomics Institute, University of California, Santa Cruz, CA USA
| | - Tamara Potapova
- grid.250820.d0000 0000 9420 1591Stowers Institute for Medical Research, Kansas City, MO USA
| | - Daniela Puiu
- grid.21107.350000 0001 2171 9311Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD USA
| | - Allison Regier
- grid.511991.40000 0004 4910 5831DNAnexus, Mountain View, CA USA
| | - Jue Ruan
- grid.410727.70000 0001 0526 1937Agricultural Genomics Institute, Chinese Academy of Agricultural Sciences, Shenzhen, China
| | - Steven L. Salzberg
- grid.21107.350000 0001 2171 9311Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD USA
| | - Ashley D. Sanders
- grid.419491.00000 0001 1014 0849Berlin Institute for Medical Systems Biology, Max Delbrück Center for Molecular Medicine in the Helmholtz Association (MDC), Berlin, Germany
| | - Michael C. Schatz
- grid.21107.350000 0001 2171 9311Department of Computer Science, Johns Hopkins University, Baltimore, MD USA
| | | | - Valerie A. Schneider
- grid.94365.3d0000 0001 2297 5165National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD USA
| | | | - Kishwar Shafin
- grid.205975.c0000 0001 0740 6917UC Santa Cruz Genomics Institute, University of California, Santa Cruz, CA USA
| | - Alaina Shumate
- grid.21107.350000 0001 2171 9311Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD USA
| | - Nathan O. Stitziel
- grid.4367.60000 0001 2355 7002McDonnell Genome Institute, Washington University School of Medicine, St. Louis, MO USA ,grid.4367.60000 0001 2355 7002Department of Genetics, Washington University School of Medicine, St. Louis, MO USA ,grid.4367.60000 0001 2355 7002Cardiovascular Division, John T. Milliken Department of Internal Medicine, Washington University School of Medicine, St. Louis, USA
| | - Catherine Stober
- grid.4709.a0000 0004 0495 846XEuropean Molecular Biology Laboratory, Genome Biology Unit, Heidelberg, Germany
| | - James Torrance
- grid.10306.340000 0004 0606 5382Tree of Life, Wellcome Sanger Institute, Cambridge, UK
| | - Justin Wagner
- grid.94225.38000000012158463XMaterial Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD USA
| | - Jianxin Wang
- grid.216417.70000 0001 0379 7164Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha, China
| | - Aaron Wenger
- grid.423340.20000 0004 0640 9878Pacific Biosciences, Menlo Park, CA USA
| | - Chuanle Xiao
- grid.12981.330000 0001 2360 039XState Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-sen University, Guangzhou, China
| | - Aleksey V. Zimin
- grid.21107.350000 0001 2171 9311Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD USA
| | - Guojie Zhang
- grid.13402.340000 0004 1759 700XCenter for Evolutionary & Organismal Biology, Zhejiang University School of Medicine, Hangzhou, China
| | - Ting Wang
- grid.4367.60000 0001 2355 7002McDonnell Genome Institute, Washington University School of Medicine, St. Louis, MO USA ,grid.4367.60000 0001 2355 7002Department of Genetics, Washington University School of Medicine, St. Louis, MO USA ,grid.4367.60000 0001 2355 7002The Edison Family Center for Genome Sciences and Systems Biology, Washington University School of Medicine, St. Louis, MO USA
| | - Heng Li
- grid.65499.370000 0001 2106 9910Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA USA
| | - Erik Garrison
- grid.267301.10000 0004 0386 9246Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN USA
| | - David Haussler
- grid.413575.10000 0001 2167 1581Howard Hughes Medical Institute, Chevy Chase, MD USA ,grid.205975.c0000 0001 0740 6917Department of Ecology and Evolutionary Biology, University of California Santa Cruz, Santa Cruz, CA USA
| | - Ira Hall
- grid.47100.320000000419368710Yale School of Medicine, New Haven, CT USA
| | - Justin M. Zook
- grid.94225.38000000012158463XMaterial Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD USA
| | - Evan E. Eichler
- grid.413575.10000 0001 2167 1581Howard Hughes Medical Institute, Chevy Chase, MD USA ,grid.34477.330000000122986657Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA USA
| | - Adam M. Phillippy
- grid.94365.3d0000 0001 2297 5165Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD USA
| | - Benedict Paten
- grid.205975.c0000 0001 0740 6917UC Santa Cruz Genomics Institute, University of California, Santa Cruz, CA USA
| | - Kerstin Howe
- grid.10306.340000 0004 0606 5382Tree of Life, Wellcome Sanger Institute, Cambridge, UK
| | - Karen H. Miga
- grid.205975.c0000 0001 0740 6917UC Santa Cruz Genomics Institute, University of California, Santa Cruz, CA USA
|
11
|
Sork VL, Cokus SJ, Fitz-Gibbon ST, Zimin AV, Puiu D, Garcia JA, Gugger PF, Henriquez CL, Zhen Y, Lohmueller KE, Pellegrini M, Salzberg SL. High-quality genome and methylomes illustrate features underlying evolutionary success of oaks. Nat Commun 2022; 13:2047. [PMID: 35440538 PMCID: PMC9018854 DOI: 10.1038/s41467-022-29584-y] [Citation(s) in RCA: 23] [Impact Index Per Article: 11.5] [Received: 04/12/2021] [Accepted: 03/11/2022] [Indexed: 02/01/2023] Open
Abstract
The genus Quercus, which emerged ∼55 million years ago during globally warm temperatures, diversified into ∼450 extant species. We present a high-quality de novo genome assembly of a California endemic oak, Quercus lobata, revealing features consistent with oak evolutionary success. Effective population size remained large throughout history despite declining since early Miocene. Analysis of 39,373 mapped protein-coding genes outlined copious duplications consistent with genetic and phenotypic diversity, both by retention of genes created during the ancient γ whole genome hexaploid duplication event and by tandem duplication within families, including numerous resistance genes and a very large block of duplicated DUF247 genes, which have been found to be associated with self-incompatibility in grasses. An additional surprising finding is that subcontext-specific patterns of DNA methylation associated with transposable elements reveal broadly-distributed heterochromatin in intergenic regions, similar to grasses. Collectively, these features promote genetic and phenotypic variation that would facilitate adaptability to changing environments.
Affiliation(s)
- Victoria L Sork
- Department of Ecology and Evolutionary Biology, University of California, Los Angeles, CA, 90095-1438, USA.
- Institute of the Environment and Sustainability, University of California, Los Angeles, CA, 90095, USA.
| | - Shawn J Cokus
- Department of Molecular, Cell, and Developmental Biology, University of California, Los Angeles, CA, 90095-7239, USA
| | - Sorel T Fitz-Gibbon
- Department of Ecology and Evolutionary Biology, University of California, Los Angeles, CA, 90095-1438, USA
| | - Aleksey V Zimin
- Center for Computational Biology, Whiting School of Engineering, Johns Hopkins University, Baltimore, MD, 21218, USA
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, 21218, USA
| | - Daniela Puiu
- Center for Computational Biology, Whiting School of Engineering, Johns Hopkins University, Baltimore, MD, 21218, USA
| | - Jesse A Garcia
- Department of Ecology and Evolutionary Biology, University of California, Los Angeles, CA, 90095-1438, USA
| | - Paul F Gugger
- Appalachian Laboratory, University of Maryland Center for Environmental Science, Frostburg, MD, 21532, USA
| | - Claudia L Henriquez
- Department of Ecology and Evolutionary Biology, University of California, Los Angeles, CA, 90095-1438, USA
| | - Ying Zhen
- Department of Ecology and Evolutionary Biology, University of California, Los Angeles, CA, 90095-1438, USA
| | - Kirk E Lohmueller
- Department of Ecology and Evolutionary Biology, University of California, Los Angeles, CA, 90095-1438, USA
- Department of Human Genetics, David Geffen School of Medicine, University of California, Los Angeles, CA, 90095, USA
| | - Matteo Pellegrini
- Department of Molecular, Cell, and Developmental Biology, University of California, Los Angeles, CA, 90095-7239, USA
| | - Steven L Salzberg
- Center for Computational Biology, Whiting School of Engineering, Johns Hopkins University, Baltimore, MD, 21218, USA
- Departments of Biomedical Engineering, Computer Science, and Biostatistics, Johns Hopkins University, Baltimore, MD, 21218, USA
|
12
|
Zimin AV, Shumate A, Shinder I, Heinz J, Puiu D, Pertea M, Salzberg SL. A reference-quality, fully annotated genome from a Puerto Rican individual. Genetics 2022; 220:iyab227. [PMID: 34897437 PMCID: PMC9097244 DOI: 10.1093/genetics/iyab227] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 10/05/2021] [Accepted: 11/05/2021] [Indexed: 11/12/2022] Open
Abstract
Until 2019, the human genome was available in only one fully annotated version, GRCh38, which was the result of 18 years of continuous improvement and revision. Despite dramatic improvements in sequencing technology, no other genome was available as an annotated reference until 2019, when the genome of an Ashkenazi individual, Ash1, was released. In this study, we describe the assembly and annotation of a second individual genome, from a Puerto Rican individual whose DNA was collected as part of the Human Pangenome project. The new genome, called PR1, is the first true reference genome created from an individual of African descent. Due to recent improvements in both sequencing and assembly technology, and particularly to the use of the recently completed CHM13 human genome as a guide to assembly, PR1 is more complete and more contiguous than either GRCh38 or Ash1. Annotation revealed 37,755 genes (of which 19,999 are protein coding), including 12 additional gene copies that are present in PR1 and missing from CHM13. Fifty-seven genes have fewer copies in PR1 than in CHM13, 9 map only partially, and 3 genes (all noncoding) from CHM13 are entirely missing from PR1.
Affiliation(s)
- Aleksey V Zimin
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD 21218, USA
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD 21218, USA
| | - Alaina Shumate
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD 21218, USA
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD 21218, USA
| | - Ida Shinder
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD 21218, USA
- Cross Disciplinary Graduate Program in Biomedical Sciences, Johns Hopkins University School of Medicine, Baltimore, MD 21218, USA
| | - Jakob Heinz
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD 21218, USA
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD 21218, USA
| | - Daniela Puiu
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD 21218, USA
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD 21218, USA
| | - Mihaela Pertea
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD 21218, USA
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD 21218, USA
| | - Steven L Salzberg
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD 21218, USA
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD 21218, USA
- Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218, USA
- Department of Biostatistics, Johns Hopkins University, Baltimore, MD 21218, USA
|
13
|
Zimin AV, Salzberg SL. The SAMBA tool uses long reads to improve the contiguity of genome assemblies. PLoS Comput Biol 2022; 18:e1009860. [PMID: 35120119 PMCID: PMC8849508 DOI: 10.1371/journal.pcbi.1009860] [Citation(s) in RCA: 14] [Impact Index Per Article: 7.0] [Received: 10/22/2021] [Revised: 02/16/2022] [Accepted: 01/24/2022] [Indexed: 01/03/2023] Open
Abstract
Third-generation sequencing technologies can generate very long reads with relatively high error rates. The lengths of the reads, which sometimes exceed one million bases, make them invaluable for resolving complex repeats that cannot be assembled using shorter reads. Many high-quality genome assemblies have already been produced, curated, and annotated using the previous generation of sequencing data, and full re-assembly of these genomes with long reads is not always practical or cost-effective. One strategy to upgrade existing assemblies is to generate additional coverage using long-read data, and add that to the previously assembled contigs. SAMBA is a tool that is designed to scaffold and gap-fill existing genome assemblies with additional long-read data, resulting in substantially greater contiguity. SAMBA is the only tool of its kind that also computes and fills in the sequence for all spanned gaps in the scaffolds, yielding much longer contigs. Here we compare SAMBA to several similar tools capable of re-scaffolding assemblies using long-read data, and we show that SAMBA yields better contiguity and introduces fewer errors than competing methods. SAMBA is open-source software that is distributed at https://github.com/alekseyzimin/masurca.
Author summary
The DNA molecule found in almost every cell of a living organism can be represented as a sequence of four different nucleotides, or bases, denoted by the letters A, C, G, and T. Current sequencing technologies require breaking the DNA molecule into short fragments, sequencing them to find the corresponding sequence of letters (producing “reads”), and then assembly, which recovers the DNA sequence from the reads. Repeats in genome sequences typically prevent recovery of the full contiguous genome sequence, because any repeat that is longer than the read length cannot be reliably resolved. Third-generation sequencing technologies can generate very long reads, albeit with relatively high error rates. The lengths of the reads, which sometimes exceed one million bases, make them invaluable for resolving complex repeats that cannot be assembled using previous-generation reads. Many high-quality genome assemblies have already been produced, curated, and annotated using the previous generation of sequencing data, and full re-assembly of these genomes with long reads is not always practical or cost-effective. Here we introduce a tool called SAMBA that is designed to upgrade existing assemblies using additional coverage with long-read data, resulting in substantially greater contiguity. We compare SAMBA to several similar tools capable of re-scaffolding assemblies using long-read data, and we show that SAMBA yields better contiguity and introduces fewer errors than competing methods. SAMBA is open-source software that is distributed at https://github.com/alekseyzimin/masurca.
Affiliation(s)
- Aleksey V. Zimin
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, Maryland, United States of America
- Center for Computational Biology, Johns Hopkins University, Baltimore, Maryland, United States of America
| | - Steven L. Salzberg
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, Maryland, United States of America
- Center for Computational Biology, Johns Hopkins University, Baltimore, Maryland, United States of America
- Department of Computer Science, Johns Hopkins University, Baltimore, Maryland, United States of America
- Department of Biostatistics, Johns Hopkins University, Baltimore, Maryland, United States of America
|
14
|
Neale DB, Zimin AV, Zaman S, Scott AD, Shrestha B, Workman RE, Puiu D, Allen BJ, Moore ZJ, Sekhwal MK, De La Torre AR, McGuire PE, Burns E, Timp W, Wegrzyn JL, Salzberg SL. Assembled and annotated 26.5 Gbp coast redwood genome: a resource for estimating evolutionary adaptive potential and investigating hexaploid origin. G3 (Bethesda) 2022; 12:6460957. [PMID: 35100403 PMCID: PMC8728005 DOI: 10.1093/g3journal/jkab380] [Citation(s) in RCA: 14] [Impact Index Per Article: 7.0] [Received: 09/26/2021] [Accepted: 10/25/2021] [Indexed: 12/15/2022]
Abstract
Sequencing, assembly, and annotation of the 26.5 Gbp hexaploid genome of coast redwood (Sequoia sempervirens) were completed, leading toward discovery of genes related to climate adaptation and investigation of the origin of the hexaploid genome. Deep-coverage short-read Illumina sequencing data from haploid tissue from a single seed were combined with long-read Oxford Nanopore Technologies sequencing data from diploid needle tissue to create an initial assembly, which was then scaffolded using proximity ligation data to produce a highly contiguous final assembly, SESE 2.1, with a scaffold N50 size of 44.9 Mbp. The assembly included several scaffolds that span entire chromosome arms, confirmed by the presence of telomere and centromere sequences on the ends of the scaffolds. The structural annotation produced 118,906 genes, with 113 containing introns that exceed 500 Kbp in length and one reaching 2 Mbp. Nearly 19 Gbp of the genome represented repetitive content with the vast majority characterized as long terminal repeats, with a 2.9:1 ratio of Copia to Gypsy elements that may aid in gene expression control. Comparison of coast redwood to other conifers revealed species-specific expansions for a plethora of abiotic and biotic stress response genes, including those involved in fungal disease resistance, detoxification, and physical injury/structural remodeling and others supporting flavonoid biosynthesis. Analysis of multiple genes that exist in triplicate in coast redwood but only once in its diploid relative, giant sequoia, supports a previous hypothesis that the hexaploidy is the result of autopolyploidy rather than any hybridizations with separate but closely related conifer species.
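The scaffold N50 quoted in this abstract is a standard contiguity statistic: the length L such that sequences of length L or longer together hold at least half of the assembled bases. The sketch below is an illustration of that definition only and is not taken from the paper's pipeline; the function name `n50` and the toy lengths are assumptions made for the example.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths.

    N50 is the largest length L such that sequences of length >= L
    account for at least half of the total assembled bases.
    """
    if not lengths:
        raise ValueError("need at least one sequence length")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down, accumulating bases.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:  # reached half of the assembly
            return length


# Hypothetical toy assembly (lengths in bp), not data from the paper.
toy_scaffolds = [2, 3, 4, 5, 6]
print(n50(toy_scaffolds))
```

Because the statistic depends only on the sorted lengths, joining contigs into longer scaffolds raises N50 without adding any sequence, which is why contig N50 and scaffold N50 are usually reported separately.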
Affiliation(s)
- David B Neale
- Department of Plant Sciences, University of California, Davis, Davis, CA 95616, USA
| | - Aleksey V Zimin
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD 21218, USA.,Center for Computational Biology, Johns Hopkins University, Baltimore, MD 21211, USA
| | - Sumaira Zaman
- Department of Ecology and Evolutionary Biology, University of Connecticut, Storrs, CT 06269, USA.,Department of Computer Science & Engineering, University of Connecticut, Storrs, CT 06269, USA
| | - Alison D Scott
- Department of Plant Sciences, University of California, Davis, Davis, CA 95616, USA
| | - Bikash Shrestha
- Department of Ecology and Evolutionary Biology, University of Connecticut, Storrs, CT 06269, USA
| | - Rachael E Workman
- Department of Molecular Biology and Genetics, Johns Hopkins University, Baltimore, MD 21205, USA
| | - Daniela Puiu
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD 21218, USA.,Center for Computational Biology, Johns Hopkins University, Baltimore, MD 21211, USA
| | - Brian J Allen
- Department of Plant Sciences, University of California, Davis, Davis, CA 95616, USA
| | - Zane J Moore
- Department of Plant Sciences, University of California, Davis, Davis, CA 95616, USA
| | - Manoj K Sekhwal
- School of Forestry, Northern Arizona University, Flagstaff, AZ 86011, USA
| | | | - Patrick E McGuire
- Department of Plant Sciences, University of California, Davis, Davis, CA 95616, USA
| | - Emily Burns
- Save the Redwoods League, San Francisco, CA 94104, USA
| | - Winston Timp
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD 21218, USA.,Center for Computational Biology, Johns Hopkins University, Baltimore, MD 21211, USA.,Department of Molecular Biology and Genetics, Johns Hopkins University, Baltimore, MD 21205, USA
| | - Jill L Wegrzyn
- Department of Ecology and Evolutionary Biology, University of Connecticut, Storrs, CT 06269, USA.,Institute for Systems Genomics, University of Connecticut, Storrs, CT 06269, USA
| | - Steven L Salzberg
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD 21218, USA.,Center for Computational Biology, Johns Hopkins University, Baltimore, MD 21211, USA.,Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218, USA.,Department of Biostatistics, Johns Hopkins University, Baltimore, MD 21205, USA
|
15
|
Polinski JM, Zimin AV, Clark KF, Kohn AB, Sadowski N, Timp W, Ptitsyn A, Khanna P, Romanova DY, Williams P, Greenwood SJ, Moroz LL, Walt DR, Bodnar AG. The American lobster genome reveals insights on longevity, neural, and immune adaptations. Sci Adv 2021; 7:eabe8290. [PMID: 34162536 PMCID: PMC8221624 DOI: 10.1126/sciadv.abe8290] [Citation(s) in RCA: 17] [Impact Index Per Article: 5.7] [Received: 09/22/2020] [Accepted: 05/07/2021] [Indexed: 05/30/2023]
Abstract
The American lobster, Homarus americanus, is integral to marine ecosystems and supports an important commercial fishery. This iconic species also serves as a valuable model for deciphering neural networks controlling rhythmic motor patterns and olfaction. Here, we report a high-quality draft assembly of the H. americanus genome with 25,284 predicted gene models. Analysis of the neural gene complement revealed extraordinary development of the chemosensory machinery, including a profound diversification of ligand-gated ion channels and secretory molecules. The discovery of a novel class of chimeric receptors coupling pattern recognition and neurotransmitter binding suggests a deep integration between the neural and immune systems. A robust repertoire of genes involved in innate immunity, genome stability, cell survival, chemical defense, and cuticle formation represents a diversity of defense mechanisms essential to thrive in the benthic marine environment. Together, these unique evolutionary adaptations contribute to the longevity and ecological success of this long-lived benthic predator.
Affiliation(s)
| | - Aleksey V Zimin
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD 21205, USA
| | - K Fraser Clark
- Department of Animal Science and Aquaculture, Dalhousie University, Truro, Nova Scotia B2N 5E3, Canada
| | - Andrea B Kohn
- The Whitney Laboratory for Marine Bioscience and Department of Neuroscience, University of Florida, Gainesville and St. Augustine, FL 32080-8623, USA
| | - Norah Sadowski
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD 21205, USA
| | - Winston Timp
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD 21205, USA
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD 21205, USA
| | - Andrey Ptitsyn
- Gloucester Marine Genomics Institute, Gloucester, MA 01930, USA
| | - Prarthana Khanna
- Genetics Program, Tufts University School of Medicine, Boston, MA 02111, USA
| | - Daria Y Romanova
- Institute of Higher Nervous Activity and Neurophysiology of RAS, Moscow 117485, Russia
| | - Peter Williams
- The Whitney Laboratory for Marine Bioscience and Department of Neuroscience, University of Florida, Gainesville and St. Augustine, FL 32080-8623, USA
| | - Spencer J Greenwood
- Department of Biomedical Sciences, Atlantic Veterinary College, University of Prince Edward Island, Charlottetown, Prince Edward Island C1A 4P3, Canada
| | - Leonid L Moroz
- The Whitney Laboratory for Marine Bioscience and Department of Neuroscience, University of Florida, Gainesville and St. Augustine, FL 32080-8623, USA
| | - David R Walt
- Gloucester Marine Genomics Institute, Gloucester, MA 01930, USA
- Department of Pathology, Brigham and Women's Hospital, Harvard Medical School, Wyss Institute for Biologically Inspired Engineering at Harvard University, Boston, MA 02115, USA
| | - Andrea G Bodnar
- Gloucester Marine Genomics Institute, Gloucester, MA 01930, USA.
|
16
|
Alonge M, Shumate A, Puiu D, Zimin AV, Salzberg SL. Chromosome-Scale Assembly of the Bread Wheat Genome Reveals Thousands of Additional Gene Copies. Genetics 2020; 216:599-608. [PMID: 32796007 PMCID: PMC7536849 DOI: 10.1534/genetics.120.303501] [Citation(s) in RCA: 22] [Impact Index Per Article: 5.5] [Received: 07/04/2020] [Accepted: 08/10/2020] [Indexed: 11/18/2022] Open
Abstract
Bread wheat (Triticum aestivum) is a major food crop and an important plant system for agricultural genetics research. However, due to the complexity and size of its allohexaploid genome, genomic resources are limited compared to other major crops. The IWGSC recently published a reference genome and associated annotation (IWGSC CS v1.0, Chinese Spring) that has been widely adopted and utilized by the wheat community. Although this reference assembly represents all three wheat subgenomes at chromosome-scale, it was derived from short reads, and thus is missing a substantial portion of the expected 16 Gbp of genomic sequence. We earlier published an independent wheat assembly (Triticum_aestivum_3.1, Chinese Spring) that came much closer in length to the expected genome size, although it was only a contig-level assembly lacking gene annotations. Here, we describe a reference-guided effort to scaffold those contigs into chromosome-length pseudomolecules, add in any missing sequence that was unique to the IWGSC CS v1.0 assembly, and annotate the resulting pseudomolecules with genes. Our updated assembly, Triticum_aestivum_4.0, contains 15.07 Gbp of nongap sequence anchored to chromosomes, which is 1.2 Gbp more than the previous reference assembly. It includes 108,639 genes unambiguously localized to chromosomes, including over 2000 genes that were previously unplaced. We also discovered >5700 additional gene copies, facilitating the accurate annotation of functional gene duplications including at the Ppd-B1 photoperiod response locus.
Affiliation(s)
- Michael Alonge
- Department of Computer Science, Johns Hopkins University, Baltimore, Maryland 21218
| | - Alaina Shumate
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, Maryland 21218
- Center for Computational Biology, Whiting School of Engineering, Johns Hopkins University, Baltimore, Maryland 21211
| | - Daniela Puiu
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, Maryland 21218
- Center for Computational Biology, Whiting School of Engineering, Johns Hopkins University, Baltimore, Maryland 21211
| | - Aleksey V Zimin
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, Maryland 21218
- Center for Computational Biology, Whiting School of Engineering, Johns Hopkins University, Baltimore, Maryland 21211
| | - Steven L Salzberg
- Department of Computer Science, Johns Hopkins University, Baltimore, Maryland 21218
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, Maryland 21218
- Center for Computational Biology, Whiting School of Engineering, Johns Hopkins University, Baltimore, Maryland 21211
- Department of Biostatistics, Bloomberg School of Public Health, Johns Hopkins University, Baltimore, Maryland 21205
|
17
|
Zimin AV, Salzberg SL. The genome polishing tool POLCA makes fast and accurate corrections in genome assemblies. PLoS Comput Biol 2020; 16:e1007981. [PMID: 32589667 PMCID: PMC7347232 DOI: 10.1371/journal.pcbi.1007981] [Citation(s) in RCA: 121] [Impact Index Per Article: 30.3] [Received: 01/21/2020] [Revised: 07/09/2020] [Accepted: 05/25/2020] [Indexed: 11/18/2022]
Abstract
The introduction of third-generation DNA sequencing technologies in recent years has allowed scientists to generate dramatically longer sequence reads, which when used in whole-genome sequencing projects have yielded better repeat resolution and far more contiguous genome assemblies. While the promise of better contiguity has held true, the relatively high error rate of long reads, averaging 8–15%, has made it challenging to generate a highly accurate final sequence. Current long-read sequencing technologies display a tendency toward systematic errors, in particular in homopolymer regions, which present additional challenges. A cost-effective strategy to generate highly contiguous assemblies with a very low overall error rate is to combine long reads with low-cost short-read data, which currently have an error rate below 0.5%. This hybrid strategy can be pursued either by incorporating the short-read data into the early phase of assembly, during the read correction step, or by using short reads to “polish” the consensus built from long reads. In this report, we present the assembly polishing tool POLCA (POLishing by Calling Alternatives) and compare its performance with two other popular polishing programs, Pilon and Racon. We show that on simulated data POLCA is more accurate than Pilon, and comparable in accuracy to Racon. On real data, all three programs show similar performance, but POLCA is consistently much faster than either of the other polishing programs.
Affiliation(s)
- Aleksey V. Zimin
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, Maryland, United States of America
- Center for Computational Biology, Johns Hopkins University, Baltimore, Maryland, United States of America
| | - Steven L. Salzberg
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, Maryland, United States of America
- Center for Computational Biology, Johns Hopkins University, Baltimore, Maryland, United States of America
- Department of Computer Science, Whiting School of Engineering, Johns Hopkins University, Baltimore, Maryland, United States of America
- Department of Biostatistics, Bloomberg School of Public Health, Johns Hopkins University, Baltimore, Maryland, United States of America
|
18
|
Shumate A, Zimin AV, Sherman RM, Puiu D, Wagner JM, Olson ND, Pertea M, Salit ML, Zook JM, Salzberg SL. Assembly and annotation of an Ashkenazi human reference genome. Genome Biol 2020; 21:129. [PMID: 32487205 PMCID: PMC7265644 DOI: 10.1186/s13059-020-02047-7] [Citation(s) in RCA: 25] [Impact Index Per Article: 6.3] [Received: 04/06/2020] [Accepted: 05/15/2020] [Indexed: 01/23/2023]
Abstract
BACKGROUND Thousands of experiments and studies use the human reference genome as a resource each year. This single reference genome, GRCh38, is a mosaic created from a small number of individuals, representing a very small sample of the human population. There is a need for reference genomes from multiple human populations to avoid potential biases. RESULTS Here, we describe the assembly and annotation of the genome of an Ashkenazi individual and the creation of a new, population-specific human reference genome. This genome is more contiguous and more complete than GRCh38, the latest version of the human reference genome, and is annotated with highly similar gene content. The Ashkenazi reference genome, Ash1, contains 2,973,118,650 nucleotides as compared to 2,937,639,212 in GRCh38. Annotation identified 20,157 protein-coding genes, of which 19,563 are > 99% identical to their counterparts on GRCh38. Most of the remaining genes have small differences. Forty of the protein-coding genes in GRCh38 are missing from Ash1; however, all of these genes are members of multi-gene families for which Ash1 contains other copies. Eleven genes appear on different chromosomes from their homologs in GRCh38. Alignment of DNA sequences from an unrelated Ashkenazi individual to Ash1 identified ~ 1 million fewer homozygous SNPs than alignment of those same sequences to the more-distant GRCh38 genome, illustrating one of the benefits of population-specific reference genomes. CONCLUSIONS The Ash1 genome is presented as a reference for any genetic studies involving Ashkenazi Jewish individuals.
Affiliation(s)
- Alaina Shumate
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD, USA
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
| | - Aleksey V Zimin
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD, USA
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
| | - Rachel M Sherman
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD, USA
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
| | - Daniela Puiu
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD, USA
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
| | - Justin M Wagner
- National Institute of Standards and Technology, Gaithersburg, MD, USA
| | - Nathan D Olson
- National Institute of Standards and Technology, Gaithersburg, MD, USA
| | - Mihaela Pertea
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD, USA
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
| | - Marc L Salit
- Joint Initiative for Metrology in Biology, Stanford University, Stanford, CA, USA
| | - Justin M Zook
- National Institute of Standards and Technology, Gaithersburg, MD, USA
| | - Steven L Salzberg
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD, USA
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
- Department of Biostatistics, Johns Hopkins University, Baltimore, MD, USA
|
19
|
Marrano A, Britton M, Zaini PA, Zimin AV, Workman RE, Puiu D, Bianco L, Pierro EAD, Allen BJ, Chakraborty S, Troggio M, Leslie CA, Timp W, Dandekar A, Salzberg SL, Neale DB. High-quality chromosome-scale assembly of the walnut (Juglans regia L.) reference genome. Gigascience 2020; 9:giaa050. [PMID: 32432329 PMCID: PMC7238675 DOI: 10.1093/gigascience/giaa050] [Citation(s) in RCA: 59] [Impact Index Per Article: 14.8] [Received: 10/19/2019] [Revised: 03/13/2020] [Accepted: 04/20/2020] [Indexed: 12/29/2022]
Abstract
BACKGROUND The release of the first reference genome of walnut (Juglans regia L.) enabled many achievements in the characterization of walnut genetic and functional variation. However, it is highly fragmented, preventing the integration of genetic, transcriptomic, and proteomic information to fully elucidate walnut biological processes. FINDINGS Here, we report the new chromosome-scale assembly of the walnut reference genome (Chandler v2.0) obtained by combining Oxford Nanopore long-read sequencing with chromosome conformation capture (Hi-C) technology. Relative to the previous reference genome, the new assembly features an 84.4-fold increase in N50 size, with the 16 chromosomal pseudomolecules assembled and representing 95% of its total length. Using full-length transcripts from single-molecule real-time sequencing, we predicted 37,554 gene models, with a mean gene length higher than the previous gene annotations. Most of the new protein-coding genes (90%) present both start and stop codons, which represents a significant improvement compared with Chandler v1.0 (only 48%). We then tested the potential impact of the new chromosome-level genome on different areas of walnut research. By studying the proteome changes occurring during male flower development, we observed that the virtual proteome obtained from Chandler v2.0 presents fewer artifacts than the previous reference genome, enabling the identification of a new potential pollen allergen in walnut. Also, the new chromosome-scale genome facilitates in-depth studies of intraspecies genetic diversity by revealing previously undetected autozygous regions in Chandler, likely resulting from inbreeding, and 195 genomic regions highly differentiated between Western and Eastern walnut cultivars. CONCLUSION Overall, Chandler v2.0 will serve as a valuable resource to better understand and explore walnut biology.
Affiliation(s)
- Annarita Marrano
- Department of Plant Sciences, University of California, Davis, One Shields Avenue, Davis, CA 95616, USA
| | - Monica Britton
- Bioinformatics Core Facility, Genome Center, University of California, One Shields Avenue, Davis, CA 95616, USA
| | - Paulo A Zaini
- Department of Plant Sciences, University of California, Davis, One Shields Avenue, Davis, CA 95616, USA
| | - Aleksey V Zimin
- Department of Biomedical Engineering, Johns Hopkins University, 720 Rutland Avenue, Baltimore, MD 21205, USA
- Center for Computational Biology, Whiting School of Engineering, Johns Hopkins University, 3100 Wyman Park Dr., Baltimore, MD 21211, USA
| | - Rachael E Workman
- Department of Biomedical Engineering, Johns Hopkins University, 720 Rutland Avenue, Baltimore, MD 21205, USA
| | - Daniela Puiu
- Center for Computational Biology, Whiting School of Engineering, Johns Hopkins University, 3100 Wyman Park Dr., Baltimore, MD 21211, USA
| | - Luca Bianco
- Research and Innovation Center, Fondazione Edmund Mach, Via E. Mach, 1 38010 S. Michele all'Adige (TN) 38010, Italy
| | - Erica Adele Di Pierro
- Research and Innovation Center, Fondazione Edmund Mach, Via E. Mach, 1 38010 S. Michele all'Adige (TN) 38010, Italy
| | - Brian J Allen
- Department of Plant Sciences, University of California, Davis, One Shields Avenue, Davis, CA 95616, USA
| | - Sandeep Chakraborty
- Department of Plant Sciences, University of California, Davis, One Shields Avenue, Davis, CA 95616, USA
| | - Michela Troggio
- Research and Innovation Center, Fondazione Edmund Mach, Via E. Mach, 1 38010 S. Michele all'Adige (TN) 38010, Italy
| | - Charles A Leslie
- Department of Plant Sciences, University of California, Davis, One Shields Avenue, Davis, CA 95616, USA
| | - Winston Timp
- Department of Biomedical Engineering, Johns Hopkins University, 720 Rutland Avenue, Baltimore, MD 21205, USA
- Center for Computational Biology, Whiting School of Engineering, Johns Hopkins University, 3100 Wyman Park Dr., Baltimore, MD 21211, USA
| | - Abhaya Dandekar
- Department of Plant Sciences, University of California, Davis, One Shields Avenue, Davis, CA 95616, USA
| | - Steven L Salzberg
- Department of Biomedical Engineering, Johns Hopkins University, 720 Rutland Avenue, Baltimore, MD 21205, USA
- Center for Computational Biology, Whiting School of Engineering, Johns Hopkins University, 3100 Wyman Park Dr., Baltimore, MD 21211, USA
- Departments of Computer Science and Biostatistics, Johns Hopkins University, 3400 North Charles Street Baltimore, MD 21218, USA
| | - David B Neale
- Department of Plant Sciences, University of California, Davis, One Shields Avenue, Davis, CA 95616, USA
|
20
|
Giordano R, Donthu RK, Zimin AV, Julca Chavez IC, Gabaldon T, van Munster M, Hon L, Hall R, Badger JH, Nguyen M, Flores A, Potter B, Giray T, Soto-Adames FN, Weber E, Marcelino JAP, Fields CJ, Voegtlin DJ, Hill CB, Hartman GL. Soybean aphid biotype 1 genome: Insights into the invasive biology and adaptive evolution of a major agricultural pest. Insect Biochem Mol Biol 2020; 120:103334. [PMID: 32109587 DOI: 10.1016/j.ibmb.2020.103334] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 11/22/2019] [Revised: 01/07/2020] [Accepted: 02/10/2020] [Indexed: 05/12/2023]
Abstract
The soybean aphid, Aphis glycines Matsumura (Hemiptera: Aphididae) is a serious pest of the soybean plant, Glycine max, a major world-wide agricultural crop. We assembled a de novo genome sequence of Ap. glycines Biotype 1, from a culture established shortly after this species invaded North America. 20.4% of the Ap. glycines proteome is duplicated. These in-paralogs are enriched with Gene Ontology (GO) categories mostly related to apoptosis, a possible adaptation to plant chemistry and other environmental stressors. Approximately one-third of these genes show parallel duplication in other aphids. But Ap. gossypii, its closest related species, has the lowest number of these duplicated genes. An Illumina GoldenGate assay of 2380 SNPs was used to determine the world-wide population structure of Ap. glycines. China and South Korean aphids are the closest to those in North America. China is the likely origin of other Asian aphid populations. The most distantly related aphids to those in North America are from Australia. The diversity of Ap. glycines in North America has decreased over time since its arrival. The genetic diversity of Ap. glycines North American population sampled shortly after its first detection in 2001 up to 2012 does not appear to correlate with geography. However, aphids collected on soybean Rag experimental varieties in Minnesota (MN), Iowa (IA), and Wisconsin (WI), closer to high density Rhamnus cathartica stands, appear to have higher capacity to colonize resistant soybean plants than aphids sampled in Ohio (OH), North Dakota (ND), and South Dakota (SD). Samples from the former states have SNP alleles with high FST values and frequencies, that overlap with genes involved in iron metabolism, a crucial metabolic pathway that may be affected by the Rag-associated soybean plant response. The Ap. glycines Biotype 1 genome will provide needed information for future analyses of mechanisms of aphid virulence and pesticide resistance as well as facilitate comparative analyses between aphids with differing natural history and host plant range.
Affiliation(s)
- Rosanna Giordano
- Puerto Rico Science, Technology and Research Trust, San Juan, PR, USA; Know Your Bee, Inc. San Juan, PR, USA.
| | - Ravi Kiran Donthu
- Puerto Rico Science, Technology and Research Trust, San Juan, PR, USA; Know Your Bee, Inc. San Juan, PR, USA.
| | - Aleksey V Zimin
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
| | - Irene Consuelo Julca Chavez
- Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona, Spain; Barcelona Supercomputing Centre (BSC-CNS), Barcelona, Spain; Institute for Research in Biomedicine, Barcelona, Spain
| | - Toni Gabaldon
- Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona, Spain; Barcelona Supercomputing Centre (BSC-CNS), Barcelona, Spain; Institute for Research in Biomedicine, Barcelona, Spain; Institució Catalana de Recerca i Estudis Avançats (ICREA), Barcelona, Spain
| | - Manuella van Munster
- CIRAD-INRA-Montpellier SupAgro, TA A54/K, Campus International de Baillarguet, Montpellier, France
| | | | | | - Jonathan H Badger
- Cancer and Inflammation Program, Center for Cancer Research, National Cancer Institute, National Institute of Health, DHHS, Bethesda, MD, USA
| | - Minh Nguyen
- Department of Medicine, Columbia University Irving Medical Center, New York, NY, USA
| | - Alejandra Flores
- College of Liberal Arts and Sciences, School of Molecular and Cellular Biology, University of Illinois, Urbana, IL, USA
| | - Bruce Potter
- University of Minnesota, Southwest Research and Outreach Center, Lamberton, MN, USA
| | - Tugrul Giray
- Department of Biology, University of Puerto Rico, San Juan, PR, USA
| | - Felipe N Soto-Adames
- Florida Department of Agriculture and Consumer Services, Division of Plant Industry, Entomology, Gainesville, FL, USA
| | | | - Jose A P Marcelino
- Puerto Rico Science, Technology and Research Trust, San Juan, PR, USA; Know Your Bee, Inc. San Juan, PR, USA; Department of Entomology and Nematology, University of Florida, Gainesville, FL, USA
| | - Christopher J Fields
- HPCBio, Roy J. Carver Biotechnology Center, University of Illinois, Urbana, IL, USA
| | - David J Voegtlin
- Illinois Natural History Survey, University of Illinois, Urbana, IL, USA
| | | | - Glen L Hartman
- USDA-ARS and Department of Crop Sciences, University of Illinois, Urbana, IL, USA
|
21
|
Marrano A, Britton M, Zaini PA, Zimin AV, Workman RE, Puiu D, Bianco L, Pierro EAD, Allen BJ, Chakraborty S, Troggio M, Leslie CA, Timp W, Dandekar A, Salzberg SL, Neale DB. High-quality chromosome-scale assembly of the walnut (Juglans regia L.) reference genome. Gigascience 2020. [PMID: 32432329 DOI: 10.1101/80979] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/17/2023]
|
22
|
Read AC, Moscou MJ, Zimin AV, Pertea G, Meyer RS, Purugganan MD, Leach JE, Triplett LR, Salzberg SL, Bogdanove AJ. Genome assembly and characterization of a complex zfBED-NLR gene-containing disease resistance locus in Carolina Gold Select rice with Nanopore sequencing. PLoS Genet 2020; 16:e1008571. [PMID: 31986137 PMCID: PMC7004385 DOI: 10.1371/journal.pgen.1008571] [Citation(s) in RCA: 72] [Impact Index Per Article: 18.0] [Received: 07/05/2019] [Revised: 02/06/2020] [Accepted: 12/16/2019] [Indexed: 12/26/2022]
Abstract
Long-read sequencing facilitates assembly of complex genomic regions. In plants, loci containing nucleotide-binding, leucine-rich repeat (NLR) disease resistance genes are an important example of such regions. NLR genes constitute one of the largest gene families in plants and are often clustered, evolving via duplication, contraction, and transposition. We recently mapped the Xo1 locus for resistance to bacterial blight and bacterial leaf streak, found in the American heirloom rice variety Carolina Gold Select, to a region that in the Nipponbare reference genome is NLR gene-rich. Here, toward identification of the Xo1 gene, we combined Nanopore and Illumina reads and generated a high-quality Carolina Gold Select genome assembly. We identified 529 complete or partial NLR genes and discovered, relative to Nipponbare, an expansion of NLR genes at the Xo1 locus. One of these has high sequence similarity to the cloned, functionally similar Xa1 gene. Both harbor an integrated zfBED domain, and the repeats within each protein are nearly perfect. Across diverse Oryzeae, we identified two sub-clades of NLR genes with these features, varying in the presence of the zfBED domain and the number of repeats. The Carolina Gold Select genome assembly also uncovered at the Xo1 locus a rice blast resistance gene and a gene encoding a polyphenol oxidase (PPO). PPO activity has been used as a marker for blast resistance at the locus in some varieties; however, the Carolina Gold Select sequence revealed a loss-of-function mutation in the PPO gene that breaks this association. Our results demonstrate that whole genome sequencing combining Nanopore and Illumina reads effectively resolves NLR gene loci. Our identification of an Xo1 candidate is an important step toward mechanistic characterization, including the role(s) of the zfBED domain. Finally, the Carolina Gold Select genome assembly will facilitate identification of other useful traits in this historically important variety. 
Plants lack adaptive immunity, and instead contain repeat-rich, disease resistance genes that evolve rapidly through duplication, recombination, and transposition. The number, variation, and often clustered arrangement of these genes make them challenging to sequence and catalog. The US heirloom rice variety Carolina Gold Select has resistance to two important bacterial diseases. Toward identifying the responsible gene(s), we combined long- and short-read sequencing technologies to assemble the whole genome and identify the resistance gene repertoire. We previously narrowed the location of the gene(s) to a region on chromosome four. The region in Carolina Gold Select is larger than in the rice reference genome (Nipponbare) and contains twice as many resistance genes. One shares unusual features with a known bacterial disease resistance gene, suggesting that it confers the resistance. Across diverse varieties and related species, we identified two widely-distributed groups of such genes. The results are an important step toward mechanistic characterization and deployment of the bacterial disease resistance. The genome assembly also identified a resistance gene for a fungal disease and predicted a marker phenotype used in breeding for resistance. Thus, the Carolina Gold Select genome assembly can be expected to aid in the identification and deployment of other valuable traits.
Affiliation(s)
- Andrew C. Read
- Plant Pathology and Plant Microbe Biology Section, School of Integrative Plant Science, Cornell University, Ithaca, NY, United States of America
| | - Matthew J. Moscou
- The Sainsbury Laboratory, University of East Anglia, Norwich, United Kingdom
| | - Aleksey V. Zimin
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD, United States of America
| | - Geo Pertea
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD, United States of America
| | - Rachel S. Meyer
- Center for Genomics and Systems Biology, New York University, New York, NY, United States of America
| | - Michael D. Purugganan
- Center for Genomics and Systems Biology, New York University, New York, NY, United States of America
- Center for Genomics and Biology, New York University Abu Dhabi, Saadiyat Island, Abu Dhabi, United Arab Emirates
| | - Jan E. Leach
- Department of Bioagricultural Sciences and Pest Management, Colorado State University, Fort Collins, CO, United States of America
| | - Lindsay R. Triplett
- Department of Bioagricultural Sciences and Pest Management, Colorado State University, Fort Collins, CO, United States of America
| | - Steven L. Salzberg
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD, United States of America
- Departments of Biomedical Engineering, Computer Science, and Biostatistics, Johns Hopkins University, Baltimore, MD, United States of America
| | - Adam J. Bogdanove
- Plant Pathology and Plant Microbe Biology Section, School of Integrative Plant Science, Cornell University, Ithaca, NY, United States of America
|
23
|
Kovaka S, Zimin AV, Pertea GM, Razaghi R, Salzberg SL, Pertea M. Transcriptome assembly from long-read RNA-seq alignments with StringTie2. Genome Biol 2019; 20:278. [PMID: 31842956 PMCID: PMC6912988 DOI: 10.1186/s13059-019-1910-1] [Citation(s) in RCA: 656] [Impact Index Per Article: 131.2] [Received: 09/10/2019] [Accepted: 12/02/2019] [Indexed: 11/13/2022]
Abstract
RNA sequencing using the latest single-molecule sequencing instruments produces reads that are thousands of nucleotides long. The ability to assemble these long reads can greatly improve the sensitivity of long-read analyses. Here we present StringTie2, a reference-guided transcriptome assembler that works with both short and long reads. StringTie2 includes new methods to handle the high error rate of long reads and offers the ability to work with full-length super-reads assembled from short reads, which further improves the quality of short-read assemblies. StringTie2 is more accurate and faster and uses less memory than all comparable short-read and long-read analysis tools.
Affiliation(s)
- Sam Kovaka
- Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218 USA
- Center for Computational Biology, Whiting School of Engineering, Johns Hopkins University, Baltimore, MD 21205 USA
| | - Aleksey V. Zimin
- Center for Computational Biology, Whiting School of Engineering, Johns Hopkins University, Baltimore, MD 21205 USA
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD 21218 USA
| | - Geo M. Pertea
- Center for Computational Biology, Whiting School of Engineering, Johns Hopkins University, Baltimore, MD 21205 USA
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD 21218 USA
| | - Roham Razaghi
- Center for Computational Biology, Whiting School of Engineering, Johns Hopkins University, Baltimore, MD 21205 USA
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD 21218 USA
| | - Steven L. Salzberg
- Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218 USA
- Center for Computational Biology, Whiting School of Engineering, Johns Hopkins University, Baltimore, MD 21205 USA
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD 21218 USA
- Department of Biostatistics, Bloomberg School of Public Health, Johns Hopkins University, Baltimore, MD 21205 USA
| | - Mihaela Pertea
- Center for Computational Biology, Whiting School of Engineering, Johns Hopkins University, Baltimore, MD 21205 USA
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD 21218 USA
| |
Collapse
|
24
|
Breitwieser FP, Pertea M, Zimin AV, Salzberg SL. Human contamination in bacterial genomes has created thousands of spurious proteins. Genome Res 2019; 29:954-960. [PMID: 31064768 PMCID: PMC6581058 DOI: 10.1101/gr.245373.118] [Citation(s) in RCA: 70] [Impact Index Per Article: 14.0] [Received: 10/24/2018] [Accepted: 05/03/2019] [Indexed: 01/22/2023]
Abstract
Contaminant sequences that appear in published genomes can cause numerous problems for downstream analyses, particularly for evolutionary studies and metagenomics projects. Our large-scale scan of complete and draft bacterial and archaeal genomes in the NCBI RefSeq database reveals that 2250 genomes are contaminated by human sequence. The contaminant sequences derive primarily from high-copy human repeat regions, which themselves are not adequately represented in the current human reference genome, GRCh38. The absence of the sequences from the human assembly offers a likely explanation for their presence in bacterial assemblies. In some cases, the contaminating contigs have been erroneously annotated as containing protein-coding sequences, which over time have propagated to create spurious protein “families” across multiple prokaryotic and eukaryotic genomes. As a result, 3437 spurious protein entries are currently present in the widely used nr and TrEMBL protein databases. We report here an extensive list of contaminant sequences in bacterial genome assemblies and the proteins associated with them. We found that nearly all contaminants occurred in small contigs in draft genomes, which suggests that filtering out small contigs from draft genome assemblies may mitigate the issue of contamination while still keeping nearly all of the genuine genomic sequences.
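The mitigation suggested at the end of this abstract, dropping small contigs from a draft assembly, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the toy FASTA text, and the 10-base cutoff are hypothetical, and the abstract does not say which length threshold the authors used.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (name, sequence) tuples."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def drop_small_contigs(records, min_length):
    """Keep only contigs whose sequence is at least min_length bases long."""
    return [(name, seq) for name, seq in records if len(seq) >= min_length]


# Toy draft assembly (hypothetical data): contigs of 16, 3 and 12 bases.
toy_fasta = """>contig1
ACGTACGTAC
GTACGT
>contig2
ACG
>contig3
TTTTGGGGCCCC
"""

contigs = parse_fasta(toy_fasta)
kept = drop_small_contigs(contigs, min_length=10)  # hypothetical cutoff

total_bases = sum(len(seq) for _, seq in contigs)
kept_bases = sum(len(seq) for _, seq in kept)
print([name for name, _ in kept])
print(f"retained {kept_bases} of {total_bases} bases")
```

In this toy case the filter removes one of three contigs but only 3 of 31 bases, which mirrors the abstract's point that small contigs hold little genuine sequence.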
Affiliation(s)
- Florian P Breitwieser
  - Center for Computational Biology, McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins School of Medicine, Baltimore, Maryland 21205, USA
- Mihaela Pertea
  - Center for Computational Biology, McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins School of Medicine, Baltimore, Maryland 21205, USA
  - Department of Computer Science, Whiting School of Engineering, Johns Hopkins University, Baltimore, Maryland 21218, USA
- Aleksey V Zimin
  - Center for Computational Biology, McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins School of Medicine, Baltimore, Maryland 21205, USA
  - Department of Biomedical Engineering, Johns Hopkins University, Baltimore, Maryland 21218, USA
- Steven L Salzberg
  - Center for Computational Biology, McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins School of Medicine, Baltimore, Maryland 21205, USA
  - Department of Computer Science, Whiting School of Engineering, Johns Hopkins University, Baltimore, Maryland 21218, USA
  - Department of Biomedical Engineering, Johns Hopkins University, Baltimore, Maryland 21218, USA
  - Department of Biostatistics, Bloomberg School of Public Health, Johns Hopkins University, Baltimore, Maryland 21205, USA
|
25
|
Zimin AV, Puiu D, Hall R, Kingan S, Clavijo BJ, Salzberg SL. The first near-complete assembly of the hexaploid bread wheat genome, Triticum aestivum. Gigascience 2018; 6:1-7. [PMID: 29069494 PMCID: PMC5691383 DOI: 10.1093/gigascience/gix097] [Citation(s) in RCA: 167] [Impact Index Per Article: 27.8] [Received: 07/05/2017] [Accepted: 09/28/2017] [Indexed: 01/17/2023]
Abstract
Common bread wheat, Triticum aestivum, has one of the most complex genomes known to science, with 6 copies of each chromosome, enormous numbers of near-identical sequences scattered throughout, and an overall haploid size of more than 15 billion bases. Multiple past attempts to assemble the genome have produced assemblies that were well short of the estimated genome size. Here we report the first near-complete assembly of T. aestivum, using deep sequencing coverage from a combination of short Illumina reads and very long Pacific Biosciences reads. The final assembly contains 15 344 693 583 bases and has a weighted average (N50) contig size of 232 659 bases. This represents by far the most complete and contiguous assembly of the wheat genome to date, providing a strong foundation for future genetic studies of this important food crop. We also report how we used the recently published genome of Aegilops tauschii, the diploid ancestor of the wheat D genome, to identify 4 179 762 575 bp of T. aestivum that correspond to its D genome components.
Affiliation(s)
- Aleksey V Zimin
  - Center for Computational Biology, McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, MD 21205, USA
  - Institute for Physical Sciences and Technology, University of Maryland, College Park, MD 20742, USA
- Daniela Puiu
  - Center for Computational Biology, McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, MD 21205, USA
- Richard Hall
  - Pacific Biosciences, 1305 O'Brien Dr, Menlo Park, CA 94025, USA
- Sarah Kingan
  - Pacific Biosciences, 1305 O'Brien Dr, Menlo Park, CA 94025, USA
- Bernardo J Clavijo
  - Earlham Institute, Norwich Research Park Innovation Centre, Colney Ln, Norwich NR4 7UZ, UK
- Steven L Salzberg
  - Center for Computational Biology, McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, MD 21205, USA
  - Departments of Biomedical Engineering, Computer Science, and Biostatistics, Johns Hopkins University, Baltimore, MD 21205, USA
|
26
|
Zimin AV, Stevens KA, Crepeau MW, Puiu D, Wegrzyn JL, Yorke JA, Langley CH, Neale DB, Salzberg SL. Erratum to: An improved assembly of the loblolly pine mega-genome using long-read single-molecule sequencing. Gigascience 2017; 6:1. [PMID: 29020755 PMCID: PMC5632297 DOI: 10.1093/gigascience/gix072] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Indexed: 11/29/2022]
|
27
|
Zimin AV, Stevens KA, Crepeau MW, Puiu D, Wegrzyn JL, Yorke JA, Langley CH, Neale DB, Salzberg SL. An improved assembly of the loblolly pine mega-genome using long-read single-molecule sequencing. Gigascience 2017; 6:1-4. [PMID: 28369353 PMCID: PMC5437942 DOI: 10.1093/gigascience/giw016] [Citation(s) in RCA: 50] [Impact Index Per Article: 7.1] [Received: 10/04/2016] [Accepted: 12/21/2016] [Indexed: 11/30/2022]
Abstract
The 22-gigabase genome of loblolly pine (Pinus taeda) is one of the largest ever sequenced. The draft assembly published in 2014 was built entirely from short Illumina reads, with lengths ranging from 100 to 250 base pairs (bp). The assembly was quite fragmented, containing over 11 million contigs whose weighted average (N50) size was 8206 bp. To improve this result, we generated approximately 12-fold coverage in long reads using the Single Molecule Real Time sequencing technology developed at Pacific Biosciences. We assembled the long and short reads together using the MaSuRCA mega-reads assembly algorithm, which produced a substantially better assembly, P. taeda version 2.0. The new assembly has an N50 contig size of 25 361, more than three times as large as achieved in the original assembly, and an N50 scaffold size of 107 821, 61% larger than the previous assembly.
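The "weighted average (N50)" statistic quoted in this abstract can be made concrete with a short Python sketch. It uses the common definition of N50 (the length L such that contigs of length L or longer hold at least half of the assembled bases), which the abstract does not spell out; the function name and the toy lengths are hypothetical. The last lines check the "more than three times as large" claim from the two contig N50 values given in the abstract.

```python
def n50(lengths):
    """Return the N50 of a collection of contig (or scaffold) lengths."""
    if not lengths:
        raise ValueError("need at least one length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    # Walk from the longest contig down until half of all bases are covered.
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Toy assembly (hypothetical lengths): total 300, so half is 150.
toy_lengths = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_lengths))  # 70, because 80 + 70 = 150 reaches half the total

# Contig N50 values quoted in the abstract: 8206 bp (2014 draft), 25 361 bp (v2.0).
fold_change = 25361 / 8206
print(f"contig N50 improvement: {fold_change:.2f}x")  # about 3.09x
```

Because N50 weights each contig by its length, a few long contigs can raise it sharply even when many short contigs remain, which is why it is reported alongside total assembly size.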
Affiliation(s)
- Aleksey V Zimin
  - Institute for Physical Sciences and Technology, University of Maryland, College Park, MD
  - Center for Computational Biology, McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins School of Medicine, Baltimore, MD
- Kristian A Stevens
  - Department of Evolution and Ecology, University of California, Davis, CA
- Marc W Crepeau
  - Department of Evolution and Ecology, University of California, Davis, CA
- Daniela Puiu
  - Center for Computational Biology, McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins School of Medicine, Baltimore, MD
- Jill L Wegrzyn
  - Department of Ecology and Evolutionary Biology, University of Connecticut, Storrs, CT
- James A Yorke
  - Institute for Physical Sciences and Technology, University of Maryland, College Park, MD
- Charles H Langley
  - Department of Evolution and Ecology, University of California, Davis, CA
- David B Neale
  - Department of Plant Sciences, University of California, Davis, CA
- Steven L Salzberg
  - Center for Computational Biology, McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins School of Medicine, Baltimore, MD
  - Departments of Biomedical Engineering, Computer Science, and Biostatistics, Johns Hopkins University, Baltimore, MD
|
29
|
Zimin AV, Puiu D, Luo MC, Zhu T, Koren S, Marçais G, Yorke JA, Dvořák J, Salzberg SL. Hybrid assembly of the large and highly repetitive genome of Aegilops tauschii, a progenitor of bread wheat, with the MaSuRCA mega-reads algorithm. Genome Res 2017; 27:787-792. [PMID: 28130360 PMCID: PMC5411773 DOI: 10.1101/gr.213405.116] [Citation(s) in RCA: 240] [Impact Index Per Article: 34.3] [Received: 07/26/2016] [Accepted: 01/18/2017] [Indexed: 01/12/2023]
Abstract
Long sequencing reads generated by single-molecule sequencing technology offer the possibility of dramatically improving the contiguity of genome assemblies. The biggest challenge today is that long reads have relatively high error rates, currently around 15%. The high error rates make it difficult to use this data alone, particularly with highly repetitive plant genomes. Errors in the raw data can lead to insertion or deletion errors (indels) in the consensus genome sequence, which in turn create significant problems for downstream analysis; for example, a single indel may shift the reading frame and incorrectly truncate a protein sequence. Here, we describe an algorithm that solves the high error rate problem by combining long, high-error reads with shorter but much more accurate Illumina sequencing reads, whose error rates average <1%. Our hybrid assembly algorithm combines these two types of reads to construct mega-reads, which are both long and accurate, and then assembles the mega-reads using the CABOG assembler, which was designed for long reads. We apply this technique to a large data set of Illumina and PacBio sequences from the species Aegilops tauschii, a large and extremely repetitive plant genome that has resisted previous attempts at assembly. We show that the resulting assembled contigs are far larger than in any previous assembly, with an N50 contig size of 486,807 nucleotides. We compare the contigs to independently produced optical maps to evaluate their large-scale accuracy, and to a set of high-quality bacterial artificial chromosome (BAC)-based assemblies to evaluate base-level accuracy.
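The abstract's remark that a single indel may shift the reading frame and truncate a protein can be illustrated with a small Python sketch. It is a toy demonstration, not part of the MaSuRCA method: the 21-base sequence and the function name are hypothetical, and the codon table is the standard genetic code written in TCAG order.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order ("*" marks a stop codon).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a DNA string in frame 0, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


# Hypothetical coding sequence: ATG GCT GAA CTG ACC AAA TGA
reference = "ATGGCTGAACTGACCAAATGA"
# Simulate a one-base deletion (the G at index 3), as a consensus indel would cause.
with_deletion = reference[:3] + reference[4:]

print(translate(reference))      # MAELTK
print(translate(with_deletion))  # MLN, cut short by a stop codon in the shifted frame
```

Every codon downstream of the deletion is read in a different frame, so the protein changes after the first residue and ends early at a stop codon created by the shift.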
Affiliation(s)
- Aleksey V Zimin
  - Center for Computational Biology, McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins School of Medicine, Baltimore, Maryland 21205, USA
  - Institute for Physical Sciences and Technology, University of Maryland, College Park, Maryland 20742, USA
- Daniela Puiu
  - Center for Computational Biology, McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins School of Medicine, Baltimore, Maryland 21205, USA
- Ming-Cheng Luo
  - Department of Plant Sciences, University of California, Davis, California 95616, USA
- Tingting Zhu
  - Department of Plant Sciences, University of California, Davis, California 95616, USA
- Sergey Koren
  - National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland 20892, USA
- Guillaume Marçais
  - Institute for Physical Sciences and Technology, University of Maryland, College Park, Maryland 20742, USA
  - Department of Computational Biology, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213, USA
- James A Yorke
  - Institute for Physical Sciences and Technology, University of Maryland, College Park, Maryland 20742, USA
  - Departments of Mathematics and Physics, University of Maryland, College Park, Maryland 20742, USA
- Jan Dvořák
  - Department of Plant Sciences, University of California, Davis, California 95616, USA
- Steven L Salzberg
  - Center for Computational Biology, McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins School of Medicine, Baltimore, Maryland 21205, USA
  - Departments of Biomedical Engineering, Computer Science, and Biostatistics, Johns Hopkins University, Baltimore, Maryland 21218, USA
|
30
|
Zimin AV, Cornish AS, Maudhoo MD, Gibbs RM, Zhang X, Pandey S, Meehan DT, Wipfler K, Bosinger SE, Johnson ZP, Tharp GK, Marçais G, Roberts M, Ferguson B, Fox HS, Treangen T, Salzberg SL, Yorke JA, Norgren RB. A new rhesus macaque assembly and annotation for next-generation sequencing analyses. Biol Direct 2014; 9:20. [PMID: 25319552 PMCID: PMC4214606 DOI: 10.1186/1745-6150-9-20] [Citation(s) in RCA: 136] [Impact Index Per Article: 13.6] [Received: 07/18/2014] [Accepted: 10/03/2014] [Indexed: 12/13/2022]
Abstract
Background: The rhesus macaque (Macaca mulatta) is a key species for advancing biomedical research. Like all draft mammalian genomes, the draft rhesus assembly (rheMac2) has gaps, sequencing errors and misassemblies that have prevented automated annotation pipelines from functioning correctly. Another rhesus macaque assembly, CR_1.0, is also available but is substantially more fragmented than rheMac2 with smaller contigs and scaffolds. Annotations for these two assemblies are limited in completeness and accuracy. High quality assembly and annotation files are required for a wide range of studies including expression, genetic and evolutionary analyses.
Results: We report a new de novo assembly of the rhesus macaque genome (MacaM) that incorporates both the original Sanger sequences used to assemble rheMac2 and new Illumina sequences from the same animal. MacaM has a weighted average (N50) contig size of 64 kilobases, more than twice the size of the rheMac2 assembly and almost five times the size of the CR_1.0 assembly. The MacaM chromosome assembly incorporates information from previously unutilized mapping data and preliminary annotation of scaffolds. Independent assessment of the assemblies using Ion Torrent read alignments indicates that MacaM is more complete and accurate than rheMac2 and CR_1.0. We assembled messenger RNA sequences from several rhesus tissues into transcripts which allowed us to identify a total of 11,712 complete proteins representing 9,524 distinct genes. Using a combination of our assembled rhesus macaque transcripts and human transcripts, we annotated 18,757 transcripts and 16,050 genes with complete coding sequences in the MacaM assembly. Further, we demonstrate that the new annotations provide greatly improved accuracy as compared to the current annotations of rheMac2. Finally, we show that the MacaM genome provides an accurate resource for alignment of reads produced by RNA sequence expression studies.
Conclusions: The MacaM assembly and annotation files provide a substantially more complete and accurate representation of the rhesus macaque genome than rheMac2 or CR_1.0 and will serve as an important resource for investigators conducting next-generation sequencing studies with nonhuman primates.
Reviewers: This article was reviewed by Dr. Lutz Walter, Dr. Soojin Yi and Dr. Kateryna Makova.
Affiliation(s)
- Robert B Norgren
  - Department of Genetics, Cell Biology and Anatomy, University of Nebraska Medical Center, Omaha, Nebraska 68198, USA.
|
31
|
Neale DB, Wegrzyn JL, Stevens KA, Zimin AV, Puiu D, Crepeau MW, Cardeno C, Koriabine M, Holtz-Morris AE, Liechty JD, Martínez-García PJ, Vasquez-Gross HA, Lin BY, Zieve JJ, Dougherty WM, Fuentes-Soriano S, Wu LS, Gilbert D, Marçais G, Roberts M, Holt C, Yandell M, Davis JM, Smith KE, Dean JFD, Lorenz WW, Whetten RW, Sederoff R, Wheeler N, McGuire PE, Main D, Loopstra CA, Mockaitis K, deJong PJ, Yorke JA, Salzberg SL, Langley CH. Decoding the massive genome of loblolly pine using haploid DNA and novel assembly strategies. Genome Biol 2014; 15:R59. [PMID: 24647006 PMCID: PMC4053751 DOI: 10.1186/gb-2014-15-3-r59] [Citation(s) in RCA: 274] [Impact Index Per Article: 27.4] [Received: 02/11/2014] [Accepted: 03/04/2014] [Indexed: 11/30/2022]
Abstract
Background: The size and complexity of conifer genomes has, until now, prevented full genome sequencing and assembly. The large research community and economic importance of loblolly pine, Pinus taeda L., made it an early candidate for reference sequence determination.
Results: We develop a novel strategy to sequence the genome of loblolly pine that combines unique aspects of pine reproductive biology and genome assembly methodology. We use a whole genome shotgun approach relying primarily on next generation sequence generated from a single haploid seed megagametophyte from a loblolly pine tree, 20-1010, that has been used in industrial forest tree breeding. The resulting sequence and assembly was used to generate a draft genome spanning 23.2 Gbp and containing 20.1 Gbp with an N50 scaffold size of 66.9 kbp, making it a significant improvement over available conifer genomes. The long scaffold lengths allow the annotation of 50,172 gene models with intron lengths averaging over 2.7 kbp and sometimes exceeding 100 kbp in length. Analysis of orthologous gene sets identifies gene families that may be unique to conifers. We further characterize and expand the existing repeat library based on the de novo analysis of the repetitive content, estimated to encompass 82% of the genome.
Conclusions: In addition to its value as a resource for researchers and breeders, the loblolly pine genome sequence and assembly reported here demonstrates a novel approach to sequencing the large and complex genomes of this important group of plants that can now be widely applied.
|
32
|
Dalloul RA, Zimin AV, Settlage RE, Kim S, Reed KM. Next-generation sequencing strategies for characterizing the turkey genome. Poult Sci 2014; 93:479-84. [DOI: 10.3382/ps.2013-03560] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Indexed: 01/19/2023]
|
33
|
Abstract
MOTIVATION: Second-generation sequencing technologies produce high coverage of the genome by short reads at a low cost, which has prompted development of new assembly methods. In particular, multiple algorithms based on de Bruijn graphs have been shown to be effective for the assembly problem. In this article, we describe a new hybrid approach that has the computational efficiency of de Bruijn graph methods and the flexibility of overlap-based assembly strategies, and which allows variable read lengths while tolerating a significant level of sequencing error. Our method transforms large numbers of paired-end reads into a much smaller number of longer 'super-reads'. The use of super-reads allows us to assemble combinations of Illumina reads of differing lengths together with longer reads from 454 and Sanger sequencing technologies, making it one of the few assemblers capable of handling such mixtures. We call our system the Maryland Super-Read Celera Assembler (abbreviated MaSuRCA and pronounced 'mazurka').
RESULTS: We evaluate the performance of MaSuRCA against two of the most widely used assemblers for Illumina data, Allpaths-LG and SOAPdenovo2, on two datasets from organisms for which high-quality assemblies are available: the bacterium Rhodobacter sphaeroides and chromosome 16 of the mouse genome. We show that MaSuRCA performs on par or better than Allpaths-LG and significantly better than SOAPdenovo on these data, when evaluated against the finished sequence. We then show that MaSuRCA can significantly improve its assemblies when the original data are augmented with long reads.
AVAILABILITY: MaSuRCA is available as open-source code at ftp://ftp.genome.umd.edu/pub/MaSuRCA/. Previous (pre-publication) releases have been publicly available for over a year.
CONTACT: alekseyz@ipst.umd.edu.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Aleksey V Zimin
  - Institute for Physical Sciences and Technology, University of Maryland, College Park, MD 20742, USA, Center for Computational Biology, McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, MD 21205, USA, Department of Mathematics and Department of Physics, University of Maryland, College Park, MD 20742, USA
|
34
|
Zimin AV, Kelley DR, Roberts M, Marçais G, Salzberg SL, Yorke JA. Mis-assembled "segmental duplications" in two versions of the Bos taurus genome. PLoS One 2012; 7:e42680. [PMID: 22880081 PMCID: PMC3411808 DOI: 10.1371/journal.pone.0042680] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.6] [Received: 04/01/2011] [Accepted: 07/11/2012] [Indexed: 01/06/2023]
Abstract
We analyzed the whole genome sequence coverage in two versions of the Bos taurus genome and identified all regions longer than five kilobases (Kbp) that are duplicated within chromosomes with >99% sequence fidelity in both copies. We call these regions High Fidelity Duplications (HFDs). The two assemblies were Btau 4.2, produced by the Human Genome Sequencing Center at Baylor College of Medicine, and UMD Bos taurus 3.1 (UMD 3.1), produced by our group at the University of Maryland. We found that Btau 4.2 has a far greater number of HFDs, 3111 versus only 69 in UMD 3.1. Read coverage analysis shows that 39 million base pairs (Mbp) of sequence in HFDs in Btau 4.2 appear to be a result of a mis-assembly and therefore cannot be qualified as segmental duplications. UMD 3.1 has only 0.41 Mbp of sequence in HFDs that are due to a mis-assembly.
Affiliation(s)
- Aleksey V Zimin
  - Institute for Physical Science and Technology, University of Maryland, College Park, Maryland, United States of America
|
35
|
Dalloul RA, Long JA, Zimin AV, Aslam L, Beal K, Ann Blomberg L, Bouffard P, Burt DW, Crasta O, Crooijmans RPMA, Cooper K, Coulombe RA, De S, Delany ME, Dodgson JB, Dong JJ, Evans C, Frederickson KM, Flicek P, Florea L, Folkerts O, Groenen MAM, Harkins TT, Herrero J, Hoffmann S, Megens HJ, Jiang A, de Jong P, Kaiser P, Kim H, Kim KW, Kim S, Langenberger D, Lee MK, Lee T, Mane S, Marcais G, Marz M, McElroy AP, Modise T, Nefedov M, Notredame C, Paton IR, Payne WS, Pertea G, Prickett D, Puiu D, Qioa D, Raineri E, Ruffier M, Salzberg SL, Schatz MC, Scheuring C, Schmidt CJ, Schroeder S, Searle SMJ, Smith EJ, Smith J, Sonstegard TS, Stadler PF, Tafer H, Tu ZJ, Van Tassell CP, Vilella AJ, Williams KP, Yorke JA, Zhang L, Zhang HB, Zhang X, Zhang Y, Reed KM. Multi-platform next-generation sequencing of the domestic turkey (Meleagris gallopavo): genome assembly and analysis. PLoS Biol 2010; 8:e1000475. [PMID: 20838655 PMCID: PMC2935454 DOI: 10.1371/journal.pbio.1000475] [Citation(s) in RCA: 320] [Impact Index Per Article: 22.9] [Received: 12/21/2009] [Accepted: 07/27/2010] [Indexed: 12/11/2022]
Abstract
A synergistic combination of two next-generation sequencing platforms with a detailed comparative BAC physical contig map provided a cost-effective assembly of the genome sequence of the domestic turkey (Meleagris gallopavo). Heterozygosity of the sequenced source genome allowed discovery of more than 600,000 high quality single nucleotide variants. Despite this heterozygosity, the current genome assembly (∼1.1 Gb) includes 917 Mb of sequence assigned to specific turkey chromosomes. Annotation identified nearly 16,000 genes, with 15,093 recognized as protein coding and 611 as non-coding RNA genes. Comparative analysis of the turkey, chicken, and zebra finch genomes, and comparing avian to mammalian species, supports the characteristic stability of avian genomes and identifies genes unique to the avian lineage. Clear differences are seen in number and variety of genes of the avian immune system where expansions and novel genes are less frequent than examples of gene loss. The turkey genome sequence provides resources to further understand the evolution of vertebrate genomes and genetic variation underlying economically important quantitative traits in poultry. This integrated approach may be a model for providing both gene and chromosome level assemblies of other species with agricultural, ecological, and evolutionary interest.
Collapse
Affiliation(s)
- Rami A. Dalloul
- Avian Immunobiology Laboratory, Department of Animal and Poultry Sciences, Virginia Tech, Blacksburg, Virginia, United States of America
- Julie A. Long
- Animal Biosciences and Biotechnology Laboratory, USDA Agricultural Research Service, Beltsville, Maryland, United States of America
- Aleksey V. Zimin
- Institute for Physical Science and Technology, University of Maryland, College Park, Maryland, United States of America
- Luqman Aslam
- Animal Breeding and Genomics Centre, Wageningen University, Wageningen, the Netherlands
- Kathryn Beal
- European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, United Kingdom
- Le Ann Blomberg
- Animal Biosciences and Biotechnology Laboratory, USDA Agricultural Research Service, Beltsville, Maryland, United States of America
- Pascal Bouffard
- Roche Applied Science, Indianapolis, Indiana, United States of America
- David W. Burt
- The Roslin Institute and Royal (Dick) School of Veterinary Studies, University of Edinburgh, Roslin, Midlothian, United Kingdom
- Oswald Crasta
- Virginia Bioinformatics Institute, Virginia Tech, Blacksburg, Virginia, United States of America
- Chromatin Inc., Champaign, Illinois, United States of America
- Kristal Cooper
- Virginia Bioinformatics Institute, Virginia Tech, Blacksburg, Virginia, United States of America
- Roger A. Coulombe
- Department of Veterinary Sciences, Utah State University, Logan, Utah, United States of America
- Supriyo De
- Gene Expression and Genomics Unit, National Institute on Aging, National Institutes of Health, Baltimore, Maryland, United States of America
- Mary E. Delany
- Department of Animal Science, University of California, Davis, California, United States of America
- Jerry B. Dodgson
- Department of Microbiology and Molecular Genetics, Michigan State University, East Lansing, Michigan, United States of America
- Jennifer J. Dong
- Department of Soil and Crop Sciences, Texas A&M University, College Station, Texas, United States of America
- Clive Evans
- Virginia Bioinformatics Institute, Virginia Tech, Blacksburg, Virginia, United States of America
- Paul Flicek
- European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, United Kingdom
- Liliana Florea
- Center for Bioinformatics and Computational Biology, University of Maryland, College Park, Maryland, United States of America
- Otto Folkerts
- Virginia Bioinformatics Institute, Virginia Tech, Blacksburg, Virginia, United States of America
- Chromatin Inc., Champaign, Illinois, United States of America
- Martien A. M. Groenen
- Animal Breeding and Genomics Centre, Wageningen University, Wageningen, the Netherlands
- Tim T. Harkins
- Roche Applied Science, Indianapolis, Indiana, United States of America
- Javier Herrero
- European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, United Kingdom
- Steve Hoffmann
- Department of Computer Science and Interdisciplinary Center for Bioinformatics, University of Leipzig, Leipzig, Germany
- LIFE Project, University of Leipzig, Leipzig, Germany
- Hendrik-Jan Megens
- Animal Breeding and Genomics Centre, Wageningen University, Wageningen, the Netherlands
- Andrew Jiang
- Department of Animal Science, University of California, Davis, California, United States of America
- Pieter de Jong
- Children's Hospital and Research Center at Oakland, Oakland, California, United States of America
- Pete Kaiser
- Institute for Animal Health, Compton, Berkshire, United Kingdom
- Heebal Kim
- Laboratory of Bioinformatics and Population Genetics, Department of Agricultural Biotechnology, Seoul National University, Seoul, Korea
- Kyu-Won Kim
- Laboratory of Bioinformatics and Population Genetics, Department of Agricultural Biotechnology, Seoul National University, Seoul, Korea
- Sungwon Kim
- Avian Immunobiology Laboratory, Department of Animal and Poultry Sciences, Virginia Tech, Blacksburg, Virginia, United States of America
- David Langenberger
- Department of Computer Science and Interdisciplinary Center for Bioinformatics, University of Leipzig, Leipzig, Germany
- Mi-Kyung Lee
- Department of Soil and Crop Sciences, Texas A&M University, College Station, Texas, United States of America
- Taeheon Lee
- Laboratory of Bioinformatics and Population Genetics, Department of Agricultural Biotechnology, Seoul National University, Seoul, Korea
- Shrinivasrao Mane
- Virginia Bioinformatics Institute, Virginia Tech, Blacksburg, Virginia, United States of America
- Guillaume Marcais
- Institute for Physical Science and Technology, University of Maryland, College Park, Maryland, United States of America
- Manja Marz
- Department of Computer Science and Interdisciplinary Center for Bioinformatics, University of Leipzig, Leipzig, Germany
- Philipps-Universität Marburg, Pharmazeutische Chemie, Marburg, Germany
- Audrey P. McElroy
- Avian Immunobiology Laboratory, Department of Animal and Poultry Sciences, Virginia Tech, Blacksburg, Virginia, United States of America
- Thero Modise
- Virginia Bioinformatics Institute, Virginia Tech, Blacksburg, Virginia, United States of America
- Mikhail Nefedov
- Children's Hospital and Research Center at Oakland, Oakland, California, United States of America
- Cédric Notredame
- Comparative Bioinformatics, Centre for Genomic Regulation (CRG), Universitat Pompeus Fabre, Barcelona, Spain
- Ian R. Paton
- The Roslin Institute and Royal (Dick) School of Veterinary Studies, University of Edinburgh, Roslin, Midlothian, United Kingdom
- William S. Payne
- Department of Microbiology and Molecular Genetics, Michigan State University, East Lansing, Michigan, United States of America
- Geo Pertea
- Center for Bioinformatics and Computational Biology, University of Maryland, College Park, Maryland, United States of America
- Dennis Prickett
- Institute for Animal Health, Compton, Berkshire, United Kingdom
- Daniela Puiu
- Center for Bioinformatics and Computational Biology, University of Maryland, College Park, Maryland, United States of America
- Dan Qioa
- Department of Computer Science, Virginia Tech, Blacksburg, Virginia, United States of America
- Emanuele Raineri
- Comparative Bioinformatics, Centre for Genomic Regulation (CRG), Universitat Pompeus Fabre, Barcelona, Spain
- Magali Ruffier
- Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, United Kingdom
- Steven L. Salzberg
- Center for Bioinformatics and Computational Biology, Department of Computer Science, University of Maryland, College Park, Maryland, United States of America
- Michael C. Schatz
- Center for Bioinformatics and Computational Biology, Department of Computer Science, University of Maryland, College Park, Maryland, United States of America
- Chantel Scheuring
- Department of Soil and Crop Sciences, Texas A&M University, College Station, Texas, United States of America
- Carl J. Schmidt
- Department of Animal and Food Sciences, University of Delaware, Newark, Delaware, United States of America
- Steven Schroeder
- Bovine Functional Genomics Laboratory, USDA Agricultural Research Service, Beltsville Agricultural Research Center, Beltsville, Maryland, United States of America
- Stephen M. J. Searle
- Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, United Kingdom
- Edward J. Smith
- Avian Immunobiology Laboratory, Department of Animal and Poultry Sciences, Virginia Tech, Blacksburg, Virginia, United States of America
- Jacqueline Smith
- The Roslin Institute and Royal (Dick) School of Veterinary Studies, University of Edinburgh, Roslin, Midlothian, United Kingdom
- Tad S. Sonstegard
- Bovine Functional Genomics Laboratory, USDA Agricultural Research Service, Beltsville Agricultural Research Center, Beltsville, Maryland, United States of America
- Peter F. Stadler
- Department of Computer Science and Interdisciplinary Center for Bioinformatics, University of Leipzig, Leipzig, Germany
- Max Planck Institute for Mathematics in the Sciences, Leipzig, Germany
- Fraunhofer Institut für Zelltherapie und Immunologie, Leipzig, Germany
- Department of Theoretical Chemistry University of Vienna, Vienna, Austria
- Santa Fe Institute, Santa Fe, New Mexico, United States of America
- Hakim Tafer
- Department of Computer Science and Interdisciplinary Center for Bioinformatics, University of Leipzig, Leipzig, Germany
- Department of Theoretical Chemistry University of Vienna, Vienna, Austria
- Zhijian (Jake) Tu
- Department of Biochemistry, Virginia Tech, Blacksburg, Virginia, United States of America
- Curtis P. Van Tassell
- Bovine Functional Genomics Laboratory, USDA Agricultural Research Service, Beltsville Agricultural Research Center, Beltsville, Maryland, United States of America
- Animal Improvement Programs Laboratory, USDA Agricultural Research Service, Beltsville Agricultural Research Center, Beltsville, Maryland, United States of America
- Albert J. Vilella
- European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, United Kingdom
- Kelly P. Williams
- Virginia Bioinformatics Institute, Virginia Tech, Blacksburg, Virginia, United States of America
- James A. Yorke
- Institute for Physical Science and Technology, University of Maryland, College Park, Maryland, United States of America
- Liqing Zhang
- Department of Computer Science, Virginia Tech, Blacksburg, Virginia, United States of America
- Hong-Bin Zhang
- Department of Soil and Crop Sciences, Texas A&M University, College Station, Texas, United States of America
- Xiaojun Zhang
- Department of Soil and Crop Sciences, Texas A&M University, College Station, Texas, United States of America
- Yang Zhang
- Department of Soil and Crop Sciences, Texas A&M University, College Station, Texas, United States of America
- Kent M. Reed
- Department of Veterinary and Biomedical Sciences, College of Veterinary Medicine, University of Minnesota, St. Paul, Minnesota, United States of America
|
36
|
Zimin AV, Delcher AL, Florea L, Kelley DR, Schatz MC, Puiu D, Hanrahan F, Pertea G, Van Tassell CP, Sonstegard TS, Marçais G, Roberts M, Subramanian P, Yorke JA, Salzberg SL. A whole-genome assembly of the domestic cow, Bos taurus. Genome Biol 2009; 10:R42. [PMID: 19393038 PMCID: PMC2688933 DOI: 10.1186/gb-2009-10-4-r42] [Citation(s) in RCA: 827] [Impact Index Per Article: 55.1] [Received: 01/07/2009] [Revised: 02/06/2009] [Accepted: 04/24/2009] [Indexed: 12/02/2022]
Abstract
A cow whole-genome assembly of 2.86 billion base pairs that closes gaps and corrects previously described inversions and deletions as well as describing a portion of the Y chromosome. Background: The genome of the domestic cow, Bos taurus, was sequenced using a mixture of hierarchical and whole-genome shotgun sequencing methods. Results: We have assembled the 35 million sequence reads and applied a variety of assembly improvement techniques, creating an assembly of 2.86 billion base pairs that has multiple improvements over previous assemblies: it is more complete, covering more of the genome; thousands of gaps have been closed; many erroneous inversions, deletions, and translocations have been corrected; and thousands of single-nucleotide errors have been corrected. Our evaluation using independent metrics demonstrates that the resulting assembly is substantially more accurate and complete than alternative versions. Conclusions: By using independent mapping data and conserved synteny between the cow and human genomes, we were able to construct an assembly with excellent large-scale contiguity in which a large majority (approximately 91%) of the genome has been placed onto the 30 B. taurus chromosomes. We constructed a new cow-human synteny map that expands upon previous maps. We also identified for the first time a portion of the B. taurus Y chromosome.
Affiliation(s)
- Aleksey V Zimin
- Institute for Physical Science and Technology, University of Maryland, College Park, Maryland 20742, USA
|
37
|
Roberts M, Zimin AV, Hayes W, Hunt BR, Ustun C, White JR, Havlak P, Yorke J. Improving Phrap-based assembly of the rat using "reliable" overlaps. PLoS One 2008; 3:e1836. [PMID: 18350171 PMCID: PMC2266800 DOI: 10.1371/journal.pone.0001836] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Received: 10/15/2007] [Accepted: 02/09/2008] [Indexed: 12/02/2022]
Abstract
The assembly methods used for whole-genome shotgun (WGS) data have a major impact on the quality of resulting draft genomes. We present a novel algorithm to generate a set of “reliable” overlaps based on identifying repeat k-mers. To demonstrate the benefits of using reliable overlaps, we have created a version of the Phrap assembly program that uses only overlaps from a specific list. We call this version PhrapUMD. Integrating PhrapUMD and our “reliable-overlap” algorithm with the Baylor College of Medicine assembler, Atlas, we assemble the BACs from the Rattus norvegicus genome project. Starting with the same data as the Nov. 2002 Atlas assembly, we compare our results and the Atlas assembly to the 4.3 Mb of rat sequence in the 21 BACs that have been finished. Our version of the draft assembly of the 21 BACs increases the coverage of finished sequence from 93.4% to 96.3%, while simultaneously reducing the base error rate from 4.5 to 1.1 errors per 10,000 bases. There are a number of ways of assessing the relative merits of assemblies when the finished sequence is available. If one views the overall quality of an assembly as proportional to the inverse of the product of the error rate and sequence missed, then the assembly presented here is seven times better. The UMD Overlapper with options for reliable overlaps is available from the authors at http://www.genome.umd.edu. We also provide the changes to the Phrap source code enabling it to use only the reliable overlaps.
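The "seven times better" figure can be checked from the numbers quoted in the abstract. The sketch below is only an illustration of that arithmetic, not code from the paper: the function name `assembly_quality` is hypothetical, and it assumes "sequence missed" means one minus the reported coverage of finished sequence.

```python
def assembly_quality(error_rate, fraction_missed):
    """Hypothetical helper: quality as the inverse of (error rate x sequence missed)."""
    return 1.0 / (error_rate * fraction_missed)


# Figures quoted in the abstract for the 21 finished BACs.
# Assumption: "sequence missed" = 1 - coverage of finished sequence.
atlas = assembly_quality(error_rate=4.5 / 10_000, fraction_missed=1 - 0.934)
umd = assembly_quality(error_rate=1.1 / 10_000, fraction_missed=1 - 0.963)

# Ratio of the two quality scores; the abstract rounds this to "seven times".
improvement = umd / atlas
print(f"relative quality: {improvement:.1f}x")
```

Under these assumptions the ratio is (4.5 × 6.6) / (1.1 × 3.7), which is consistent with the abstract's rounded claim.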
Affiliation(s)
- Michael Roberts
- Institute for Physical Science and Technology, University of Maryland, College Park, Maryland, United States of America
- Aleksey V. Zimin
- Institute for Physical Science and Technology, University of Maryland, College Park, Maryland, United States of America
- Wayne Hayes
- Institute for Physical Science and Technology, University of Maryland, College Park, Maryland, United States of America
- Brian R. Hunt
- Institute for Physical Science and Technology, University of Maryland, College Park, Maryland, United States of America
- Cevat Ustun
- Institute for Physical Science and Technology, University of Maryland, College Park, Maryland, United States of America
- James R. White
- Institute for Physical Science and Technology, University of Maryland, College Park, Maryland, United States of America
- Paul Havlak
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas, United States of America
- James Yorke
- Institute for Physical Science and Technology, University of Maryland, College Park, Maryland, United States of America
|
38
|
Abstract
MOTIVATION: Many genomes are sequenced by a collaboration of several centers, and then each center produces an assembly using its own assembly software. The collaborators then pick the draft assembly that they judge to be the best, and the information contained in the other assemblies is usually not used. METHODS: We have developed a technique that we call assembly reconciliation that can merge draft genome assemblies. It takes one draft assembly, detects apparent errors, and, when possible, patches the problem areas using pieces from alternative draft assemblies. It also closes gaps in places where one of the alternative assemblies has spanned the gap correctly. RESULTS: Using the assembly reconciliation technique, we produced reconciled assemblies of six Drosophila species in collaboration with Agencourt Bioscience and The J. Craig Venter Institute. These assemblies are now the official (CAF1) assemblies used for analysis. We also produced a reconciled assembly of the Rhesus Macaque genome, and this assembly is available from our website http://www.genome.umd.edu. AVAILABILITY: The reconciliation software is available for download from http://www.genome.umd.edu/software.htm
Affiliation(s)
- Aleksey V Zimin
- IPST, University of Maryland, College Park, Agencourt Bioscience Inc., Beverly, MA.
|
39
|
Abstract
Dynamical systems with chaos on an invariant submanifold can exhibit a type of behavior called bubbling, whereby a small random or fixed perturbation to the system induces intermittent bursting. The bifurcation to bubbling occurs when a periodic orbit embedded in the chaotic attractor in the invariant manifold becomes unstable to perturbations transverse to the invariant manifold. Generically the periodic orbit can become transversely unstable through a pitchfork, transcritical, period-doubling, or Hopf bifurcation. In this paper a unified treatment of the four types of bubbling bifurcation is presented. Conditions are obtained determining whether the transition to bubbling is soft or hard; that is, whether the maximum burst amplitude varies continuously or discontinuously with variation of the parameter through its critical value. For soft bubbling transitions, the scaling of the maximum burst amplitude with the parameter is derived. For both hard and soft transitions the scaling of the average interburst time with the bifurcation parameter is deduced. Both random (noise) and fixed (mismatch) perturbations are considered. Results of numerical experiments testing our theoretical predictions are presented.
Affiliation(s)
- Aleksey V Zimin
- Department of Physics, Box 240, Physics Building, University of Maryland, College Park, Maryland 20742, USA.
|
40
|
Chukina EA, Lapshin VP, Kliukvin II, Okhotskiĭ VP, Zvezdina MV, Larionov KS, Zimin AV. [Millimeter wavelength electromagnetic irradiation in the complex treatment of patients with extensive bite wounds]. Vopr Kurortol Fizioter Lech Fiz Kult 2001:45-7. [PMID: 11785340] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/23/2023]
Abstract
Patients with extensive bite wounds given conventional treatment and exposed to millimetric electromagnetic waves (MEW) were compared by therapeutic benefit. MEW treatment early after the trauma enhances regeneration in the lesion, raising the therapeutic efficiency of the treatment, because exposure to MEW in the hydration phase provides fast relief of soft tissue edema, lessens intoxication, and stimulates adaptation.
|