1
Gao Y, Yang L, Kuhn K, Li W, Zanton G, Bowman M, Zhao P, Zhou Y, Fang L, Cole JB, Rosen BD, Ma L, Li C, Baldwin RL, Van Tassell CP, Zhang Z, Smith TPL, Liu GE. Long read and preliminary pangenome analyses reveal breed-specific structural variations and novel sequences in Holstein and Jersey cattle. J Adv Res 2025:S2090-1232(25)00258-9. [PMID: 40258473 DOI: 10.1016/j.jare.2025.04.014] [Received: 10/13/2024] [Revised: 04/06/2025] [Accepted: 04/10/2025] [Indexed: 04/23/2025]
Abstract
INTRODUCTION: Most structural variant (SV) studies in livestock rely on short-read sequencing, whose limited read length makes it difficult to characterize large genomic variants accurately. OBJECTIVES: Our goal is to reveal structural variation and novel sequences specific to Holstein and Jersey cattle breeds using long-read and pan-genome analyses. METHODS: We sequenced 20 Holstein and 8 Jersey cattle using PacBio HiFi to 20× and integrated five read-based SV callers and one assembly-based SV caller to determine SVs. RESULTS: We assembled the 28 genomes, which averaged 3.25 Gb with a contig N50 of 69.36 Mb. Using the ARS-UCD1.2 reference, we acquired Holstein/Jersey SV catalogs with 74,068/54,689 events spanning 202/135 Mb (7.43%/4.97% of the genome). SVs were enriched in less conserved, non-coding, and non-regulatory regions. Comparing Holsteins with differing feed efficiency (FE), SVs unique to high FE were linked to energy metabolism and olfactory receptors, while those specific to low FE were associated with material transport. We constructed Holstein/Jersey pangenome graphs with 148,598/105,875 nodes and 208,891/147,990 edges, representing 47,028/37,137 biallelic and multi-allelic events, and 63.75/42.34 Mb of novel sequence. We observed SV count saturation with 20 Holsteins, while adding Jerseys significantly increased the SV count, highlighting breed-specific SV events. CONCLUSION: Our long-read data and SV catalogs are valuable resources, revealing that the cattle genome is more complex than previously thought.
Affiliation(s)
- Yahui Gao
  - State Key Laboratory of Swine and Poultry Breeding Industry, National Engineering Research Center for Breeding Swine Industry, Guangdong Provincial Key Lab of Agro-Animal Genomics and Molecular Breeding, College of Animal Science, South China Agricultural University, Guangzhou 510642, China; Animal Genomics and Improvement Laboratory, Beltsville Agricultural Research Center, Agricultural Research Service, United States Department of Agriculture, Beltsville, MD 20705, USA; Department of Animal and Avian Sciences, University of Maryland, College Park, MD 20742, USA.
- Liu Yang
  - Animal Genomics and Improvement Laboratory, Beltsville Agricultural Research Center, Agricultural Research Service, United States Department of Agriculture, Beltsville, MD 20705, USA; Department of Animal and Avian Sciences, University of Maryland, College Park, MD 20742, USA.
- Kristen Kuhn
  - USDA, ARS, U.S. Meat Animal Research Center (USMARC), Clay Center, NE, USA.
- Wenli Li
  - US Dairy Forage Research Center, USDA-ARS, Madison, WI, USA.
- Geoffrey Zanton
  - US Dairy Forage Research Center, USDA-ARS, Madison, WI, USA.
- Mary Bowman
  - Animal Genomics and Improvement Laboratory, Beltsville Agricultural Research Center, Agricultural Research Service, United States Department of Agriculture, Beltsville, MD 20705, USA.
- Pengju Zhao
  - Hainan Institute, Zhejiang University, Yongyou Industry Park, Yazhou Bay Sci-Tech City, Sanya 572000, China.
- Yang Zhou
  - Key Laboratory of Agricultural Animal Genetics, Breeding and Reproduction of Ministry of Education, Huazhong Agricultural University, Wuhan 430070, China.
- Lingzhao Fang
  - Quantitative Genetics and Genomics (QGG), Aarhus University, Aarhus, Denmark.
- John B Cole
  - Council on Dairy Cattle Breeding, 4201 Northview Dr, Bowie, MD 20716, USA; Department of Animal Sciences, Donald Henry Barron Reproductive and Perinatal Biology Research Program, and the Genetics Institute, University of Florida, Gainesville, FL 32611-0910, USA; Department of Animal Science, North Carolina State University, Raleigh, NC 27695-7621, USA.
- Benjamin D Rosen
  - Animal Genomics and Improvement Laboratory, Beltsville Agricultural Research Center, Agricultural Research Service, United States Department of Agriculture, Beltsville, MD 20705, USA.
- Li Ma
  - Department of Animal and Avian Sciences, University of Maryland, College Park, MD 20742, USA.
- Congjun Li
  - Animal Genomics and Improvement Laboratory, Beltsville Agricultural Research Center, Agricultural Research Service, United States Department of Agriculture, Beltsville, MD 20705, USA.
- Ransom L Baldwin
  - Animal Genomics and Improvement Laboratory, Beltsville Agricultural Research Center, Agricultural Research Service, United States Department of Agriculture, Beltsville, MD 20705, USA.
- Curtis P Van Tassell
  - Animal Genomics and Improvement Laboratory, Beltsville Agricultural Research Center, Agricultural Research Service, United States Department of Agriculture, Beltsville, MD 20705, USA.
- Zhe Zhang
  - State Key Laboratory of Swine and Poultry Breeding Industry, National Engineering Research Center for Breeding Swine Industry, Guangdong Provincial Key Lab of Agro-Animal Genomics and Molecular Breeding, College of Animal Science, South China Agricultural University, Guangzhou 510642, China.
- Timothy P L Smith
  - USDA, ARS, U.S. Meat Animal Research Center (USMARC), Clay Center, NE, USA.
- George E Liu
  - Animal Genomics and Improvement Laboratory, Beltsville Agricultural Research Center, Agricultural Research Service, United States Department of Agriculture, Beltsville, MD 20705, USA.
2
Mahmoud M, Agustinho DP, Sedlazeck FJ. A Hitchhiker's Guide to long-read genomic analysis. Genome Res 2025; 35:545-558. [PMID: 40228901 PMCID: PMC12047252 DOI: 10.1101/gr.279975.124] [Indexed: 04/16/2025]
Abstract
Over the past decade, long-read sequencing has evolved into a pivotal technology for uncovering the hidden and complex regions of the genome. Significant cost efficiency, scalability, and accuracy advancements have driven this evolution. Concurrently, novel analytical methods have emerged to harness the full potential of long reads. These advancements have enabled milestones such as the first fully completed human genome, enhanced identification and understanding of complex genomic variants, and deeper insights into the interplay between epigenetics and genomic variation. This mini-review provides a comprehensive overview of the latest developments in long-read DNA sequencing analysis, encompassing reference-based and de novo assembly approaches. We explore the entire workflow, from initial data processing to variant calling and annotation, focusing on how these methods improve our ability to interpret a wide array of genomic variants. Additionally, we discuss the current challenges, limitations, and future directions in the field, offering a detailed examination of the state-of-the-art bioinformatics methods for long-read sequencing.
Affiliation(s)
- Medhat Mahmoud
  - Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas 77030, USA
- Daniel P Agustinho
  - Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas 77030, USA
- Fritz J Sedlazeck
  - Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas 77030, USA
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas 77030, USA
  - Department of Computer Science, Rice University, Houston, Texas 77005, USA
3
Del Gobbo GF, Boycott KM. The additional diagnostic yield of long-read sequencing in undiagnosed rare diseases. Genome Res 2025; 35:559-571. [PMID: 39900460 PMCID: PMC12047273 DOI: 10.1101/gr.279970.124] [Indexed: 02/05/2025]
Abstract
Long-read sequencing (LRS) is a promising technology positioned to study the significant proportion of rare diseases (RDs) that remain undiagnosed as it addresses many of the limitations of short-read sequencing, detecting and clarifying additional disease-associated variants that may be missed by the current standard diagnostic workflow for RDs. Some key areas where additional diagnostic yields may be realized include: (1) detection and resolution of structural variants (SVs); (2) detection and characterization of tandem repeat expansions; (3) coverage of regions of high sequence similarity; (4) variant phasing; (5) the use of de novo genome assemblies for reference-based or graph genome variant detection; and (6) epigenetic and transcriptomic evaluations. Examples from over 50 studies support that the main areas of added diagnostic yield currently lie in SV detection and characterization, repeat expansion assessment, and phasing (with or without DNA methylation information). Several emerging studies applying LRS in cohorts of undiagnosed RDs also demonstrate that LRS can boost diagnostic yields following negative standard-of-care clinical testing and provide an added yield of 7%-17% following negative short-read genome sequencing. With this evidence of improved diagnostic yield, we discuss the incorporation of LRS into the diagnostic care pathway for undiagnosed RDs, including current challenges and considerations, with the ultimate goal of ending the diagnostic odyssey for countless individuals with RDs.
Affiliation(s)
- Giulia F Del Gobbo
  - Children's Hospital of Eastern Ontario Research Institute, University of Ottawa, Ottawa, Ontario, Canada K1H 5B2
- Kym M Boycott
  - Children's Hospital of Eastern Ontario Research Institute, University of Ottawa, Ottawa, Ontario, Canada K1H 5B2
  - Department of Genetics, Children's Hospital of Eastern Ontario, Ottawa, Ontario, Canada K1H 8L1
4
Chen H, Xu S. Population genomics advances in frontier ethnic minorities in China. Sci China Life Sci 2025; 68:961-973. [PMID: 39643831 DOI: 10.1007/s11427-024-2659-2] [Received: 04/29/2024] [Accepted: 06/18/2024] [Indexed: 12/09/2024]
Abstract
China, with its large geographic span, possesses rich genetic diversity across vast frontier regions in addition to the Han Chinese majority. Importantly, demographic events and various natural and cultural environments in Chinese frontier regions have shaped the genomic diversity of ethnic minorities via local adaptations. Thus, insights into the genetic diversity and adaptive evolution of these under-represented ethnic groups are crucial for understanding evolutionary scenarios and biomedical implications in East Asian populations. Here, we focus on ethnic minorities in Chinese frontier regions and review research advances regarding genomic diversity, genetic structure, population history, genetic admixture, and local adaptation. We first provide an overview of the extensive genetic diversity across populations in different Chinese frontier regions. Next, we summarize research progress regarding genetic ancestry, demographic history, the adaptive process, and the archaic identification of multiple ethnic minorities in different Chinese frontier regions. Finally, we discuss the gaps and opportunities in genomic studies of Chinese populations and the need for a more comprehensive understanding of genomic diversity and the evolution of populations of East Asian ancestry in the post-genomic era.
Affiliation(s)
- Hao Chen
  - Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, 200031, China
- Shuhua Xu
  - Center for Evolutionary Biology, School of Life Sciences, Fudan University, Shanghai, 200438, China.
5
Smaruj PN, Xiao Y, Fudenberg G. Recipes and ingredients for deep learning models of 3D genome folding. Curr Opin Genet Dev 2025; 91:102308. [PMID: 39862604 PMCID: PMC11867851 DOI: 10.1016/j.gde.2024.102308] [Received: 09/12/2024] [Revised: 12/19/2024] [Accepted: 12/31/2024] [Indexed: 01/27/2025]
Abstract
Three-dimensional genome folding plays roles in gene regulation and disease. In this review, we compare and contrast recent deep learning models for predicting genome contact maps. We survey preprocessing, architecture, training, evaluation, and interpretation methods, highlighting the capabilities and limitations of different models. In each area, we highlight challenges, opportunities, and potential future directions for genome-folding models.
Affiliation(s)
- Paulina N Smaruj
  - Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA
- Yao Xiao
  - Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA
- Geoffrey Fudenberg
  - Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA.
6
Hatchell KE, Poll SR, Russell EM, Williams TJ, Ellsworth RE, Facio FM, Aguilar S, Esplin ED, Popejoy AB, Nussbaum RL, Aradhya S. Experience using conventional compared to ancestry-based population descriptors in clinical genomics laboratories. Am J Hum Genet 2025; 112:481-491. [PMID: 39884281 PMCID: PMC11947177 DOI: 10.1016/j.ajhg.2025.01.008] [Received: 10/04/2024] [Revised: 01/04/2025] [Accepted: 01/06/2025] [Indexed: 02/01/2025]
Abstract
Various scientific and professional groups, including the American Medical Association (AMA), American Society of Human Genetics (ASHG), American College of Medical Genetics (ACMG), and the National Academies of Sciences, Engineering, and Medicine (NASEM), have appropriately clarified that certain population descriptors, such as race and ethnicity, are social and cultural constructs with no basis in genetics. Nevertheless, these conventional population descriptors are routinely collected during the course of clinical genetic testing and may be used to interpret test results. Experts who have examined the use of population descriptors, both conventional and ancestry based, in human genetics and genomics have offered guidance on using these descriptors in research but not in clinical laboratory settings. This perspective piece is based on a decade of experience in a clinical genomics laboratory and provides insight into the relevance of conventional and ancestry-based population descriptors for clinical genetic testing, reporting, and clinical research on aggregated data. As clinicians, laboratory geneticists, genetic counselors, and researchers, we describe real-world experiences collecting conventional population descriptors in the course of clinical genetic testing and expose challenges in ensuring clarity and consistency in the use of population descriptors. Current practices in clinical genomics laboratories that are influenced by population descriptors are identified and discussed through case examples. In relation to this, we describe specific types of clinical research projects in which population descriptors were used and helped derive useful insights related to practicing and improving genomic medicine.
Affiliation(s)
- Kathryn E Hatchell
  - Labcorp Genetics, Inc. (formerly Invitae Corp.), San Francisco, CA, USA.
- Sarah R Poll
  - Labcorp Genetics, Inc. (formerly Invitae Corp.), San Francisco, CA, USA
- Emily M Russell
  - Labcorp Genetics, Inc. (formerly Invitae Corp.), San Francisco, CA, USA
- Trevor J Williams
  - Labcorp Genetics, Inc. (formerly Invitae Corp.), San Francisco, CA, USA
- Flavia M Facio
  - Labcorp Genetics, Inc. (formerly Invitae Corp.), San Francisco, CA, USA
- Sienna Aguilar
  - Labcorp Genetics, Inc. (formerly Invitae Corp.), San Francisco, CA, USA
- Edward D Esplin
  - Labcorp Genetics, Inc. (formerly Invitae Corp.), San Francisco, CA, USA
- Alice B Popejoy
  - Department of Public Health Sciences (Epidemiology Division), University of California Davis School of Medicine, Davis, CA, USA; UCDavis Health Comprehensive Cancer Center, University of California Davis Medical Center, Sacramento, CA, USA
- Robert L Nussbaum
  - Department of Pediatrics, University of California, San Francisco, San Francisco, CA, USA
- Swaroop Aradhya
  - Department of Pathology, Stanford University School of Medicine, Stanford, CA, USA
7
Nassir N, A Almarri M, Akter H, Hassan Khansaheb H, Uddin KMF, Abou Tayoun A, Du Plessis SS, Haber M, Alsheikh-Ali A, Uddin M. Advancing clinical genomics with Middle Eastern and South Asian pangenomes. Nat Med 2025; 31:725-727. [PMID: 40038508 DOI: 10.1038/s41591-025-03544-7] [Indexed: 03/06/2025]
Affiliation(s)
- Nasna Nassir
  - Center for Applied and Translational Genomics (CATG), Mohammed Bin Rashid University of Medicine and Health Sciences, Dubai Health, Dubai, United Arab Emirates
- Mohamed A Almarri
  - College of Medicine, Mohammed Bin Rashid University of Medicine and Health Sciences, Dubai Health, Dubai, United Arab Emirates
  - Genome Center, Department of Forensic Science and Criminology, Dubai Police GHQ, Dubai, United Arab Emirates
- Hosneara Akter
  - Genetics and Genomic Medicine Centre (GGMC), NeuroGen Healthcare, Dhaka, Bangladesh
  - Laboratory of Population Genetics, Department of Biochemistry and Molecular Biology, University of Dhaka, Dhaka, Bangladesh
- Hamda Hassan Khansaheb
  - College of Medicine, Mohammed Bin Rashid University of Medicine and Health Sciences, Dubai Health, Dubai, United Arab Emirates
- K M Furkan Uddin
  - Genetics and Genomic Medicine Centre (GGMC), NeuroGen Healthcare, Dhaka, Bangladesh
- Ahmad Abou Tayoun
  - Al Jalila Genomics Center of Excellence, Al Jalila Children's Specialty Hospital, Dubai Health, Dubai, United Arab Emirates
  - Center for Genomic Discovery, Mohammed Bin Rashid University of Medicine and Health Sciences, Dubai Health, Dubai, United Arab Emirates
- Stefan S Du Plessis
  - Center for Applied and Translational Genomics (CATG), Mohammed Bin Rashid University of Medicine and Health Sciences, Dubai Health, Dubai, United Arab Emirates
  - College of Medicine, Mohammed Bin Rashid University of Medicine and Health Sciences, Dubai Health, Dubai, United Arab Emirates
- Marc Haber
  - Cancer and Genomic Sciences, College of Medicine and Health, University of Birmingham Dubai, Dubai, United Arab Emirates
- Alawi Alsheikh-Ali
  - Center for Applied and Translational Genomics (CATG), Mohammed Bin Rashid University of Medicine and Health Sciences, Dubai Health, Dubai, United Arab Emirates.
  - College of Medicine, Mohammed Bin Rashid University of Medicine and Health Sciences, Dubai Health, Dubai, United Arab Emirates.
  - Dubai Health, Dubai, United Arab Emirates.
- Mohammed Uddin
  - Center for Applied and Translational Genomics (CATG), Mohammed Bin Rashid University of Medicine and Health Sciences, Dubai Health, Dubai, United Arab Emirates.
  - College of Medicine, Mohammed Bin Rashid University of Medicine and Health Sciences, Dubai Health, Dubai, United Arab Emirates.
  - GenomeArc, Mississauga, Ontario, Canada.
8
Palma-Martínez MJ, Posadas-García YS, Shaukat A, López-Ángeles BE, Sohail M. Evolution, genetic diversity, and health. Nat Med 2025; 31:751-761. [PMID: 40055519 DOI: 10.1038/s41591-025-03558-1] [Received: 10/07/2024] [Accepted: 02/03/2025] [Indexed: 03/21/2025]
Abstract
Human genetic diversity in today's world has been shaped by evolutionary history, demographic shifts and environmental exposures, influencing complex traits, disease susceptibility and drug responses. Capturing this diversity is essential for advancing precision medicine and promoting equitable healthcare. Despite the great progress achieved with initiatives such as the human Pangenome and large biobanks that aim for a better representation of human diversity, important challenges remain. In this Perspective, we discuss the importance of diversity in clinical genomics through an evolutionary lens. We highlight progress and challenges and outline key clinical applications of diverse genetic data. We argue that diversifying both datasets and methodologies, integrating ancestral and environmental factors, is crucial for fully understanding the genetic basis of human health and disease.
Affiliation(s)
- María J Palma-Martínez
  - Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, México
- Amara Shaukat
  - Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, México
- Brenda E López-Ángeles
  - Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, México
- Mashaal Sohail
  - Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, México.
9
Cline N, Merlo D, Frater S, Pollock NR, Mayor NP, Turner TR, Walsh L, Vivers S, Norman PJ. The Case of a Missing HLA-B Gene. HLA 2025; 105:e70114. [PMID: 40117098 PMCID: PMC11932453 DOI: 10.1111/tan.70114] [Received: 11/30/2024] [Revised: 01/30/2025] [Accepted: 02/18/2025] [Indexed: 03/23/2025]
Abstract
The Major Histocompatibility Complex (MHC) of human chromosome 6 contains multiple genes critical for immunity. The exceptional polymorphism of this genomic region that establishes and maintains immune diversity can be technically challenging to characterise and analyse. In this study, we present a family where the mother and one of her children have no HLA-B allele in common, implying the absence of HLA-B from the maternal haplotype. Homozygosity of the mother and child was confirmed using three independent PCR-based methods and high throughput DNA sequencing. Through probe-based MHC region enrichment, sequencing, and read mapping, we located the breakpoints of a large (36.5 kbp) deletion encompassing the entire HLA-B gene. Accordingly, the deletion was present on the maternal haplotype and transmitted to the child. This study demonstrates strategies for locating large deletions in complex genomic regions and highlights the dynamic nature of MHC structure and variation.
Affiliation(s)
- Noah Cline
  - Department of Biomedical Informatics, University of Colorado School of Medicine, Aurora, Colorado, USA
  - Department of Immunology and Microbiology, University of Colorado School of Medicine, Aurora, Colorado, USA
- Dario Merlo
  - Anthony Nolan Research Institute, Royal Free Hospital, London, UK
- Sandra Frater
  - Anthony Nolan Research Institute, Royal Free Hospital, London, UK
- Nicholas R. Pollock
  - Department of Biomedical Informatics, University of Colorado School of Medicine, Aurora, Colorado, USA
  - Department of Immunology and Microbiology, University of Colorado School of Medicine, Aurora, Colorado, USA
- Neema P. Mayor
  - Anthony Nolan Research Institute, Royal Free Hospital, London, UK
  - UCL Cancer Institute, Royal Free Campus, London, UK
- Thomas R. Turner
  - Anthony Nolan Research Institute, Royal Free Hospital, London, UK
  - UCL Cancer Institute, Royal Free Campus, London, UK
- Lisa Walsh
  - Anthony Nolan Research Institute, Royal Free Hospital, London, UK
- Sharon Vivers
  - Anthony Nolan Research Institute, Royal Free Hospital, London, UK
  - UCL Cancer Institute, Royal Free Campus, London, UK
- Paul J. Norman
  - Department of Biomedical Informatics, University of Colorado School of Medicine, Aurora, Colorado, USA
  - Department of Immunology and Microbiology, University of Colorado School of Medicine, Aurora, Colorado, USA
10
Jana U, Rodriguez OL, Lees W, Engelbrecht E, Vanwinkle Z, Peres A, Gibson WS, Shields K, Schultze S, Dorgham A, Emery M, Deikus G, Sebra R, Eichler EE, Yaari G, Smith ML, Watson CT. The human immunoglobulin heavy chain constant gene locus is enriched for large complex structural variants and coding polymorphisms that vary in frequency among human populations. bioRxiv 2025:2025.02.12.634878. [PMID: 39990387 PMCID: PMC11844466 DOI: 10.1101/2025.02.12.634878] [Indexed: 02/25/2025]
Abstract
The immunoglobulin heavy chain constant (IGHC) domain of antibodies (Ab) is responsible for effector functions critical to Ab-mediated immunity. In humans, this domain is encoded by genes within the IGHC locus, where descriptions of genomic diversity remain incomplete. To address this, we utilized long-read genomic datasets to build a high-quality IGHC haplotype/variant catalog from 105 individuals of diverse ancestry, and developed a high-throughput approach for targeted long-read IGHC locus sequencing and assembly. From locally phased assemblies, we discovered previously uncharacterized single nucleotide variants (SNVs) and complex structural variants (SVs, n=7), as well as novel genes and alleles. Of the 262 identified IGHC coding alleles, 235 (89.6%) were undocumented. SNV, SV, and gene allele/genotype frequencies revealed significant population differentiation, including (i) hundreds of SNVs in African and East Asian populations exceeding a fixation index (FST) of 0.3 and (ii) an IGHG4 haplotype carrying specific coding variants uniquely enriched in East and South Asian populations. Our results illuminate missing signatures of haplotype diversity in the IGHC locus, including evidence of natural selection, and establish a new foundation for investigating IGHC germline variation and its role in Ab function and disease.
Affiliation(s)
- Uddalok Jana
  - Department of Biochemistry and Molecular Genetics, University of Louisville School of Medicine, Louisville, KY, USA
- Oscar L. Rodriguez
  - Department of Biochemistry and Molecular Genetics, University of Louisville School of Medicine, Louisville, KY, USA
- William Lees
  - Department of Biochemistry and Molecular Genetics, University of Louisville School of Medicine, Louisville, KY, USA
  - Bioengineering Program, Faculty of Engineering, Bar-Ilan University, Ramat Gan, 5290002, Israel
- Eric Engelbrecht
  - Department of Biochemistry and Molecular Genetics, University of Louisville School of Medicine, Louisville, KY, USA
- Zach Vanwinkle
  - Department of Biochemistry and Molecular Genetics, University of Louisville School of Medicine, Louisville, KY, USA
- Ayelet Peres
  - Bioengineering Program, Faculty of Engineering, Bar-Ilan University, Ramat Gan, 5290002, Israel
- William S. Gibson
  - Department of Biochemistry and Molecular Genetics, University of Louisville School of Medicine, Louisville, KY, USA
  - Vaccine Research Center, National Institute of Allergy and Infectious Disease, National Institute of Health, Bethesda, MD
- Kaitlyn Shields
  - Department of Biochemistry and Molecular Genetics, University of Louisville School of Medicine, Louisville, KY, USA
- Steven Schultze
  - Department of Biochemistry and Molecular Genetics, University of Louisville School of Medicine, Louisville, KY, USA
- Abdullah Dorgham
  - Department of Biochemistry and Molecular Genetics, University of Louisville School of Medicine, Louisville, KY, USA
- Matthew Emery
  - Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Gintaras Deikus
  - Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Robert Sebra
  - Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Evan E. Eichler
  - Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
  - Howard Hughes Medical Institute, University of Washington, Seattle, WA, USA
- Gur Yaari
  - Bioengineering Program, Faculty of Engineering, Bar-Ilan University, Ramat Gan, 5290002, Israel
- Melissa L. Smith
  - Department of Biochemistry and Molecular Genetics, University of Louisville School of Medicine, Louisville, KY, USA
- Corey T. Watson
  - Department of Biochemistry and Molecular Genetics, University of Louisville School of Medicine, Louisville, KY, USA
11
Villalba A, Smajdor A, Brassington I, Cutas D. The ethics of synthetic DNA. J Med Ethics 2024:jme-2024-110124. [PMID: 39567177 DOI: 10.1136/jme-2024-110124] [Received: 05/05/2024] [Accepted: 10/13/2024] [Indexed: 11/22/2024]
Abstract
In this paper, we discuss the ethical concerns that may arise from the synthesis of human DNA. To date, only small stretches of DNA have been constructed, but the prospect of generating human genomes is becoming feasible. At the same time, the significance of genes for identity, health and reproduction is coming under increased scrutiny. We examine the implications of DNA synthesis and its impact on debates over the relationship with our DNA and the ownership of our genes, its potential to disrupt common understandings of reproduction and privacy, and the way in which synthetic DNA challenges traditional associations between genes and identity. We explore the degree to which synthetic DNA may further undermine overgeneticised accounts of identity, health, reproduction, parenthood and privacy that are prevalent in the public domain and in some areas of policy-making. While avoiding making normative claims of our own, we conclude that there is a need for reflection on the ethical implications of these developing technologies before they are on us.
Affiliation(s)
- Adrian Villalba
  - Université Paris Cité, Paris, France
  - University of Granada, Granada, Spain
12
Kobayashi Y, Chen E, Facio FM, Metz H, Poll SR, Swartzlander D, Johnson B, Aradhya S. Clinical Variant Reclassification in Hereditary Disease Genetic Testing. JAMA Netw Open 2024; 7:e2444526. [PMID: 39504018 PMCID: PMC11541632 DOI: 10.1001/jamanetworkopen.2024.44526] [Received: 04/22/2024] [Accepted: 09/17/2024] [Indexed: 11/08/2024]
Abstract
Importance Because accurate and consistent classification of DNA sequence variants is fundamental to germline genetic testing, understanding patterns of initial variant classification (VC) and subsequent reclassification from large-scale, empirical data can help improve VC methods, promote equity among race, ethnicity, and ancestry (REA) groups, and provide insights to inform clinical practice. Objectives To measure the degree to which initial VCs met certainty thresholds set by professional guidelines and quantify the rates of, the factors associated with, and the impact of reclassification among more than 2 million variants. Design, Setting, and Participants This cohort study used clinical multigene panel and exome sequencing data from diagnostic testing for hereditary disorders, carrier screening, or preventive genetic screening from individuals for whom genetic testing was performed between January 1, 2015, and June 30, 2023. Exposure DNA variants were classified into 1 of 5 categories: benign, likely benign, variant of uncertain significance (VUS), likely pathogenic, or pathogenic. Main Outcomes and Measures The main outcomes were accuracy of classifications, rates and directions of reclassifications, evidence contributing to reclassifications, and their impact across different clinical areas and REA groups. One-way analysis of variance followed by post hoc pairwise Tukey honest significant difference tests were used to analyze differences among means, and pairwise Pearson χ2 tests with Bonferroni corrections were used to compare categorical variables among groups. Results The cohort comprised 3 272 035 individuals (median [range] age, 44 [0-89] years; 2 240 506 female [68.47%] and 1 030 729 male [31.50%]; 216 752 Black [6.62%]; 336 414 Hispanic [10.28%]; 1 804 273 White [55.14%]). Among 2 051 736 variants observed over 8 years in this cohort, 94 453 (4.60%) were reclassified. 
Some variants were reclassified more than once, resulting in 105 172 total reclassification events. The majority (64 752 events [61.65%]) were changes from VUS to either likely benign, benign, likely pathogenic, or pathogenic categories. An additional 37.66% of reclassifications (39 608 events) were gains in classification certainty to terminal categories (ie, likely benign to benign and likely pathogenic to pathogenic). Only a small fraction (663 events [0.63%]) moved toward less certainty, or very rarely (61 events [0.06%]) were classification reversals. When normalized by the number of individuals tested, VUS reclassification rates were higher among specific underrepresented REA populations (Ashkenazi Jewish, Asian, Black, Hispanic, Pacific Islander, and Sephardic Jewish). Approximately one-half of VUS reclassifications (37 074 of 64 840 events [57.18%]) resulted from improved use of data from computational modeling. Conclusions and Relevance In this cohort study of individuals undergoing genetic testing, the empirically estimated accuracy of pathogenic, likely pathogenic, benign, and likely benign classifications exceeded the certainty thresholds set by current VC guidelines, suggesting the need to reevaluate definitions of these classifications. The relative contribution of various strategies to resolve VUS, including emerging machine learning-based computational methods, RNA analysis, and cascade family testing, provides useful insights that can be applied toward further improving VC methods, reducing the rate of VUS, and generating more definitive results for patients.
Affiliation(s)
- Yuya Kobayashi
- Labcorp Genetics Inc (formerly Invitae Corporation), San Francisco, California
| | - Elaine Chen
- Invitae Corporation (now part of Labcorp Genetics), San Francisco, California
- Now with Midi Health, Los Altos Hills, California
| | - Flavia M. Facio
- Labcorp Genetics Inc (formerly Invitae Corporation), San Francisco, California
| | - Hillery Metz
- Labcorp Genetics Inc (formerly Invitae Corporation), San Francisco, California
| | - Sarah R. Poll
- Labcorp Genetics Inc (formerly Invitae Corporation), San Francisco, California
| | - Dan Swartzlander
- Labcorp Genetics Inc (formerly Invitae Corporation), San Francisco, California
| | - Britt Johnson
- Invitae Corporation (now part of Labcorp Genetics), San Francisco, California
- Now with GeneDx, Stamford, Connecticut
| | - Swaroop Aradhya
- Invitae Corporation (now part of Labcorp Genetics), San Francisco, California
- Now with Illumina, San Diego, California
- Department of Pathology, Stanford University, Stanford, California
| |
|
13
|
Adams PE, Thies JL, Sutton JM, Millwood JD, Caldwell GA, Caldwell KA, Fierst JL. Identifying transgene insertions in Caenorhabditis elegans genomes with Oxford Nanopore sequencing. PeerJ 2024; 12:e18100. [PMID: 39285918 PMCID: PMC11404476 DOI: 10.7717/peerj.18100] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/04/2024] [Accepted: 08/26/2024] [Indexed: 09/19/2024] Open
Abstract
Genetically modified organisms are commonly used in disease research and agriculture, but the precise genomic alterations underlying transgenic mutations are often unknown. The position and characteristics of transgenes, including the number of independent insertions, influence the expression of both transgenic and wild-type sequences. We used long-read Oxford Nanopore Technologies (ONT) sequencing to assemble the genomes of two transgenic strains of Caenorhabditis elegans commonly used in the research of neurodegenerative diseases: BY250 (pPdat-1::GFP) and UA44 (GFP and human α-synuclein), a model for Parkinson's research. After scaffolding to the reference, the final assembled sequences were ∼102 Mb with N50s of 17.9 Mb and 18.0 Mb, respectively, and L90s of six contiguous sequences, representing chromosome-level assemblies. Each of the assembled sequences contained more than 99.2% of the Nematoda BUSCO genes found in the C. elegans reference and 99.5% of the annotated C. elegans reference protein-coding genes. We identified the locations of the transgene insertions and confirmed that all transgene sequences were inserted in intergenic regions, leaving the organismal gene content intact. The transgenic C. elegans genomes presented here will be a valuable resource for research on Parkinson's and other neurodegenerative diseases. Our work demonstrates that long-read sequencing is a fast, cost-effective way to assemble genome sequences and characterize mutant lines and strains.
Affiliation(s)
- Paula E Adams
- Department of Biological Sciences, Auburn University, Auburn, AL, United States of America
- Department of Biological Sciences, University of Alabama - Tuscaloosa, Tuscaloosa, AL, United States of America
| | - Jennifer L Thies
- Department of Biological Sciences, University of Alabama - Tuscaloosa, Tuscaloosa, AL, United States of America
- Curriculum in Toxicology and Environmental Medicine, University of North Carolina at Chapel Hill, Chapel Hill, NC, United States of America
| | - John M Sutton
- Department of Biological Sciences, University of Alabama - Tuscaloosa, Tuscaloosa, AL, United States of America
- Absci, Vancouver, WA, United States of America
| | - Joshua D Millwood
- Department of Biological Sciences, University of Alabama - Tuscaloosa, Tuscaloosa, AL, United States of America
- Department of Biological and Environmental Sciences, University of West Alabama, Livingston, AL, United States of America
| | - Guy A Caldwell
- Department of Biological Sciences, University of Alabama - Tuscaloosa, Tuscaloosa, AL, United States of America
| | - Kim A Caldwell
- Department of Biological Sciences, University of Alabama - Tuscaloosa, Tuscaloosa, AL, United States of America
| | - Janna L Fierst
- Department of Biological Sciences, Florida International University, Miami, FL, United States of America
- Biomolecular Sciences Institute, Florida International University, Miami, FL, United States of America
| |
|
14
|
L Rocha J, Lou RN, Sudmant PH. Structural variation in humans and our primate kin in the era of telomere-to-telomere genomes and pangenomics. Curr Opin Genet Dev 2024; 87:102233. [PMID: 39042999 PMCID: PMC11695101 DOI: 10.1016/j.gde.2024.102233] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/16/2024] [Revised: 07/02/2024] [Accepted: 07/05/2024] [Indexed: 07/25/2024]
Abstract
Structural variants (SVs) account for the majority of base pair differences both within and between primate species. However, our understanding of inter- and intra-species SV has been historically hampered by the quality of draft primate genomes and the absence of genome resources for key taxa. Recently, advances in long-read sequencing and genome assembly have begun to radically reshape our understanding of SVs. Two landmark achievements include the publication of a human telomere-to-telomere (T2T) genome as well as the development of the first human pangenome reference. In this review, we first look back to the major works laying the foundation for these projects. We then examine the ways in which T2T genome assemblies and pangenomes are transforming our understanding of and approach to primate SV. Finally, we discuss what the future of primate SV research may look like in the era of T2T genomes and pangenomics.
Affiliation(s)
- Joana L Rocha
- Department of Integrative Biology, University of California, Berkeley, Berkeley, USA. https://twitter.com/@joanocha
| | - Runyang N Lou
- Department of Integrative Biology, University of California, Berkeley, Berkeley, USA. https://twitter.com/@NicolasLou10
| | - Peter H Sudmant
- Department of Integrative Biology, University of California, Berkeley, Berkeley, USA; Center for Computational Biology, University of California, Berkeley, Berkeley, USA.
| |
|
15
|
Taylor DJ, Eizenga JM, Li Q, Das A, Jenike KM, Kenny EE, Miga KH, Monlong J, McCoy RC, Paten B, Schatz MC. Beyond the Human Genome Project: The Age of Complete Human Genome Sequences and Pangenome References. Annu Rev Genomics Hum Genet 2024; 25:77-104. [PMID: 38663087 PMCID: PMC11451085 DOI: 10.1146/annurev-genom-021623-081639] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Indexed: 08/29/2024]
Abstract
The Human Genome Project was an enormous accomplishment, providing a foundation for countless explorations into the genetics and genomics of the human species. Yet for many years, the human genome reference sequence remained incomplete and lacked representation of human genetic diversity. Recently, two major advances have emerged to address these shortcomings: complete gap-free human genome sequences, such as the one developed by the Telomere-to-Telomere Consortium, and high-quality pangenomes, such as the one developed by the Human Pangenome Reference Consortium. Facilitated by advances in long-read DNA sequencing and genome assembly algorithms, complete human genome sequences resolve regions that have been historically difficult to sequence, including centromeres, telomeres, and segmental duplications. In parallel, pangenomes capture the extensive genetic diversity across populations worldwide. Together, these advances usher in a new era of genomics research, enhancing the accuracy of genomic analysis, paving the path for precision medicine, and contributing to deeper insights into human biology.
Affiliation(s)
- Dylan J Taylor
- Department of Biology, Johns Hopkins University, Baltimore, Maryland, USA
| | - Jordan M Eizenga
- Genomics Institute, University of California, Santa Cruz, California, USA
| | - Qiuhui Li
- Department of Computer Science, Johns Hopkins University, Baltimore, Maryland, USA
| | - Arun Das
- Department of Computer Science, Johns Hopkins University, Baltimore, Maryland, USA
| | - Katharine M Jenike
- Department of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, Maryland, USA
| | - Eimear E Kenny
- Institute for Genomic Health, Icahn School of Medicine at Mount Sinai, New York, NY, USA
| | - Karen H Miga
- Department of Biomolecular Engineering, University of California, Santa Cruz, California, USA
- Genomics Institute, University of California, Santa Cruz, California, USA
| | - Jean Monlong
- Institut de Recherche en Santé Digestive, Université de Toulouse, INSERM, INRA, ENVT, UPS, Toulouse, France
| | - Rajiv C McCoy
- Department of Biology, Johns Hopkins University, Baltimore, Maryland, USA
| | - Benedict Paten
- Department of Biomolecular Engineering, University of California, Santa Cruz, California, USA
- Genomics Institute, University of California, Santa Cruz, California, USA
| | - Michael C Schatz
- Department of Computer Science, Johns Hopkins University, Baltimore, Maryland, USA
- Department of Biology, Johns Hopkins University, Baltimore, Maryland, USA
| |
|
16
|
Sarawad A, Hosagoudar S, Parvatikar P. Pan-genomics: Insight into the Functional Genome, Applications, Advancements, and Challenges. Curr Genomics 2024; 26:2-14. [PMID: 39911277 PMCID: PMC11793047 DOI: 10.2174/0113892029311541240627111506] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/20/2024] [Revised: 04/30/2024] [Accepted: 05/29/2024] [Indexed: 02/07/2025] Open
Abstract
A pan-genome is a compilation of the common and unique genomes found in a given species. It incorporates the genetic information from all of the genomes sampled, producing a big and diverse set of genetic material. Pan-genomic analysis has various advantages over typical genomics research. It creates a vast and varied spectrum of genetic material by combining the genetic data from all the sampled genomes. Comparing pan-genomics analysis to conventional genomic research, there are a number of benefits. Although the most recent era of pan-genomic studies has used cutting-edge sequencing technology to shed fresh light on biological variety and improvement, the potential uses of pan-genomics in improvement have not yet been fully realized. Pan-genome research in various organisms has demonstrated that missing genetic components and the detection of significant Structural Variants (SVs) can be investigated using pan-genomic methods. Many individual-specific sequences have been linked to biological adaptability, phenotypic, and key economic attributes. This study aims to focus on how pangenome analysis uncovers genetic differences in various organisms, including human, and their effects on phenotypes, as well as how this might help us comprehend the diversity of species. The review also concentrated on potential problems and the prospects for future pangenome research.
Affiliation(s)
- Akansha Sarawad
- Department of Biotechnology, Applied School of Science and Technology, BLDE (DU), Vijayapura, Karnataka, India
| | - Spoorti Hosagoudar
- Department of Biotechnology, Applied School of Science and Technology, BLDE (DU), Vijayapura, Karnataka, India
| | - Prachi Parvatikar
- Department of Biotechnology, Applied School of Science and Technology, BLDE (DU), Vijayapura, Karnataka, India
| |
|
17
|
Tavakoli N, Gibney D, Aluru S. GraphSlimmer: Preserving Read Mappability with the Minimum Number of Variants. J Comput Biol 2024; 31:616-637. [PMID: 38990757 DOI: 10.1089/cmb.2024.0601] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/13/2024] Open
Abstract
Modern genomic datasets, like those generated under the 1000 Genome Project, contain millions of variants belonging to known haplotypes. Although these datasets are more representative than a single reference sequence and can alleviate issues like reference bias, they are significantly more computationally burdensome to work with, often involving large-indexed genome graph data structures for tasks such as read mapping. The construction, preprocessing, and mapping algorithms can require substantial computational resources depending on the size of these variant sets. Moreover, the accuracy of mapping algorithms has been shown to decrease when working with complete variant sets. Therefore, a drastically reduced set of variants that preserves important properties of the original set is desirable. This work provides a technique for finding a minimal subset of variants S such that for given parameters α and δ, all substrings up to length α in the haplotypes are guaranteed to be still alignable to the appropriate locations with either Hamming or edit distance at most δ, using only S . Our contributions include showing the NP-hardness and inapproximability of these optimization problems and providing Integer Linear Programming (ILP) formulations. Our edit distance ILP formulation carefully decomposes the problem according to variant locations, which allows it to scale to support all of chromosome 22's variants from the 1000 Genome Project. Our experiments also demonstrate a significant reduction in the number of variants. For example, for moderately long reads, e.g., α = 1000, over 75% of the variants can be removed while preserving read mappability with edit distance at most one.
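The Hamming-distance version of the problem described in this abstract can be illustrated with a brute-force sketch. It is not the authors' ILP formulation and will not scale beyond toy inputs. It assumes (these are assumptions, not details from the abstract) that every variant is a biallelic SNP identified by its position, that each haplotype is given as the set of positions where it carries the alternate allele, and that a read may follow either allele at a retained variant, so the Hamming distance of a window equals the number of dropped variants the haplotype carries inside it. The function names are hypothetical.

```python
from itertools import combinations


def max_dropped_in_window(positions, alpha):
    """Largest number of the given positions that fall inside any window of length alpha."""
    ordered = sorted(positions)
    best = 0
    left = 0
    for right, pos in enumerate(ordered):
        # Shrink the window until it spans fewer than alpha bases.
        while pos - ordered[left] >= alpha:
            left += 1
        best = max(best, right - left + 1)
    return best


def minimal_variant_subset(haplotypes, alpha, delta):
    """Smallest set of SNP positions to keep so that every length-alpha window of every
    haplotype contains at most delta dropped variants (toy brute force, exponential time)."""
    all_variants = sorted(set().union(*haplotypes))
    for size in range(len(all_variants) + 1):
        for kept in combinations(all_variants, size):
            kept_set = set(kept)
            # A haplotype mismatches the reduced graph only at its dropped variants.
            if all(max_dropped_in_window(h - kept_set, alpha) <= delta for h in haplotypes):
                return kept_set
    return set(all_variants)
```

For two toy haplotypes `[{10, 12, 50}, {12, 14}]` with `alpha=5` and `delta=1`, keeping only the variant at position 12 is enough, because every 5-base window then contains at most one dropped variant per haplotype.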
Affiliation(s)
- Neda Tavakoli
- School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, Georgia, USA
| | - Daniel Gibney
- Department of Computer Science, University of Texas at Dallas, Richardson, Texas, USA
| | - Srinivas Aluru
- School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, Georgia, USA
| |
|
18
|
Mikhaylova V, Rzepka M, Kawamura T, Xia Y, Chang PL, Zhou S, Paasch A, Pham L, Modi N, Yao L, Perez-Agustin A, Pagans S, Boles TC, Lei M, Wang Y, Garcia-Bassets I, Chen Z. Targeted phasing of 2-200 kilobase DNA fragments with a short-read sequencer and a single-tube linked-read library method. Sci Rep 2024; 14:7988. [PMID: 38580715 PMCID: PMC10997766 DOI: 10.1038/s41598-024-58733-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/15/2023] [Accepted: 04/02/2024] [Indexed: 04/07/2024] Open
Abstract
In the human genome, heterozygous sites refer to genomic positions with a different allele or nucleotide variant on the maternal and paternal chromosomes. Resolving these allelic differences by chromosomal copy, also known as phasing, is achievable on a short-read sequencer when using a library preparation method that captures long-range genomic information. TELL-Seq is a library preparation that captures long-range genomic information with the aid of molecular identifiers (barcodes). The same barcode is used to tag the reads derived from the same long DNA fragment within a range of up to 200 kilobases (kb), generating linked-reads. This strategy can be used to phase an entire genome. Here, we introduce a TELL-Seq protocol developed for targeted applications, enabling the phasing of enriched loci of varying sizes, purity levels, and heterozygosity. To validate this protocol, we phased 2-200 kb loci enriched with different methods: CRISPR/Cas9-mediated excision coupled with pulsed-field electrophoresis for the longest fragments, CRISPR/Cas9-mediated protection from exonuclease digestion for mid-size fragments, and long PCR for the shortest fragments. All selected loci have known clinical relevance: BRCA1, BRCA2, MLH1, MSH2, MSH6, APC, PMS2, SCN5A-SCN10A, and PIK3CA. Collectively, the analyses show that TELL-Seq can accurately phase 2-200 kb targets using a short-read sequencer.
Affiliation(s)
| | - Madison Rzepka
- Universal Sequencing Technology Corp., Carlsbad, CA, 92011, USA
| | | | - Yu Xia
- Universal Sequencing Technology Corp., Carlsbad, CA, 92011, USA
| | - Peter L Chang
- Universal Sequencing Technology Corp., Carlsbad, CA, 92011, USA
| | | | - Amber Paasch
- Universal Sequencing Technology Corp., Carlsbad, CA, 92011, USA
| | - Long Pham
- Universal Sequencing Technology Corp., Carlsbad, CA, 92011, USA
| | - Naisarg Modi
- Universal Sequencing Technology Corp., Carlsbad, CA, 92011, USA
| | - Likun Yao
- Department of Medicine, University of California, San Diego, La Jolla, CA, 92093, USA
| | - Adrian Perez-Agustin
- Department of Medical Sciences, School of Medicine, University of Girona, Girona, Spain
| | - Sara Pagans
- Department of Medical Sciences, School of Medicine, University of Girona, Girona, Spain
| | | | - Ming Lei
- Universal Sequencing Technology Corp., Canton, MA, 02021, USA
| | - Yong Wang
- Universal Sequencing Technology Corp., Canton, MA, 02021, USA
| | | | - Zhoutao Chen
- Universal Sequencing Technology Corp., Carlsbad, CA, 92011, USA.
| |
|
19
|
Hickey G, Monlong J, Ebler J, Novak AM, Eizenga JM, Gao Y, Marschall T, Li H, Paten B, Abel HJ, Antonacci-Fulton LL, Asri M, Baid G, Baker CA, Belyaeva A, Billis K, Bourque G, Buonaiuto S, Carroll A, Chaisson MJP, Chang PC, Chang XH, Cheng H, Chu J, Cody S, Colonna V, Cook DE, Cook-Deegan RM, Cornejo OE, Diekhans M, Doerr D, Ebert P, Ebler J, Eichler EE, Eizenga JM, Fairley S, Fedrigo O, Felsenfeld AL, Feng X, Fischer C, Flicek P, Formenti G, Frankish A, Fulton RS, Gao Y, Garg S, Garrison E, Garrison NA, Giron CG, Green RE, Groza C, Guarracino A, Haggerty L, Hall IM, Harvey WT, Haukness M, Haussler D, Heumos S, Hickey G, Hoekzema K, Hourlier T, Howe K, Jain M, Jarvis ED, Ji HP, Kenny EE, Koenig BA, Kolesnikov A, Korbel JO, Kordosky J, Koren S, Lee H, Lewis AP, Li H, Liao WW, Lu S, Lu TY, Lucas JK, Magalhães H, Marco-Sola S, Marijon P, Markello C, Marschall T, Martin FJ, McCartney A, McDaniel J, Miga KH, Mitchell MW, Monlong J, Mountcastle J, Munson KM, Mwaniki MN, Nattestad M, Novak AM, Nurk S, Olsen HE, Olson ND, Paten B, Pesout T, Phillippy AM, Popejoy AB, Porubsky D, Prins P, Puiu D, Rautiainen M, Regier AA, Rhie A, Sacco S, Sanders AD, Schneider VA, Schultz BI, Shafin K, Sibbesen JA, Sirén J, Smith MW, Sofia HJ, Tayoun ANA, Thibaud-Nissen F, Tomlinson C, Tricomi FF, Villani F, Vollger MR, Wagner J, Walenz B, Wang T, Wood JMD, Zimin AV, Zook JM. Pangenome graph construction from genome alignments with Minigraph-Cactus. Nat Biotechnol 2024; 42:663-673. [PMID: 37165083 PMCID: PMC10638906 DOI: 10.1038/s41587-023-01793-w] [Citation(s) in RCA: 60] [Impact Index Per Article: 60.0] [Received: 10/06/2022] [Accepted: 04/18/2023] [Indexed: 05/12/2023]
Abstract
Pangenome references address biases of reference genomes by storing a representative set of diverse haplotypes and their alignment, usually as a graph. Alternate alleles determined by variant callers can be used to construct pangenome graphs, but advances in long-read sequencing are leading to widely available, high-quality phased assemblies. Constructing a pangenome graph directly from assemblies, as opposed to variant calls, leverages the graph's ability to represent variation at different scales. Here we present the Minigraph-Cactus pangenome pipeline, which creates pangenomes directly from whole-genome alignments, and demonstrate its ability to scale to 90 human haplotypes from the Human Pangenome Reference Consortium. The method builds graphs containing all forms of genetic variation while allowing use of current mapping and genotyping tools. We measure the effect of the quality and completeness of reference genomes used for analysis within the pangenomes and show that using the CHM13 reference from the Telomere-to-Telomere Consortium improves the accuracy of our methods. We also demonstrate construction of a Drosophila melanogaster pangenome.
Affiliation(s)
- Glenn Hickey
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA, USA
- These authors contributed equally: Glenn Hickey, Jean Monlong
| | - Jean Monlong
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA, USA
- These authors contributed equally: Glenn Hickey, Jean Monlong
| | - Jana Ebler
- Institute for Medical Biometry and Bioinformatics, Medical Faculty, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
- Center for Digital Medicine, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
| | - Adam M. Novak
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA, USA
| | - Jordan M. Eizenga
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA, USA
| | - Yan Gao
- Center for Computational and Genomic Medicine, The Children’s Hospital of Philadelphia, Philadelphia, PA, USA
| | | | - Tobias Marschall
- Institute for Medical Biometry and Bioinformatics, Medical Faculty, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
- Center for Digital Medicine, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
| | - Heng Li
- Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA, USA
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
| | - Benedict Paten
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA, USA
| | | | - Haley J. Abel
- Division of Oncology, Department of Internal Medicine, Washington University School of Medicine, St. Louis, MO, USA
| | | | - Mobin Asri
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA, USA
| | | | - Carl A. Baker
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
| | | | - Konstantinos Billis
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
| | - Guillaume Bourque
- Department of Human Genetics, McGill University, Montreal, QC, Canada
- Canadian Center for Computational Genomics, McGill University, Montreal, QC, Canada
- Institute for the Advanced Study of Human Biology (WPI-ASHBi), Kyoto University, Kyoto, Japan
| | - Silvia Buonaiuto
- Institute of Genetics and Biophysics, National Research Council, Naples, Italy
| | | | - Mark J. P. Chaisson
- Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA
| | | | - Xian H. Chang
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA, USA
| | - Haoyu Cheng
- Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA, USA
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
| | - Justin Chu
- Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA, USA
| | - Sarah Cody
- McDonnell Genome Institute, Washington University School of Medicine, St. Louis, MO, USA
| | - Vincenza Colonna
- Institute of Genetics and Biophysics, National Research Council, Naples, Italy
- Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
| | | | - Robert M. Cook-Deegan
- Arizona State University, Barrett and O’Connor Washington Center, Washington, DC, USA
| | - Omar E. Cornejo
- Department of Ecology and Evolutionary Biology, University of California, Santa Cruz, Santa Cruz, CA, USA
| | - Mark Diekhans
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA, USA
| | - Daniel Doerr
- Center for Digital Medicine, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
- Institute for Medical Biometry and Bioinformatics, Medical Faculty, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
| | - Peter Ebert
- Center for Digital Medicine, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
- Institute for Medical Biometry and Bioinformatics, Medical Faculty, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
- Core Unit Bioinformatics, Medical Faculty, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
| | - Jana Ebler
- Center for Digital Medicine, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
- Institute for Medical Biometry and Bioinformatics, Medical Faculty, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
| | - Evan E. Eichler
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
- Howard Hughes Medical Institute, Chevy Chase, MD, USA
| | - Jordan M. Eizenga
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA, USA
| | - Susan Fairley
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
| | - Olivier Fedrigo
- Vertebrate Genome Laboratory, The Rockefeller University, New York, NY, USA
| | - Adam L. Felsenfeld
- National Institutes of Health (NIH)–National Human Genome Research Institute, Bethesda, MD, USA
| | - Xiaowen Feng
- Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA, USA
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
| | - Christian Fischer
- Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
| | - Paul Flicek
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
| | - Giulio Formenti
- Vertebrate Genome Laboratory, The Rockefeller University, New York, NY, USA
| | - Adam Frankish
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
| | - Robert S. Fulton
- McDonnell Genome Institute, Washington University School of Medicine, St. Louis, MO, USA
- Department of Genetics, Washington University School of Medicine, St. Louis, MO, USA
| | - Yan Gao
- Center for Computational and Genomic Medicine, The Children’s Hospital of Philadelphia, Philadelphia, PA, USA
| | - Shilpa Garg
- Novo Nordisk Foundation Center for Biosustainability, Technical University of Denmark, Copenhagen, Denmark
| | - Erik Garrison
- Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
| | - Nanibaa’ A. Garrison
- Institute for Society and Genetics, College of Letters and Science, University of California, Los Angeles, Los Angeles, CA, USA
- Institute for Precision Health, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, CA, USA
- Division of General Internal Medicine and Health Services Research, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, CA, USA
| | - Carlos Garcia Giron
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
| | - Richard E. Green
- Department of Biomolecular Engineering, University of California, Santa Cruz, Santa Cruz, CA, USA
- Dovetail Genomics, Scotts Valley, CA, USA
| | - Cristian Groza
- Quantitative Life Sciences, McGill University, Montreal, QC, Canada
| | - Andrea Guarracino
- Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
- Genomics Research Centre, Human Technopole, Milan, Italy
| | - Leanne Haggerty
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
| | - Ira M. Hall
- Department of Genetics, Yale University School of Medicine, New Haven, CT, USA
- Center for Genomic Health, Yale University School of Medicine, New Haven, CT, USA
| | - William T. Harvey
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
| | - Marina Haukness
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA, USA
| | - David Haussler
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA, USA
- Howard Hughes Medical Institute, Chevy Chase, MD, USA
| | - Simon Heumos
- Quantitative Biology Center (QBiC), University of Tübingen, Tübingen, Germany
- Biomedical Data Science, Department of Computer Science, University of Tübingen, Tübingen, Germany
| | - Glenn Hickey
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA, USA
- These authors contributed equally: Glenn Hickey, Jean Monlong
| | - Kendra Hoekzema
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
| | - Thibaut Hourlier
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
| | - Kerstin Howe
- Tree of Life, Wellcome Sanger Institute, Hinxton, Cambridge, UK
| | - Miten Jain
- Northeastern University, Boston, MA, USA
| | - Erich D. Jarvis
- Howard Hughes Medical Institute, Chevy Chase, MD, USA
- Vertebrate Genome Laboratory, The Rockefeller University, New York, NY, USA
- Laboratory of Neurogenetics of Language, The Rockefeller University, New York, NY, USA
| | - Hanlee P. Ji
- Division of Oncology, Department of Medicine, Stanford University School of Medicine, Stanford, CA, USA
| | - Eimear E. Kenny
- Institute for Genomic Health, Icahn School of Medicine at Mount Sinai, New York, NY, USA
| | - Barbara A. Koenig
- Program in Bioethics and Institute for Human Genetics, University of California, San Francisco, San Francisco, CA, USA
| | | | - Jan O. Korbel
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
- European Molecular Biology Laboratory, Genome Biology Unit, Heidelberg, Germany
| | - Jennifer Kordosky
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
| | - Sergey Koren
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | - HoJoon Lee
- Division of Oncology, Department of Medicine, Stanford University School of Medicine, Stanford, CA, USA
| | - Alexandra P. Lewis
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
| | - Heng Li
- Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA, USA
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
| | - Wen-Wei Liao
- Department of Genetics, Yale University School of Medicine, New Haven, CT, USA
- Center for Genomic Health, Yale University School of Medicine, New Haven, CT, USA
- Division of Biology and Biomedical Sciences, Washington University School of Medicine, St. Louis, MO, USA
| | - Shuangjia Lu
- Department of Genetics, Yale University School of Medicine, New Haven, CT, USA
| | - Tsung-Yu Lu
- Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA
| | - Julian K. Lucas
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA, USA
| | - Hugo Magalhães
- Center for Digital Medicine, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
- Institute for Medical Biometry and Bioinformatics, Medical Faculty, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
| | - Santiago Marco-Sola
- Computer Sciences Department, Barcelona Supercomputing Center, Barcelona, Spain
- Departament d’Arquitectura de Computadors i Sistemes Operatius, Universitat Autònoma de Barcelona, Barcelona, Spain
| | - Pierre Marijon
- Center for Digital Medicine, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
- Institute for Medical Biometry and Bioinformatics, Medical Faculty, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
| | - Charles Markello
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA, USA
| | - Tobias Marschall
- Center for Digital Medicine, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
- Institute for Medical Biometry and Bioinformatics, Medical Faculty, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
| | - Fergal J. Martin
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
| | - Ann McCartney
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | - Jennifer McDaniel
- Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA
| | - Karen H. Miga
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA, USA
| | | | - Jean Monlong
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA, USA
- These authors contributed equally: Glenn Hickey, Jean Monlong
| | | | - Katherine M. Munson
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
| | | | | | - Adam M. Novak
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA, USA
| | - Sergey Nurk
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | - Hugh E. Olsen
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA, USA
| | - Nathan D. Olson
- Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA
| | - Benedict Paten
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA, USA
| | - Trevor Pesout
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA, USA
| | - Adam M. Phillippy
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | - Alice B. Popejoy
- Department of Public Health Sciences, University of California, Davis, Davis, CA, USA
| | - David Porubsky
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
| | - Pjotr Prins
- Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
| | - Daniela Puiu
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
| | - Mikko Rautiainen
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | - Allison A. Regier
- McDonnell Genome Institute, Washington University School of Medicine, St. Louis, MO, USA
| | - Arang Rhie
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | - Samuel Sacco
- Department of Ecology and Evolutionary Biology, University of California, Santa Cruz, Santa Cruz, CA, USA
| | - Ashley D. Sanders
- Berlin Institute for Medical Systems Biology, Max Delbrück Center for Molecular Medicine in the Helmholtz Association, Berlin, Germany
| | - Valerie A. Schneider
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
| | - Baergen I. Schultz
- National Institutes of Health (NIH)–National Human Genome Research Institute, Bethesda, MD, USA
| | | | - Jonas A. Sibbesen
- Center for Health Data Science, University of Copenhagen, Copenhagen, Denmark
| | - Jouni Sirén
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA, USA
| | - Michael W. Smith
- National Institutes of Health (NIH)–National Human Genome Research Institute, Bethesda, MD, USA
| | - Heidi J. Sofia
- National Institutes of Health (NIH)–National Human Genome Research Institute, Bethesda, MD, USA
| | - Ahmad N. Abou Tayoun
- Al Jalila Genomics Center of Excellence, Al Jalila Children’s Specialty Hospital, Dubai, UAE
- Center for Genomic Discovery, Mohammed Bin Rashid University of Medicine and Health Sciences, Dubai, UAE
| | - Françoise Thibaud-Nissen
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
| | - Chad Tomlinson
- McDonnell Genome Institute, Washington University School of Medicine, St. Louis, MO, USA
| | - Francesca Floriana Tricomi
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
| | - Flavia Villani
- Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
| | - Mitchell R. Vollger
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
- Division of Medical Genetics, University of Washington School of Medicine, Seattle, WA, USA
| | - Justin Wagner
- Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA
| | - Brian Walenz
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | - Ting Wang
- McDonnell Genome Institute, Washington University School of Medicine, St. Louis, MO, USA
- Department of Genetics, Washington University School of Medicine, St. Louis, MO, USA
| | | | - Aleksey V. Zimin
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD, USA
| | - Justin M. Zook
- Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA
| |
|
20
|
Venner E, Patterson K, Kalra D, Wheeler MM, Chen YJ, Kalla SE, Yuan B, Karnes JH, Walker K, Smith JD, McGee S, Radhakrishnan A, Haddad A, Empey PE, Wang Q, Lichtenstein L, Toledo D, Jarvik G, Musick A, Gibbs RA. The frequency of pathogenic variation in the All of Us cohort reveals ancestry-driven disparities. Commun Biol 2024; 7:174. [PMID: 38374434 PMCID: PMC10876563 DOI: 10.1038/s42003-023-05708-y] [Citation(s) in RCA: 8] [Impact Index Per Article: 8.0] [Received: 08/08/2022] [Accepted: 12/13/2023] [Indexed: 02/21/2024] Open
Abstract
Disparities in data underlying clinical genomic interpretation are an acknowledged problem, but there is a paucity of data demonstrating it. The All of Us Research Program is collecting data including whole-genome sequences, health records, and surveys for at least a million participants with diverse ancestry and access to healthcare, representing one of the largest biomedical research repositories of its kind. Here, we examine pathogenic and likely pathogenic variants that were identified in the All of Us cohort. The European ancestry subgroup showed the highest overall rate of pathogenic variation, with 2.26% of participants having a pathogenic variant. Other ancestry groups had lower rates of pathogenic variation, including 1.62% for the African ancestry group and 1.32% in the Latino/Admixed American ancestry group. Pathogenic variants were most frequently observed in genes related to Breast/Ovarian Cancer or Hypercholesterolemia. Variant frequencies in many genes were consistent with the data from the public gnomAD database, with some notable exceptions resolved using gnomAD subsets. Differences in pathogenic variant frequency observed between ancestral groups generally indicate biases of ascertainment of knowledge about those variants, but some deviations may be indicative of differences in disease prevalence. This work will allow precision medicine efforts to target the disparities revealed.
Affiliation(s)
- Eric Venner
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA.
| | - Karynne Patterson
- Department of Genome Sciences, University of Washington, Seattle, WA, USA
| | - Divya Kalra
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA
| | - Marsha M Wheeler
- Department of Genome Sciences, University of Washington, Seattle, WA, USA
| | - Yi-Ju Chen
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA
| | - Sara E Kalla
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA
| | - Bo Yuan
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA
| | - Jason H Karnes
- University of Arizona, R Ken Coit College of Pharmacy, Department of Pharmacy Practice and Science, Tucson, AZ, USA
- Vanderbilt University Medical Center, Department of Biomedical Informatics, Nashville, TN, USA
| | - Kimberly Walker
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA
| | - Joshua D Smith
- Department of Genome Sciences, University of Washington, Seattle, WA, USA
| | - Sean McGee
- Department of Genome Sciences, University of Washington, Seattle, WA, USA
| | | | - Andrew Haddad
- Department of Pharmaceutical Sciences, University of Pittsburgh School of Pharmacy, Pittsburgh, PA, USA
| | - Philip E Empey
- Department of Pharmacy and Therapeutics, University of Pittsburgh School of Pharmacy, Pittsburgh, PA, USA
| | - Qiaoyan Wang
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA
| | | | - Diana Toledo
- Broad Institute of MIT and Harvard, Cambridge, MA, USA
| | - Gail Jarvik
- Department of Medicine (Medical Genetics), University of Washington School of Medicine, Seattle, WA, USA
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
| | - Anjene Musick
- NIH All of Us Research Program, National Institutes of Health Office of the Director, Bethesda, MD, USA
| | - Richard A Gibbs
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA
| |
|
21
|
Abondio P, Bruno F, Passarino G, Montesanto A, Luiselli D. Pangenomics: A new era in the field of neurodegenerative diseases. Ageing Res Rev 2024; 94:102180. [PMID: 38163518 DOI: 10.1016/j.arr.2023.102180] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/07/2023] [Revised: 12/14/2023] [Accepted: 12/28/2023] [Indexed: 01/03/2024]
Abstract
A pangenome is composed of all the genetic variability of a group of individuals, and its application to the study of neurodegenerative diseases may provide valuable insights into the underlying aspects of genetic heterogeneity for these complex ailments, including gene expression, epigenetics, and translation mechanisms. Furthermore, a reference pangenome allows for the identification of previously undetected structural commonalities and differences among individuals, which may help in the diagnosis of a disease, support the prediction of what will happen over time (prognosis) and aid in developing novel treatments in the perspective of personalized medicine. Therefore, in the present review, the application of the pangenome concept to the study of neurodegenerative diseases will be discussed and analyzed for its potential to enable an improvement in diagnosis and prognosis for these illnesses, leading to the development of tailored treatments for individual patients from the knowledge of the genomic composition of a whole population.
Affiliation(s)
- Paolo Abondio
- Laboratory of Ancient DNA, Department of Cultural Heritage, University of Bologna, Via degli Ariani 1, 48121 Ravenna, Italy.
| | - Francesco Bruno
- Academy of Cognitive Behavioral Sciences of Calabria (ASCoC), Lamezia Terme, Italy; Regional Neurogenetic Centre (CRN), Department of Primary Care, Azienda Sanitaria Provinciale Di Catanzaro, Viale A. Perugini, 88046 Lamezia Terme, CZ, Italy; Association for Neurogenetic Research (ARN), Lamezia Terme, CZ, Italy
| | - Giuseppe Passarino
- Department of Biology, Ecology and Earth Sciences, University of Calabria, Rende 87036, Italy
| | - Alberto Montesanto
- Department of Biology, Ecology and Earth Sciences, University of Calabria, Rende 87036, Italy
| | - Donata Luiselli
- Laboratory of Ancient DNA, Department of Cultural Heritage, University of Bologna, Via degli Ariani 1, 48121 Ravenna, Italy
| |
|
22
|
Dotan E, Lynch SM, Ryan JC, Mitchell EP. Disparities in care of older adults of color with cancer: A narrative review. Cancer Med 2024; 13:e6790. [PMID: 38234214 PMCID: PMC10905558 DOI: 10.1002/cam4.6790] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/04/2023] [Revised: 10/06/2023] [Accepted: 11/23/2023] [Indexed: 01/19/2024] Open
Abstract
This review describes the barriers and challenges faced by older adults of color with cancer and highlights methods to improve their overall care. In the next decade, cancer incidence rates are expected to increase in the United States for people aged ≥65 years. A large proportion will be older adults of color who often have worse outcomes than older White patients. Many issues contribute to racial disparities in older adults, including biological factors and social determinants of health (SDOH) related to healthcare access, socioeconomic concerns, systemic racism, mistrust, and the neighborhood where a person lives. These disparities are exacerbated by age-related challenges often experienced by older adults, such as decreased functional status, impaired cognition, high rates of comorbidities and polypharmacy, poor nutrition, and limited social support. Additionally, underrepresentation of both patients of color and older adults in cancer clinical research results in a lack of adequate data to guide the management of these patients. Use of geriatric assessments (GA) can aid providers in uncovering age-related concerns and personalizing interventions for older patients. Research demonstrates the ability of GA-directed care to result in fewer treatment-related toxicities and improved quality of life, thus supporting the routine incorporation of validated GA into these patients' care. GA can be enhanced by including evaluation of SDOH, which can help healthcare providers understand and address the needs of older adults of color with cancer who face disparities related to their age and race.
Affiliation(s)
- Efrat Dotan
- Department of Hematology/Oncology, Fox Chase Cancer Center, Philadelphia, Pennsylvania, USA
| | | | | | - Edith P. Mitchell
- Clinical Professor of Medicine and Medical Oncology, Sidney Kimmel Cancer Center at Jefferson, Philadelphia, Pennsylvania, USA
| |
|
23
|
Zhang Z, Jiang T, Li G, Cao S, Liu Y, Liu B, Wang Y. Kled: an ultra-fast and sensitive structural variant detection tool for long-read sequencing data. Brief Bioinform 2024; 25:bbae049. [PMID: 38385878 PMCID: PMC10883419 DOI: 10.1093/bib/bbae049] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/15/2023] [Revised: 01/12/2024] [Accepted: 01/26/2024] [Indexed: 02/23/2024] Open
Abstract
Structural Variants (SVs) are a crucial type of genetic variant that can significantly impact phenotypes. Therefore, the identification of SVs is an essential part of modern genomic analysis. In this article, we present kled, an ultra-fast and sensitive SV caller for long-read sequencing data, built on a specially designed approach that combines a novel signature-merging algorithm, custom refinement strategies and a high-performance program structure. The evaluation results demonstrate that kled can achieve optimal SV calling compared to several state-of-the-art methods on simulated and real long-read data for different platforms and sequencing depths. Furthermore, kled excels at rapid SV calling and can efficiently utilize multiple Central Processing Unit (CPU) cores while maintaining low memory usage. The source code for kled can be obtained from https://github.com/CoREse/kled.
Affiliation(s)
- Zhendong Zhang
- Center for Bioinformatics, Faculty of Computing, Harbin Institute of Technology, Harbin, Heilongjiang 150001, China
- Key Laboratory of Biological Bigdata, Ministry of Education, Harbin Institute of Technology, Harbin, Heilongjiang 150001, China
| | - Tao Jiang
- Center for Bioinformatics, Faculty of Computing, Harbin Institute of Technology, Harbin, Heilongjiang 150001, China
- Zhengzhou Research Institute, Harbin Institute of Technology, Zhengzhou, Henan, 450000, China
- Key Laboratory of Biological Bigdata, Ministry of Education, Harbin Institute of Technology, Harbin, Heilongjiang 150001, China
| | - Gaoyang Li
- Center for Bioinformatics, Faculty of Computing, Harbin Institute of Technology, Harbin, Heilongjiang 150001, China
- Key Laboratory of Biological Bigdata, Ministry of Education, Harbin Institute of Technology, Harbin, Heilongjiang 150001, China
| | - Shuqi Cao
- Center for Bioinformatics, Faculty of Computing, Harbin Institute of Technology, Harbin, Heilongjiang 150001, China
- Key Laboratory of Biological Bigdata, Ministry of Education, Harbin Institute of Technology, Harbin, Heilongjiang 150001, China
| | - Yadong Liu
- Center for Bioinformatics, Faculty of Computing, Harbin Institute of Technology, Harbin, Heilongjiang 150001, China
- Zhengzhou Research Institute, Harbin Institute of Technology, Zhengzhou, Henan, 450000, China
- Key Laboratory of Biological Bigdata, Ministry of Education, Harbin Institute of Technology, Harbin, Heilongjiang 150001, China
| | - Bo Liu
- Center for Bioinformatics, Faculty of Computing, Harbin Institute of Technology, Harbin, Heilongjiang 150001, China
- Zhengzhou Research Institute, Harbin Institute of Technology, Zhengzhou, Henan, 450000, China
- Key Laboratory of Biological Bigdata, Ministry of Education, Harbin Institute of Technology, Harbin, Heilongjiang 150001, China
| | - Yadong Wang
- Center for Bioinformatics, Faculty of Computing, Harbin Institute of Technology, Harbin, Heilongjiang 150001, China
- Zhengzhou Research Institute, Harbin Institute of Technology, Zhengzhou, Henan, 450000, China
- Key Laboratory of Biological Bigdata, Ministry of Education, Harbin Institute of Technology, Harbin, Heilongjiang 150001, China
| |
|
24
|
Dylus D, Altenhoff A, Majidian S, Sedlazeck FJ, Dessimoz C. Inference of phylogenetic trees directly from raw sequencing reads using Read2Tree. Nat Biotechnol 2024; 42:139-147. [PMID: 37081138 PMCID: PMC10791578 DOI: 10.1038/s41587-023-01753-4] [Citation(s) in RCA: 18] [Impact Index Per Article: 18.0] [Received: 04/18/2022] [Accepted: 03/16/2023] [Indexed: 04/22/2023]
Abstract
Current methods for inference of phylogenetic trees require running complex pipelines at substantial computational and labor costs, with additional constraints in sequencing coverage, assembly and annotation quality, especially for large datasets. To overcome these challenges, we present Read2Tree, which directly processes raw sequencing reads into groups of corresponding genes and bypasses traditional steps in phylogeny inference, such as genome assembly, annotation and all-versus-all sequence comparisons, while retaining accuracy. In a benchmark encompassing a broad variety of datasets, Read2Tree is 10-100 times faster than assembly-based approaches and in most cases more accurate, the exception being when sequencing coverage is high and reference species very distant. Here, to illustrate the broad applicability of the tool, we reconstruct a yeast tree of life of 435 species spanning 590 million years of evolution. We also apply Read2Tree to >10,000 Coronaviridae samples, accurately classifying highly diverse animal samples and near-identical severe acute respiratory syndrome coronavirus 2 sequences on a single tree. The speed, accuracy and versatility of Read2Tree enable comparative genomics at scale.
Affiliation(s)
- David Dylus
- Department of Computational Biology, University of Lausanne, Lausanne, Switzerland
- SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
- F. Hoffmann-La Roche Ltd, Immunology, Infectious Disease, and Ophthalmology (I2O), Roche Pharmaceutical Research and Early Development (pRED), Basel, Switzerland
| | - Adrian Altenhoff
- SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Department of Computer Science, ETH, Zurich, Switzerland
| | - Sina Majidian
- Department of Computational Biology, University of Lausanne, Lausanne, Switzerland
- SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
| | - Fritz J Sedlazeck
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA.
- Department of Computer Science, Rice University, Houston, TX, USA.
| | - Christophe Dessimoz
- Department of Computational Biology, University of Lausanne, Lausanne, Switzerland.
- SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland.
- Department of Computer Science, University College London, London, UK.
- Centre for Life's Origins and Evolution, Department of Genetics, Evolution and Environment, University College London, London, UK.
| |
|
25
|
Faltejsková K, Vondrášek J. PAPerFly: Partial Assembly-based Peak Finder for ab initio binding site reconstruction. BMC Bioinformatics 2023; 24:487. [PMID: 38114921 PMCID: PMC10731698 DOI: 10.1186/s12859-023-05613-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/29/2022] [Accepted: 12/11/2023] [Indexed: 12/21/2023] Open
Abstract
BACKGROUND The specific recognition of a DNA locus by a given transcription factor is a widely studied issue. It is generally agreed that the recognition can be influenced not only by the binding motif but by the larger context of the binding site. In this work, we present a novel heuristic algorithm that can reconstruct the unique binding sites captured in a sequencing experiment without using the reference genome. RESULTS We present PAPerFly, the Partial Assembly-based Peak Finder, a tool for the binding site and binding context reconstruction from the sequencing data without any prior knowledge. This tool operates without the need to know the reference genome of the respective organism. We employ algorithmic approaches that are used during genome assembly. The proposed algorithm constructs a de Bruijn graph from the sequencing data. Based on this graph, sequences and their enrichment are reconstructed using a novel heuristic algorithm. The reconstructed sequences are aligned and the peaks in the sequence enrichment are identified. Our approach was tested by processing several ChIP-seq experiments available in the ENCODE database and comparing the results of PAPerFly and standard methods. CONCLUSIONS We show that PAPerFly, an algorithm tailored for experiment analysis without the reference genome, yields better results than an aggregation of ChIP-seq agnostic tools. Our tool is freely available at https://github.com/Caeph/paperfly/ or on Zenodo (https://doi.org/10.5281/zenodo.7116424).
Affiliation(s)
- Kateřina Faltejsková
- Institute of Organic Chemistry and Biochemistry of the Czech Academy of Sciences, Flemingovo náměstí 542/2, 160 00, Prague, Czech Republic.
- Computer Science Institute, Faculty of Mathematics and Physics, Charles University, Malostranské náměstí 25, 118 00, Prague, Czech Republic.
| | - Jiří Vondrášek
- Institute of Organic Chemistry and Biochemistry of the Czech Academy of Sciences, Flemingovo náměstí 542/2, 160 00, Prague, Czech Republic.
| |
|
26
|
LoTempio J, Delot E, Vilain E. Benchmarking long-read genome sequence alignment tools for human genomics applications. PeerJ 2023; 11:e16515. [PMID: 38130927 PMCID: PMC10734412 DOI: 10.7717/peerj.16515] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/25/2023] [Accepted: 11/02/2023] [Indexed: 12/23/2023] Open
Abstract
Background The utility of long-read genome sequencing platforms has been shown in many fields including whole genome assembly, metagenomics, and amplicon sequencing. Less clear is the applicability of long reads to reference-guided human genomics, which is the foundation of genomic medicine. Here, we benchmark available platform-agnostic alignment tools on datasets from nanopore and single-molecule real-time platforms to understand their suitability in producing a genome representation. Results For this study, we leveraged publicly-available data from sample NA12878 generated on Oxford Nanopore and sample NA24385 on Pacific Biosciences platforms. We employed state of the art sequence alignment tools including GraphMap2, long-read aligner (LRA), Minimap2, CoNvex Gap-cost alignMents for Long Reads (NGMLR), and Winnowmap2. Minimap2 and Winnowmap2 were computationally lightweight enough for use at scale, while GraphMap2 was not. NGMLR took a long time and required many resources, but produced alignments each time. LRA was fast, but only worked on Pacific Biosciences data. Each tool widely disagreed on which reads to leave unaligned, affecting the end genome coverage and the number of discoverable breakpoints. No alignment tool independently resolved all large structural variants (1,001-100,000 base pairs) present in the Database of Genome Variants (DGV) for sample NA12878 or the truthset for NA24385. Conclusions These results suggest a combined approach is needed for LRS alignments for human genomics. Specifically, leveraging alignments from three tools will be more effective in generating a complete picture of genomic variability. It should be best practice to use an analysis pipeline that generates alignments with both Minimap2 and Winnowmap2 as they are lightweight and yield different views of the genome. Depending on the question at hand, the data available, and the time constraints, NGMLR and LRA are good options for a third tool. 
If computational resources and time are not a factor for a given case or experiment, NGMLR will provide another view, and another chance to resolve a case. LRA, while fast, did not work on the nanopore data for our cluster, but PacBio results were promising in that those computations completed faster than Minimap2. Due to its significant burden on computational resources and slow run time, GraphMap2 is not an ideal tool for exploration of a whole human genome generated on a long-read sequencing platform.
Affiliation(s)
- Jonathan LoTempio
- Institute for Clinical and Translational Science, University of California, Irvine, CA, United States of America
- International Research Laboratory (IRL2006) “Epigenetics, Data, Politics (EpiDaPo)”, Centre National de la Recherche Scientifique, Washington, DC, United States of America
| | - Emmanuele Delot
- Center for Genetic Medicine Research, Children’s National Hospital, Washington, DC, United States of America
- Department of Genomics and Precision Medicine, George Washington University, Washington, DC, United States of America
| | - Eric Vilain
- Institute for Clinical and Translational Science, University of California, Irvine, CA, United States of America
- International Research Laboratory (IRL2006) “Epigenetics, Data, Politics (EpiDaPo)”, Centre National de la Recherche Scientifique, Washington, DC, United States of America
| |
|
27
|
Kabata F, Thaldar D. The human genome as the common heritage of humanity. Front Genet 2023; 14:1282515. [PMID: 38028596 PMCID: PMC10662319 DOI: 10.3389/fgene.2023.1282515] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 08/24/2023] [Accepted: 10/16/2023] [Indexed: 12/01/2023] Open
Abstract
While debate on the international regulation of human genomic research remains unsettled, the Universal Declaration on the Human Genome and Human Rights, 1997 qualifies the human genome as "heritage of humankind" in a symbolic sense. Using document analysis this article assesses whether, how and to what extent the common heritage framework is relevant in regulation of human genomic research. The article traces the history of the Human Genome Project to reveal the international community's race against privatization of the human genome and its resulting qualification as the common heritage of humanity. Further, it reviews the archival records of UNESCO's International Bioethics Committee to discover the rationale for qualifying the human genome as common heritage of humankind. The article finds that the common heritage of mankind framework remains relevant to the application of the human genome at the collective level. However, the framework is at odds with the individual dimension of the human genome based on individual personality rights. The article thus argues that the right to benefit from scientific progress and its applications offers an alternative international regulatory framework for human genomic research.
Affiliation(s)
- Faith Kabata
- School of Law, University of KwaZulu-Natal, Durban, South Africa
| | - Donrich Thaldar
- School of Law, University of KwaZulu-Natal, Durban, South Africa
- Petrie-Flom Center for Health Law Policy, Biotechnology, and Bioethics, Harvard Law School, Cambridge, MA, United States
| |
|
28
|
Chrisman B, He C, Jung JY, Stockham N, Paskov K, Washington P, Petereit J, Wall DP. Localizing unmapped sequences with families to validate the Telomere-to-Telomere assembly and identify new hotspots for genetic diversity. Genome Res 2023; 33:1734-1746. [PMID: 37879860 PMCID: PMC10691534 DOI: 10.1101/gr.277175.122] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/02/2022] [Accepted: 05/25/2023] [Indexed: 10/27/2023]
Abstract
Although it is ubiquitous in genomics, the current human reference genome (GRCh38) is incomplete: It is missing large sections of heterochromatic sequence, and as a singular, linear reference genome, it does not represent the full spectrum of human genetic diversity. To characterize gaps in GRCh38 and human genetic diversity, we developed an algorithm for sequence location approximation using nuclear families (ASLAN) to identify the region of origin of reads that do not align to GRCh38. Using unmapped reads and variant calls from whole-genome sequences (WGSs), ASLAN uses a maximum likelihood model to identify the most likely region of the genome that a subsequence belongs to given the distribution of the subsequence in the unmapped reads and phasings of families. Validating ASLAN on synthetic data and on reads from the alternative haplotypes in the decoy genome, ASLAN localizes >90% of 100-bp sequences with >92% accuracy and ∼1 Mb of resolution. We then ran ASLAN on 100-mers from unmapped reads from WGS from more than 700 families, and compared ASLAN localizations to alignment of the 100-mers to the recently released T2T-CHM13 assembly. We found that many unmapped reads in GRCh38 originate from telomeres and centromeres that are gaps in GRCh38. ASLAN localizations are in high concordance with T2T-CHM13 alignments, except in the centromeres of the acrocentric chromosomes. Comparing ASLAN localizations and T2T-CHM13 alignments, we identified sequences missing from T2T-CHM13 or sequences with high divergence from their aligned region in T2T-CHM13, highlighting new hotspots for genetic diversity.
Affiliation(s)
- Brianna Chrisman
- Department of Bioengineering, Stanford University, Stanford, California 94305, USA
- Nevada Bioinformatics Center, University of Nevada, Reno, Nevada 89557, USA
| | - Chloe He
- Department of Biomedical Data Science, Stanford University, Stanford, California 94305, USA
| | - Jae-Yoon Jung
- Department of Pediatrics (Systems Medicine), Stanford University, Stanford, California 94305, USA
| | - Nate Stockham
- Department of Neuroscience, Stanford University, Stanford, California 94305, USA
| | - Kelley Paskov
- Department of Biomedical Data Science, Stanford University, Stanford, California 94305, USA
| | - Peter Washington
- Department of Bioengineering, Stanford University, Stanford, California 94305, USA
| | - Juli Petereit
- Nevada Bioinformatics Center, University of Nevada, Reno, Nevada 89557, USA
| | - Dennis P Wall
- Department of Biomedical Data Science, Stanford University, Stanford, California 94305, USA
- Department of Pediatrics (Systems Medicine), Stanford University, Stanford, California 94305, USA
| |
|
29
|
Yang C, Zhou Y, Song Y, Wu D, Zeng Y, Nie L, Liu P, Zhang S, Chen G, Xu J, Zhou H, Zhou L, Qian X, Liu C, Tan S, Zhou C, Dai W, Xu M, Qi Y, Wang X, Guo L, Fan G, Wang A, Deng Y, Zhang Y, Jin J, He Y, Guo C, Guo G, Zhou Q, Xu X, Yang H, Wang J, Xu S, Mao Y, Jin X, Ruan J, Zhang G. The complete and fully-phased diploid genome of a male Han Chinese. Cell Res 2023; 33:745-761. [PMID: 37452091 PMCID: PMC10542383 DOI: 10.1038/s41422-023-00849-5] [Citation(s) in RCA: 29] [Impact Index Per Article: 14.5] [Received: 05/06/2023] [Accepted: 06/29/2023] [Indexed: 07/18/2023] Open
Abstract
Since the release of the complete human genome, the priority of human genomic study has now been shifting towards closing gaps in ethnic diversity. Here, we present a fully phased and well-annotated diploid human genome from a Han Chinese male individual (CN1), in which the assemblies of both haploids achieve the telomere-to-telomere (T2T) level. Comparison of this diploid genome with the CHM13 haploid T2T genome revealed significant variations in the centromere. Outside the centromere, we discovered 11,413 structural variations, including numerous novel ones. We also detected thousands of CN1 alleles that have accumulated high substitution rates and a few that have been under positive selection in the East Asian population. Further, we found that CN1 outperforms CHM13 as a reference genome in mapping and variant calling for the East Asian population owing to the distinct structural variants of the two references. Comparison of SNP calling for a large cohort of 8869 Chinese genomes using CN1 and CHM13 as reference respectively showed that the reference bias profoundly impacts rare SNP calling, with nearly 2 million rare SNPs miscalled with different reference genomes. Finally, applying the CN1 as a reference, we discovered 5.80 Mb and 4.21 Mb putative introgression sequences from Neanderthal and Denisovan, respectively, including many East Asian specific ones undetected using CHM13 as the reference. Our analyses reveal the advantages of using CN1 as a reference for population genomic studies and paleo-genomic studies. This complete genome will serve as an alternative reference for future genomic studies on the East Asian population.
Affiliation(s)
- Chentao Yang
- Center for Genomic Research, International Institutes of Medicine, The Fourth Affiliated Hospital, Zhejiang University School of Medicine, Yiwu, Zhejiang, China
- Center for Evolutionary & Organismal Biology, & Women's Hospital, Zhejiang University School of Medicine, Hangzhou, Zhejiang, China
- BGI-Shenzhen, Shenzhen, Guangdong, China
| | - Yang Zhou
- BGI-Shenzhen, Shenzhen, Guangdong, China
- BGI Research-Wuhan, BGI, Wuhan, Hubei, China
| | - Yanni Song
- Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, Guangdong, China
| | - Dongya Wu
- Center for Genomic Research, International Institutes of Medicine, The Fourth Affiliated Hospital, Zhejiang University School of Medicine, Yiwu, Zhejiang, China
- Center for Evolutionary & Organismal Biology, & Women's Hospital, Zhejiang University School of Medicine, Hangzhou, Zhejiang, China
- Liangzhu Laboratory, Zhejiang University Medical Center, Hangzhou, Zhejiang, China
- Institute of Crop Science & Institute of Bioinformatics, Zhejiang University, Hangzhou, Zhejiang, China
| | - Yan Zeng
- BGI-Shenzhen, Shenzhen, Guangdong, China
| | - Lei Nie
- BGI-Shenzhen, Shenzhen, Guangdong, China
| | | | - Shilong Zhang
- Bio-X Institutes, Key Laboratory for the Genetics of Developmental and Neuropsychiatric Disorders, Ministry of Education, Shanghai Jiao Tong University, Shanghai, China
| | - Guangji Chen
- BGI-Shenzhen, Shenzhen, Guangdong, China
- College of Life Sciences, University of Chinese Academy of Sciences, Beijing, China
| | - Jinjin Xu
- BGI-Shenzhen, Shenzhen, Guangdong, China
| | - Hongling Zhou
- Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, Guangdong, China
| | - Long Zhou
- Center for Evolutionary & Organismal Biology, & Women's Hospital, Zhejiang University School of Medicine, Hangzhou, Zhejiang, China
- Liangzhu Laboratory, Zhejiang University Medical Center, Hangzhou, Zhejiang, China
- Innovation Center of Yangtze River Delta, Zhejiang University, Hangzhou, Zhejiang, China
| | - Xiaobo Qian
- BGI-Shenzhen, Shenzhen, Guangdong, China
- College of Life Sciences, University of Chinese Academy of Sciences, Beijing, China
| | - Chenlu Liu
- Life Sciences Institute, Zhejiang University, Hangzhou, Zhejiang, China
| | | | | | - Wei Dai
- BGI-Shenzhen, Shenzhen, Guangdong, China
| | - Mengyang Xu
- BGI-Shenzhen, Shenzhen, Guangdong, China
- BGI-Qingdao, BGI-Shenzhen, Qingdao, Shandong, China
| | - Yanwei Qi
- BGI-Qingdao, BGI-Shenzhen, Qingdao, Shandong, China
| | - Xiaobo Wang
- Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, Guangdong, China
| | - Lidong Guo
- College of Life Sciences, University of Chinese Academy of Sciences, Beijing, China
- BGI-Qingdao, BGI-Shenzhen, Qingdao, Shandong, China
| | - Guangyi Fan
- BGI-Qingdao, BGI-Shenzhen, Qingdao, Shandong, China
| | - Aijun Wang
- BGI-Qingdao, BGI-Shenzhen, Qingdao, Shandong, China
| | - Yuan Deng
- BGI-Shenzhen, Shenzhen, Guangdong, China
| | - Yong Zhang
- BGI-Shenzhen, Shenzhen, Guangdong, China
| | | | - Yunqiu He
- Center for Genomic Research, International Institutes of Medicine, The Fourth Affiliated Hospital, Zhejiang University School of Medicine, Yiwu, Zhejiang, China
- Center for Evolutionary & Organismal Biology, & Women's Hospital, Zhejiang University School of Medicine, Hangzhou, Zhejiang, China
| | - Chunxue Guo
- BGI-Shenzhen, Shenzhen, Guangdong, China
- BGI-Hangzhou, Hangzhou, Zhejiang, China
| | - Guoji Guo
- School of Medicine, Zhejiang University, Hangzhou, Zhejiang, China
| | - Qing Zhou
- Liangzhu Laboratory, Zhejiang University Medical Center, Hangzhou, Zhejiang, China
- Life Sciences Institute, Zhejiang University, Hangzhou, Zhejiang, China
| | - Xun Xu
- BGI-Shenzhen, Shenzhen, Guangdong, China
| | | | - Jian Wang
- BGI-Shenzhen, Shenzhen, Guangdong, China
| | - Shuhua Xu
- State Key Laboratory of Genetic Engineering, Center for Evolutionary Biology, Collaborative Innovation Center for Genetics and Development, School of Life Sciences, Fudan University, Shanghai, China
- Human Phenome Institute, Zhangjiang Fudan International Innovation Center, and Ministry of Education Key Laboratory of Contemporary Anthropology, Fudan University, Shanghai, China
- Jiangsu Key Laboratory of Phylogenomics & Comparative Genomics, International Joint Center of Genomics of Jiangsu Province School of Life Sciences, Jiangsu Normal University, Xuzhou, Jiangsu, China
- Department of Liver Surgery and Transplantation Liver Cancer Institute, Zhongshan Hospital, Fudan University, Shanghai, China
- Center for Excellence in Animal Evolution and Genetics, Chinese Academy of Sciences, Kunming, Yunnan, China
| | - Yafei Mao
- Bio-X Institutes, Key Laboratory for the Genetics of Developmental and Neuropsychiatric Disorders, Ministry of Education, Shanghai Jiao Tong University, Shanghai, China
| | - Xin Jin
- BGI-Shenzhen, Shenzhen, Guangdong, China
| | - Jue Ruan
- Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, Guangdong, China
| | - Guojie Zhang
- Center for Genomic Research, International Institutes of Medicine, The Fourth Affiliated Hospital, Zhejiang University School of Medicine, Yiwu, Zhejiang, China
- Center for Evolutionary & Organismal Biology, & Women's Hospital, Zhejiang University School of Medicine, Hangzhou, Zhejiang, China
- Liangzhu Laboratory, Zhejiang University Medical Center, Hangzhou, Zhejiang, China
- Innovation Center of Yangtze River Delta, Zhejiang University, Hangzhou, Zhejiang, China
- State Key Laboratory of Genetic Resources and Evolution, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming, Yunnan, China
| |
|
30
|
Lee H, Greer SU, Pavlichin DS, Zhou B, Urban AE, Weissman T, Ji HP. Pan-conserved segment tags identify ultra-conserved sequences across assemblies in the human pangenome. Cell Rep Methods 2023; 3:100543. [PMID: 37671027 PMCID: PMC10475782 DOI: 10.1016/j.crmeth.2023.100543] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/17/2022] [Revised: 04/14/2023] [Accepted: 07/06/2023] [Indexed: 09/07/2023]
Abstract
The human pangenome, a new reference sequence, addresses many limitations of the current GRCh38 reference. The first release is based on 94 high-quality haploid assemblies from individuals with diverse backgrounds. We employed a k-mer indexing strategy for comparative analysis across multiple assemblies, including the pangenome reference, GRCh38, and CHM13, a telomere-to-telomere reference assembly. Our k-mer indexing approach enabled us to identify a valuable collection of universally conserved sequences across all assemblies, referred to as "pan-conserved segment tags" (PSTs). By examining intervals between these segments, we discerned highly conserved genomic segments and those with structurally related polymorphisms. We found 60,764 polymorphic intervals with unique geo-ethnic features in the pangenome reference. In this study, we utilized ultra-conserved sequences (PSTs) to forge a link between human pangenome assemblies and reference genomes. This methodology enables the examination of any sequence of interest within the pangenome, using the reference genome as a comparative framework.
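The idea of finding k-mers conserved across every assembly can be illustrated with a toy set intersection. This is only a sketch under stated assumptions: the function names `kmers` and `shared_kmers` are hypothetical, reverse complements are ignored, and the abstract does not give the authors' actual indexing scheme or selection criteria for PSTs.

```python
from typing import Iterable, Set


def kmers(seq: str, k: int) -> Set[str]:
    """Return the set of all k-mers occurring in seq (forward strand only)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_kmers(assemblies: Iterable[str], k: int) -> Set[str]:
    """Return k-mers present in every assembly (hypothetical helper).

    Assumption: "conserved" here simply means "present in all inputs";
    a real pipeline would likely add uniqueness and strand handling.
    """
    sets = [kmers(s, k) for s in assemblies]
    if not sets:
        return set()
    return set.intersection(*sets)


toy_assemblies = ["ACGTACGGA", "TTACGTAC", "GACGTACC"]
conserved = shared_kmers(toy_assemblies, k=5)
print(sorted(conserved))
```

On real assemblies the sets would not fit in memory this naively, so a disk-backed or sketch-based k-mer index would be needed.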
Affiliation(s)
- HoJoon Lee
- Division of Oncology, Department of Medicine, Stanford University School of Medicine, Stanford, CA 94305, USA
| | - Stephanie U. Greer
- Division of Oncology, Department of Medicine, Stanford University School of Medicine, Stanford, CA 94305, USA
| | - Dmitri S. Pavlichin
- Division of Oncology, Department of Medicine, Stanford University School of Medicine, Stanford, CA 94305, USA
| | - Bo Zhou
- Department of Psychiatry and Behavioral Sciences, Stanford University School of Medicine, Stanford, CA 94305, USA
- Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305, USA
| | - Alexander E. Urban
- Department of Psychiatry and Behavioral Sciences, Stanford University School of Medicine, Stanford, CA 94305, USA
- Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305, USA
| | - Tsachy Weissman
- Department of Electrical Engineering, Stanford University, Palo Alto, CA 94304, USA
| | - Hanlee P. Ji
- Division of Oncology, Department of Medicine, Stanford University School of Medicine, Stanford, CA 94305, USA
- Department of Electrical Engineering, Stanford University, Palo Alto, CA 94304, USA
| |
|
31
|
Abstract
DNA sequencing has revolutionized medicine over recent decades. However, analysis of large structural variation and repetitive DNA, a hallmark of human genomes, has been limited by short-read technology, with read lengths of 100-300 bp. Long-read sequencing (LRS) permits routine sequencing of human DNA fragments tens to hundreds of kilobase pairs in size, using both real-time sequencing by synthesis and nanopore-based direct electronic sequencing. LRS permits analysis of large structural variation and haplotypic phasing in human genomes and has enabled the discovery and characterization of rare pathogenic structural variants and repeat expansions. It has also recently enabled the assembly of a complete, gapless human genome that includes previously intractable regions, such as highly repetitive centromeres and homologous acrocentric short arms. With the addition of protocols for targeted enrichment, direct epigenetic DNA modification detection, and long-range chromatin profiling, LRS promises to launch a new era of understanding of genetic diversity and pathogenic mutations in human populations.
Affiliation(s)
- Peter E Warburton
- Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Center for Advanced Genomics Technology, Icahn School of Medicine at Mount Sinai, New York, NY, USA
| | - Robert P Sebra
- Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Center for Advanced Genomics Technology, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Black Family Stem Cell Institute, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Icahn Genomics Institute, Icahn School of Medicine at Mount Sinai, New York, NY, USA
| |
|
32
|
Rulten SL, Grose RP, Gatz SA, Jones JL, Cameron AJM. The Future of Precision Oncology. Int J Mol Sci 2023; 24:12613. [PMID: 37628794 PMCID: PMC10454858 DOI: 10.3390/ijms241612613] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 06/28/2023] [Revised: 08/03/2023] [Accepted: 08/05/2023] [Indexed: 08/27/2023] Open
Abstract
Our understanding of the molecular mechanisms underlying cancer development and evolution has evolved rapidly over recent years, and the variation from one patient to another is now widely recognized. Consequently, one-size-fits-all approaches to the treatment of cancer have been superseded by precision medicines that target specific disease characteristics, promising maximum clinical efficacy, minimal safety concerns, and reduced economic burden. While precision oncology has been very successful in the treatment of some tumors with specific characteristics, a large number of patients do not yet have access to precision medicines for their disease. The success of next-generation precision oncology depends on the discovery of new actionable disease characteristics, rapid, accurate, and comprehensive diagnosis of complex phenotypes within each patient, novel clinical trial designs with improved response rates, and worldwide access to novel targeted anticancer therapies for all patients. This review outlines some of the current technological trends, and highlights some of the complex multidisciplinary efforts that are underway to ensure that many more patients with cancer will be able to benefit from precision oncology in the near future.
Affiliation(s)
| | - Richard P. Grose
- Centre for Tumour Biology, Barts Cancer Institute, Queen Mary University of London, London EC1M 6BQ, UK
| | - Susanne A. Gatz
- Cancer Research UK Clinical Trials Unit (CRCTU), Institute of Cancer and Genomic Sciences, University of Birmingham, Edgbaston, Birmingham B15 2TT, UK
| | - J. Louise Jones
- Centre for Tumour Biology, Barts Cancer Institute, Queen Mary University of London, London EC1M 6BQ, UK
| | - Angus J. M. Cameron
- Centre for Tumour Biology, Barts Cancer Institute, Queen Mary University of London, London EC1M 6BQ, UK
| |
|
33
|
Yu H, Zheng Z, Su J, Lam TW, Luo R. Boosting variant-calling performance with multi-platform sequencing data using Clair3-MP. BMC Bioinformatics 2023; 24:308. [PMID: 37537536 PMCID: PMC10401749 DOI: 10.1186/s12859-023-05434-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/07/2023] [Accepted: 07/31/2023] [Indexed: 08/05/2023] Open
Abstract
BACKGROUND With the continuous advances in third-generation sequencing technology and the increasing affordability of next-generation sequencing technology, sequencing data from different sequencing technology platforms is becoming more common. While numerous benchmarking studies have been conducted to compare variant-calling performance across different platforms and approaches, little attention has been paid to the potential of leveraging the strengths of different platforms to optimize overall performance, especially integrating Oxford Nanopore and Illumina sequencing data. RESULTS We investigated the impact of multi-platform data on the performance of variant calling through carefully designed experiments with a deep learning-based variant caller named Clair3-MP (Multi-Platform). Through our research, we not only demonstrated the capability of ONT-Illumina data for improved variant calling, but also identified the optimal scenarios for utilizing ONT-Illumina data. In addition, we revealed that the improvement in variant calling using ONT-Illumina data comes from an improvement in difficult genomic regions, such as the large low-complexity regions and segmental and collapse duplication regions. Moreover, Clair3-MP can incorporate reference genome stratification information to achieve a small but measurable improvement in variant calling. Clair3-MP is accessible as an open-source project at: https://github.com/HKU-BAL/Clair3-MP . CONCLUSIONS These insights have important implications for researchers and practitioners alike, providing valuable guidance for improving the reliability and efficiency of genomic analysis in diverse applications.
Affiliation(s)
- Huijing Yu
- Department of Computer Science, The University of Hong Kong, Pok Fu Lam, Hong Kong SAR, China
| | - Zhenxian Zheng
- Department of Computer Science, The University of Hong Kong, Pok Fu Lam, Hong Kong SAR, China
| | - Junhao Su
- Department of Computer Science, The University of Hong Kong, Pok Fu Lam, Hong Kong SAR, China
| | - Tak-Wah Lam
- Department of Computer Science, The University of Hong Kong, Pok Fu Lam, Hong Kong SAR, China
| | - Ruibang Luo
- Department of Computer Science, The University of Hong Kong, Pok Fu Lam, Hong Kong SAR, China
| |
|
34
|
Ma J, Cáceres M, Salmela L, Mäkinen V, Tomescu AI. Chaining for accurate alignment of erroneous long reads to acyclic variation graphs. Bioinformatics 2023; 39:btad460. [PMID: 37494467 PMCID: PMC10423031 DOI: 10.1093/bioinformatics/btad460] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 05/20/2022] [Revised: 06/08/2023] [Accepted: 07/25/2023] [Indexed: 07/28/2023] Open
Abstract
MOTIVATION Aligning reads to a variation graph is a standard task in pangenomics, with downstream applications such as improving variant calling. While the vg toolkit [Garrison et al. (Variation graph toolkit improves read mapping by representing genetic variation in the reference. Nat Biotechnol 2018;36:875-9)] is a popular aligner of short reads, GraphAligner [Rautiainen and Marschall (GraphAligner: rapid and versatile sequence-to-graph alignment. Genome Biol 2020;21:253-28)] is the state-of-the-art aligner of erroneous long reads. GraphAligner works by finding candidate read occurrences based on individually extending the best seeds of the read in the variation graph. However, a more principled approach recognized in the community is to co-linearly chain multiple seeds. RESULTS We present a new algorithm to co-linearly chain a set of seeds in a string labeled acyclic graph, together with the first efficient implementation of such a co-linear chaining algorithm into a new aligner of erroneous long reads to acyclic variation graphs, GraphChainer. We run experiments aligning real and simulated PacBio CLR reads with average error rates 15% and 5%. Compared to GraphAligner, GraphChainer aligns 12-17% more reads, and 21-28% more total read length, on real PacBio CLR reads from human chromosomes 1, 22, and the whole human pangenome. On both simulated and real data, GraphChainer aligns between 95% and 99% of all reads, and of total read length. We also show that minigraph [Li et al. (The design and construction of reference pangenome graphs with minigraph. Genome Biol 2020;21:265-19.)] and minichain [Chandra and Jain (Sequence to graph alignment using gap-sensitive co-linear chaining. In: Proceedings of the 27th Annual International Conference on Research in Computational Molecular Biology (RECOMB 2023). Springer, 2023, 58-73.)] obtain an accuracy of <60% on this setting. 
AVAILABILITY AND IMPLEMENTATION GraphChainer is freely available at https://github.com/algbio/GraphChainer. The datasets and evaluation pipeline can be reached from the previous address.
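To make the contrast between extending a single best seed and co-linearly chaining several seeds concrete, here is a minimal sketch of chaining on a plain linear reference. It is not GraphChainer's algorithm, which works on acyclic graphs with a more efficient method; the `Anchor` layout, the `best_chain` name, and the simple quadratic dynamic program are assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical anchor layout: (read_start, ref_start, length) of an exact match.
Anchor = Tuple[int, int, int]


def best_chain(anchors: List[Anchor]) -> List[Anchor]:
    """Return a chain of non-overlapping, co-linear anchors with maximum total length.

    Simplified O(n^2) dynamic program on a linear reference (illustrative only).
    """
    if not anchors:
        return []
    order = sorted(anchors)  # by read_start, then ref_start
    n = len(order)
    score = [a[2] for a in order]  # best chain score ending at each anchor
    prev = [-1] * n                # back-pointers for chain recovery
    for j in range(n):
        rj, tj, lj = order[j]
        for i in range(j):
            ri, ti, li = order[i]
            # Anchor i must end before anchor j starts in both read and reference.
            if ri + li <= rj and ti + li <= tj and score[i] + lj > score[j]:
                score[j] = score[i] + lj
                prev[j] = i
    end = max(range(n), key=score.__getitem__)
    chain = []
    while end != -1:
        chain.append(order[end])
        end = prev[end]
    return chain[::-1]


seeds = [(0, 100, 10), (12, 112, 8), (5, 500, 20), (25, 125, 10)]
print(best_chain(seeds))
```

In this toy input the longest single seed, `(5, 500, 20)`, points to a distant reference location, while the three shorter seeds agree with each other and together cover more of the read, so the chain prefers them.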
Affiliation(s)
- Jun Ma
- Department of Computer Science, University of Helsinki, 00014 Helsinki, Finland
| | - Manuel Cáceres
- Department of Computer Science, University of Helsinki, 00014 Helsinki, Finland
| | - Leena Salmela
- Department of Computer Science, University of Helsinki, 00014 Helsinki, Finland
| | - Veli Mäkinen
- Department of Computer Science, University of Helsinki, 00014 Helsinki, Finland
| | - Alexandru I Tomescu
- Department of Computer Science, University of Helsinki, 00014 Helsinki, Finland
| |
|
35
|
Houwaart T, Scholz S, Pollock NR, Palmer WH, Kichula KM, Strelow D, Le DB, Belick D, Hülse L, Lautwein T, Wachtmeister T, Wollenweber TE, Henrich B, Köhrer K, Parham P, Guethlein LA, Norman PJ, Dilthey AT. Complete sequences of six major histocompatibility complex haplotypes, including all the major MHC class II structures. HLA 2023; 102:28-43. [PMID: 36932816 PMCID: PMC10986641 DOI: 10.1111/tan.15020] [Citation(s) in RCA: 13] [Impact Index Per Article: 6.5] [Received: 05/30/2022] [Revised: 02/10/2023] [Accepted: 02/24/2023] [Indexed: 03/19/2023]
Abstract
Accurate and comprehensive immunogenetic reference panels are key to the successful implementation of population-scale immunogenomics. The 5Mbp Major Histocompatibility Complex (MHC) is the most polymorphic region of the human genome and associated with multiple immune-mediated diseases, transplant matching and therapy responses. Analysis of MHC genetic variation is severely complicated by complex patterns of sequence variation, linkage disequilibrium and a lack of fully resolved MHC reference haplotypes, increasing the risk of spurious findings on analyzing this medically important region. Integrating Illumina, ultra-long Nanopore, and PacBio HiFi sequencing as well as bespoke bioinformatics, we completed five of the alternative MHC reference haplotypes of the current (GRCh38/hg38) build of the human reference genome and added one other. The six assembled MHC haplotypes encompass the DR1 and DR4 haplotype structures in addition to the previously completed DR2 and DR3, as well as six distinct classes of the structurally variable C4 region. Analysis of the assembled haplotypes showed that MHC class II sequence structures, including repeat element positions, are generally conserved within the DR haplotype supergroups, and that sequence diversity peaks in three regions around HLA-A, HLA-B+C, and the HLA class II genes. Demonstrating the potential for improved short-read analysis, the number of proper read pairs recruited to the MHC was found to be increased by 0.06%-0.49% in a 1000 Genomes Project read remapping experiment with seven diverse samples. Furthermore, the assembled haplotypes can serve as references for the community and provide the basis of a structurally accurate genotyping graph of the complete MHC region.
Affiliation(s)
- Torsten Houwaart
- Institute of Medical Microbiology and Hospital Hygiene, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
| | - Stephan Scholz
- Institute of Medical Microbiology and Hospital Hygiene, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
| | - Nicholas R. Pollock
- Department of Biomedical Informatics, Anschutz Medical Campus, University of Colorado, Aurora, Colorado, USA
- Department of Immunology and Microbiology, Anschutz Medical Campus, University of Colorado, Aurora, Colorado, USA
| | - William H. Palmer
- Department of Biomedical Informatics, Anschutz Medical Campus, University of Colorado, Aurora, Colorado, USA
- Department of Immunology and Microbiology, Anschutz Medical Campus, University of Colorado, Aurora, Colorado, USA
| | - Katherine M. Kichula
- Department of Biomedical Informatics, Anschutz Medical Campus, University of Colorado, Aurora, Colorado, USA
- Department of Immunology and Microbiology, Anschutz Medical Campus, University of Colorado, Aurora, Colorado, USA
| | - Daniel Strelow
- Institute of Medical Microbiology and Hospital Hygiene, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
| | - Duyen B. Le
- Institute of Medical Microbiology and Hospital Hygiene, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
| | - Dana Belick
- Institute of Medical Microbiology and Hospital Hygiene, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
| | - Lisanna Hülse
- Institute of Medical Microbiology and Hospital Hygiene, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
| | - Tobias Lautwein
- Biologisch‐Medizinisches‐Forschungszentrum (BMFZ), Genomics & Transcriptomics Laboratory, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
| | - Thorsten Wachtmeister
- Biologisch‐Medizinisches‐Forschungszentrum (BMFZ), Genomics & Transcriptomics Laboratory, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
| | - Tassilo E. Wollenweber
- Biologisch‐Medizinisches‐Forschungszentrum (BMFZ), Genomics & Transcriptomics Laboratory, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
| | - Birgit Henrich
- Institute of Medical Microbiology and Hospital Hygiene, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
| | - Karl Köhrer
- Biologisch‐Medizinisches‐Forschungszentrum (BMFZ), Genomics & Transcriptomics Laboratory, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
| | - Peter Parham
- Department of Structural Biology, and Department of Microbiology and Immunology, Stanford University, Stanford, California, USA
| | - Lisbeth A. Guethlein
- Department of Structural Biology, and Department of Microbiology and Immunology, Stanford University, Stanford, California, USA
| | - Paul J. Norman
- Department of Biomedical Informatics, Anschutz Medical Campus, University of Colorado, Aurora, Colorado, USA
- Department of Immunology and Microbiology, Anschutz Medical Campus, University of Colorado, Aurora, Colorado, USA
| | - Alexander T. Dilthey
- Institute of Medical Microbiology and Hospital Hygiene, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
|
36
|
Gao Y, Yang X, Chen H, Tan X, Yang Z, Deng L, Wang B, Kong S, Li S, Cui Y, Lei C, Wang Y, Pan Y, Ma S, Sun H, Zhao X, Shi Y, Yang Z, Wu D, Wu S, Zhao X, Shi B, Jin L, Hu Z, Lu Y, Chu J, Ye K, Xu S. A pangenome reference of 36 Chinese populations. Nature 2023; 619:112-121. [PMID: 37316654 PMCID: PMC10322713 DOI: 10.1038/s41586-023-06173-7] [Citation(s) in RCA: 34] [Impact Index Per Article: 17.0] [Received: 09/23/2022] [Accepted: 05/05/2023] [Indexed: 06/16/2023]
Abstract
Human genomics is witnessing an ongoing paradigm shift from a single reference sequence to a pangenome form, but populations of Asian ancestry are underrepresented. Here we present data from the first phase of the Chinese Pangenome Consortium (CPC), including a collection of 116 high-quality and haplotype-phased de novo assemblies based on 58 core samples representing 36 minority Chinese ethnic groups. With an average 30.65× high-fidelity long-read sequence coverage, an average contiguity N50 of more than 35.63 megabases and an average total size of 3.01 gigabases, the CPC core assemblies add 189 million base pairs of euchromatic polymorphic sequences and 1,367 protein-coding gene duplications to GRCh38. We identified 15.9 million small variants and 78,072 structural variants, of which 5.9 million small variants and 34,223 structural variants were not reported in a recently released pangenome reference (ref. 1). The Chinese Pangenome Consortium data demonstrate a remarkable increase in the discovery of novel and missing sequences when individuals are included from underrepresented minority ethnic groups. The missing reference sequences were enriched with archaic-derived alleles and genes that confer essential functions related to keratinization, response to ultraviolet radiation, DNA repair, immunological responses and lifespan, implying great potential for shedding new light on human evolution and recovering missing heritability in complex disease mapping.
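For readers unfamiliar with the contiguity statistic quoted in this abstract, N50 is conventionally the contig length L such that contigs of length L or longer together cover at least half of the total assembly. The following is a minimal Python sketch of that standard definition; the function name `n50` and the toy contig lengths are illustrative assumptions and are not data or code from the cited study.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths (standard definition)."""
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    # Walk contigs from longest to shortest until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical toy assembly (lengths in Mb), not values from the paper.
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_contigs))  # → 70
```

In the toy example the total is 300 Mb, and the two longest contigs (80 + 70 = 150 Mb) already reach half of it, so the N50 is 70 Mb.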
Affiliation(s)
- Yang Gao
- State Key Laboratory of Genetic Engineering, Human Phenome Institute, Zhangjiang Fudan International Innovation Center, Center for Evolutionary Biology, School of Life Sciences, Fudan University, Shanghai, China
- Ministry of Education Key Laboratory of Contemporary Anthropology, Collaborative Innovation Center for Genetics and Development, Fudan University, Shanghai, China
- Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, China
- School of Life Science and Technology, ShanghaiTech University, Shanghai, China
| | - Xiaofei Yang
- School of Computer Science and Technology, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, China
- MOE Key Lab for Intelligent Networks & Networks Security, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, China
- Genome Institute, The First Affiliated Hospital of Xi'an Jiaotong University, Xi'an, China
| | - Hao Chen
- Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, China
| | - Xinjiang Tan
- Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, China
| | - Zhaoqing Yang
- Department of Medical Genetics, Institute of Medical Biology, Chinese Academy of Medical Sciences, Kunming, China
| | - Lian Deng
- State Key Laboratory of Genetic Engineering, Human Phenome Institute, Zhangjiang Fudan International Innovation Center, Center for Evolutionary Biology, School of Life Sciences, Fudan University, Shanghai, China
| | - Baonan Wang
- Ministry of Education Key Laboratory of Contemporary Anthropology, Collaborative Innovation Center for Genetics and Development, Fudan University, Shanghai, China
| | - Shuang Kong
- Ministry of Education Key Laboratory of Contemporary Anthropology, Collaborative Innovation Center for Genetics and Development, Fudan University, Shanghai, China
| | - Songyang Li
- Ministry of Education Key Laboratory of Contemporary Anthropology, Collaborative Innovation Center for Genetics and Development, Fudan University, Shanghai, China
| | - Yuhang Cui
- Ministry of Education Key Laboratory of Contemporary Anthropology, Collaborative Innovation Center for Genetics and Development, Fudan University, Shanghai, China
| | - Chang Lei
- State Key Laboratory of Genetic Engineering, Human Phenome Institute, Zhangjiang Fudan International Innovation Center, Center for Evolutionary Biology, School of Life Sciences, Fudan University, Shanghai, China
| | - Yimin Wang
- Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, China
| | - Yuwen Pan
- Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, China
| | - Sen Ma
- Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, China
| | - Hao Sun
- Department of Medical Genetics, Institute of Medical Biology, Chinese Academy of Medical Sciences, Kunming, China
| | - Xiaohan Zhao
- Ministry of Education Key Laboratory of Contemporary Anthropology, Collaborative Innovation Center for Genetics and Development, Fudan University, Shanghai, China
| | - Yingbing Shi
- State Key Laboratory of Genetic Engineering, Human Phenome Institute, Zhangjiang Fudan International Innovation Center, Center for Evolutionary Biology, School of Life Sciences, Fudan University, Shanghai, China
| | - Ziyi Yang
- State Key Laboratory of Genetic Engineering, Human Phenome Institute, Zhangjiang Fudan International Innovation Center, Center for Evolutionary Biology, School of Life Sciences, Fudan University, Shanghai, China
| | - Dongdong Wu
- State Key Laboratory of Genetic Resources and Evolution, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming, China
| | - Shaoyuan Wu
- Jiangsu Key Laboratory of Phylogenomics & Comparative Genomics, International Joint Center of Genomics of Jiangsu Province School of Life Sciences, Jiangsu Normal University, Xuzhou, China
| | - Xingming Zhao
- Institute of Science and Technology for Brain-Inspired Intelligence, Ministry of Education Key (MOE) Laboratory of Computational Neuroscience and Brain-Inspired Intelligence, MOE Frontiers Center for Brain Science Fudan University, Shanghai, China
| | - Binyin Shi
- Department of Endocrinology, The First Affiliated Hospital of Xi'an Jiaotong University, Xi'an, China
| | - Li Jin
- State Key Laboratory of Genetic Engineering, Human Phenome Institute, Zhangjiang Fudan International Innovation Center, Center for Evolutionary Biology, School of Life Sciences, Fudan University, Shanghai, China
- Ministry of Education Key Laboratory of Contemporary Anthropology, Collaborative Innovation Center for Genetics and Development, Fudan University, Shanghai, China
| | - Zhibin Hu
- State Key Laboratory of Reproductive Medicine, Nanjing Medical University, Nanjing, China
- Jiangsu Key Lab of Cancer Biomarkers, Prevention and Treatment, Collaborative Innovation Center for Cancer Personalized Medicine, Center for Global Health, School of Public Health, Nanjing Medical University, Nanjing, China
| | - Yan Lu
- State Key Laboratory of Genetic Engineering, Human Phenome Institute, Zhangjiang Fudan International Innovation Center, Center for Evolutionary Biology, School of Life Sciences, Fudan University, Shanghai, China.
| | - Jiayou Chu
- Department of Medical Genetics, Institute of Medical Biology, Chinese Academy of Medical Sciences, Kunming, China.
| | - Kai Ye
- MOE Key Lab for Intelligent Networks & Networks Security, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, China.
- School of Automation Science and Engineering, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, China.
- School of Life Science and Technology, Xi'an Jiaotong University, Xi'an, China.
| | - Shuhua Xu
- State Key Laboratory of Genetic Engineering, Human Phenome Institute, Zhangjiang Fudan International Innovation Center, Center for Evolutionary Biology, School of Life Sciences, Fudan University, Shanghai, China.
- Ministry of Education Key Laboratory of Contemporary Anthropology, Collaborative Innovation Center for Genetics and Development, Fudan University, Shanghai, China.
- School of Life Science and Technology, ShanghaiTech University, Shanghai, China.
- Jiangsu Key Laboratory of Phylogenomics & Comparative Genomics, International Joint Center of Genomics of Jiangsu Province School of Life Sciences, Jiangsu Normal University, Xuzhou, China.
- Department of Liver Surgery and Transplantation Liver Cancer Institute, Zhongshan Hospital, Fudan University, Shanghai, China.
- Center for Excellence in Animal Evolution and Genetics, Chinese Academy of Sciences, Kunming, China.
|
37
|
Smith TPL, Bickhart DM, Boichard D, Chamberlain AJ, Djikeng A, Jiang Y, Low WY, Pausch H, Demyda-Peyrás S, Prendergast J, Schnabel RD, Rosen BD. The Bovine Pangenome Consortium: democratizing production and accessibility of genome assemblies for global cattle breeds and other bovine species. Genome Biol 2023; 24:139. [PMID: 37337218 DOI: 10.1186/s13059-023-02975-0] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Received: 07/19/2022] [Accepted: 05/19/2023] [Indexed: 06/21/2023] Open
Abstract
The Bovine Pangenome Consortium (BPC) is an international collaboration dedicated to the assembly of cattle genomes to develop a more complete representation of cattle genomic diversity. The goal of the BPC is to provide genome assemblies and a community-agreed pangenome representation to replace breed-specific reference assemblies for cattle genomics. The BPC invites partners sharing our vision to participate in the production of these assemblies and the development of a common, community-approved, pangenome reference as a public resource for the research community ( https://bovinepangenome.github.io/ ). This community-driven resource will provide the context for comparison between studies and the future foundation for cattle genomic selection.
Affiliation(s)
- Timothy P L Smith
- US Meat Animal Research Center, USDA-ARS, Clay Center, NE, 68933, USA
| | | | - Didier Boichard
- Université Paris-Saclay, INRAE, AgroParisTech, GABI, 78350, Jouy-en-Josas, France
| | - Amanda J Chamberlain
- Agriculture Victoria, AgriBio, Centre for AgriBioscience, Bundoora, VIC, 3083, Australia
- School of Applied Systems Biology, La Trobe University, Bundoora, VIC, 3083, Australia
| | - Appolinaire Djikeng
- Centre for Tropical Livestock Genetics and Health, ILRI Kenya, Nairobi, 30709-00100, Kenya
- Centre for Tropical Livestock Genetics and Health, Easter Bush, Midlothian, EH25 9RG, UK
| | - Yu Jiang
- Center for Ruminant Genetics and Evolution, Northwest A&F University, Yangling, 712100, China
| | - Wai Y Low
- The Davies Research Centre, School of Animal and Veterinary Sciences, University of Adelaide, Roseworthy, SA, 5371, Australia
| | - Hubert Pausch
- Animal Genomics, ETH Zurich, Universitaetstrasse 2, 8092, Zurich, Switzerland
| | - Sebastian Demyda-Peyrás
- Departamento de Producción Animal, Facultad de Ciencias Veterinarias, Universidad Nacional de La Plata, 1900, La Plata, Argentina
- Consejo Superior de Investigaciones Científicas Y Tecnológicas (CONICET), CCT-La Plata, 1900, La Plata, Argentina
| | - James Prendergast
- Centre for Tropical Livestock Genetics and Health, Easter Bush, Midlothian, EH25 9RG, UK
- The Roslin Institute, University of Edinburgh, Easter Bush, Midlothian, EH25 9RG, UK
| | - Robert D Schnabel
- Division of Animal Sciences, University of Missouri, Columbia, MO, 65211, USA
| | - Benjamin D Rosen
- Animal Genomics and Improvement Laboratory, USDA-ARS, Beltsville, MD, 20705, USA.
|
38
|
A pangenome reference representative of 36 minority Chinese ethnic groups. Nature 2023:10.1038/d41586-023-01675-w. [PMID: 37316594 DOI: 10.1038/d41586-023-01675-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/16/2023]
|
39
|
Abondio P, Cilli E, Luiselli D. Human Pangenomics: Promises and Challenges of a Distributed Genomic Reference. Life (Basel) 2023; 13:1360. [PMID: 37374141 DOI: 10.3390/life13061360] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/15/2023] [Revised: 06/02/2023] [Accepted: 06/08/2023] [Indexed: 06/29/2023] Open
Abstract
A pangenome is a collection of the common and unique genomes that are present in a given species. It combines the genetic information of all the genomes sampled, resulting in a large and diverse range of genetic material. Pangenomic analysis offers several advantages compared to traditional genomic research. For example, a pangenome is not bound by the physical constraints of a single genome, so it can capture more genetic variability. Thanks to the introduction of the pangenome concept, it is possible to use exceedingly detailed sequence data to study the evolutionary history of two different species, or how populations within a species differ genetically. In the wake of the Human Pangenome Project, this review discusses the advantages of the pangenome for studying human genetic variation, framed around how pangenomic data can inform population genetics, phylogenetics, and public health policy by providing insights into the genetic basis of diseases or by determining personalized treatments that target the specific genetic profile of an individual. Moreover, technical limitations, ethical concerns, and legal considerations are discussed.
Affiliation(s)
- Paolo Abondio
- Laboratory of Ancient DNA, Department of Cultural Heritage, University of Bologna, Via degli Ariani 1, 48121 Ravenna, Italy
| | - Elisabetta Cilli
- Laboratory of Ancient DNA, Department of Cultural Heritage, University of Bologna, Via degli Ariani 1, 48121 Ravenna, Italy
| | - Donata Luiselli
- Laboratory of Ancient DNA, Department of Cultural Heritage, University of Bologna, Via degli Ariani 1, 48121 Ravenna, Italy
|
40
|
Matalon DR, Zepeda-Mendoza CJ, Aarabi M, Brown K, Fullerton SM, Kaur S, Quintero-Rivera F, Vatta M. Clinical, technical, and environmental biases influencing equitable access to clinical genetics/genomics testing: A points to consider statement of the American College of Medical Genetics and Genomics (ACMG). Genet Med 2023; 25:100812. [PMID: 37058144 DOI: 10.1016/j.gim.2023.100812] [Citation(s) in RCA: 14] [Impact Index Per Article: 7.0] [Received: 02/07/2023] [Accepted: 02/07/2023] [Indexed: 04/15/2023] Open
Affiliation(s)
- Dena R Matalon
- Division of Medical Genetics, Department of Pediatrics, Stanford Medicine, Stanford University, Stanford, CA
| | - Cinthya J Zepeda-Mendoza
- Divisions of Hematopathology and Laboratory Genetics and Genomics, Department of Laboratory Medicine and Pathology, Mayo Clinic, Rochester, MN
| | - Mahmoud Aarabi
- UPMC Medical Genetics and Genomics Laboratories, UPMC Magee-Womens Hospital, Pittsburgh, PA; Departments of Pathology and Obstetrics, Gynecology and Reproductive Sciences, University of Pittsburgh School of Medicine, Pittsburgh, PA
| | | | - Stephanie M Fullerton
- Division of Medical Genetics, Department of Medicine, University of Washington School of Medicine, Seattle, WA; Department of Bioethics & Humanities, University of Washington School of Medicine, Seattle, WA
| | - Shagun Kaur
- Department of Child Health, Phoenix Children's Hospital, University of Arizona College of Medicine-Phoenix, Phoenix, AZ
| | - Fabiola Quintero-Rivera
- Division of Genetic and Genomic Medicine, Departments of Pathology, Laboratory Medicine, and Pediatrics, University of California Irvine, Irvine, CA
|
41
|
Yang S, Kim SH, Kang M, Joo JY. Harnessing deep learning into hidden mutations of neurological disorders for therapeutic challenges. Arch Pharm Res 2023:10.1007/s12272-023-01450-5. [PMID: 37261600 DOI: 10.1007/s12272-023-01450-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/26/2023] [Accepted: 05/26/2023] [Indexed: 06/02/2023]
Abstract
Research on transcriptome-wide variation in neurological disorders is on the rise within the evolving field of genomic data science. Deep learning, which applies algorithms to massive amounts of data in a human-like manner, is expected to predict the dependency or druggability of hidden mutations within the genome. An enormous number of mutational variants in coding and noncoding transcripts have been discovered across the genome so far, despite the fine-tuned genetic proofreading machinery. These variants can induce various pathological conditions, including neurological disorders, which require lifelong care. Several limitations and questions emerge, including the reliance on conventional processes based on limited patient-derived sequence acquisition and decoding-based inference, as well as how rare variants can be identified as a population-specific etiology. Addressing these puzzles requires advanced systems for precise disease prediction, drug development and drug application. In this review, we summarize the pathophysiological discoveries of pathogenic variants in both coding and noncoding transcripts in neurological disorders, and the current advantages of deep learning applications. In addition, we discuss the challenges encountered and how to overcome them with advances in interpretation.
Affiliation(s)
- Sumin Yang
- Department of Pharmacy, College of Pharmacy, Hanyang University, Rm 407, Bldg.42, 55 Hanyangdaehak-Ro, Sangnok-Gu Ansan, Ansan, Gyeonggi-Do, 15588, Republic of Korea
| | - Sung-Hyun Kim
- Department of Pharmacy, College of Pharmacy, Hanyang University, Rm 407, Bldg.42, 55 Hanyangdaehak-Ro, Sangnok-Gu Ansan, Ansan, Gyeonggi-Do, 15588, Republic of Korea
| | - Mingon Kang
- Department of Computer Science, University of Nevada, Las Vegas, NV, 89154, USA
| | - Jae-Yeol Joo
- Department of Pharmacy, College of Pharmacy, Hanyang University, Rm 407, Bldg.42, 55 Hanyangdaehak-Ro, Sangnok-Gu Ansan, Ansan, Gyeonggi-Do, 15588, Republic of Korea.
|
42
|
Zhuo X, Hsu S, Purushotham D, Kuntala PK, Harrison JK, Du AY, Chen S, Li D, Wang T. Comparing genomic and epigenomic features across species using the WashU Comparative Epigenome Browser. Genome Res 2023; 33:824-835. [PMID: 37156621 PMCID: PMC10317122 DOI: 10.1101/gr.277550.122] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/29/2022] [Accepted: 05/03/2023] [Indexed: 05/10/2023]
Abstract
Genome browsers have become an intuitive and critical tool to visualize and analyze genomic features and data. Conventional genome browsers display data/annotations on a single reference genome/assembly; there are also genomic alignment viewer/browsers that help users visualize alignment, mismatch, and rearrangement between syntenic regions. However, there is a growing need for a comparative epigenome browser that can display genomic and epigenomic data sets across different species and enable users to compare them between syntenic regions. Here, we present the WashU Comparative Epigenome Browser. It allows users to load functional genomic data sets/annotations mapped to different genomes and display them over syntenic regions simultaneously. The browser also displays genetic differences between the genomes from single-nucleotide variants (SNVs) to structural variants (SVs) to visualize the association between epigenomic differences and genetic differences. Instead of anchoring all data sets to the reference genome coordinates, it creates independent coordinates of different genome assemblies to faithfully present features and data mapped to different genomes. It uses a simple, intuitive genome-align track to illustrate the syntenic relationship between different species. It extends the widely used WashU Epigenome Browser infrastructure and can be expanded to support multiple species. This new browser function will greatly facilitate comparative genomic/epigenomic research, as well as support the recent growing needs to directly compare and benchmark the T2T CHM13 assembly and other human genome assemblies.
Affiliation(s)
- Xiaoyu Zhuo
- Department of Genetics, Washington University School of Medicine, St. Louis, Missouri 63110, USA
- The Edison Family Center for Genome Sciences and Systems Biology, Washington University School of Medicine, St. Louis, Missouri 63110, USA
| | - Silas Hsu
- Department of Genetics, Washington University School of Medicine, St. Louis, Missouri 63110, USA
- The Edison Family Center for Genome Sciences and Systems Biology, Washington University School of Medicine, St. Louis, Missouri 63110, USA
| | - Deepak Purushotham
- Department of Genetics, Washington University School of Medicine, St. Louis, Missouri 63110, USA
- The Edison Family Center for Genome Sciences and Systems Biology, Washington University School of Medicine, St. Louis, Missouri 63110, USA
| | - Prashant Kumar Kuntala
- Department of Genetics, Washington University School of Medicine, St. Louis, Missouri 63110, USA
- The Edison Family Center for Genome Sciences and Systems Biology, Washington University School of Medicine, St. Louis, Missouri 63110, USA
| | - Jessica K Harrison
- Department of Genetics, Washington University School of Medicine, St. Louis, Missouri 63110, USA
- The Edison Family Center for Genome Sciences and Systems Biology, Washington University School of Medicine, St. Louis, Missouri 63110, USA
| | - Alan Y Du
- Department of Genetics, Washington University School of Medicine, St. Louis, Missouri 63110, USA
- The Edison Family Center for Genome Sciences and Systems Biology, Washington University School of Medicine, St. Louis, Missouri 63110, USA
| | - Samuel Chen
- Department of Genetics, Washington University School of Medicine, St. Louis, Missouri 63110, USA
- The Edison Family Center for Genome Sciences and Systems Biology, Washington University School of Medicine, St. Louis, Missouri 63110, USA
| | - Daofeng Li
- Department of Genetics, Washington University School of Medicine, St. Louis, Missouri 63110, USA
- The Edison Family Center for Genome Sciences and Systems Biology, Washington University School of Medicine, St. Louis, Missouri 63110, USA
| | - Ting Wang
- Department of Genetics, Washington University School of Medicine, St. Louis, Missouri 63110, USA
- The Edison Family Center for Genome Sciences and Systems Biology, Washington University School of Medicine, St. Louis, Missouri 63110, USA
- McDonnell Genome Institute, Washington University School of Medicine, St. Louis, Missouri 63110, USA
|
43
|
Mikhaylova V, Rzepka M, Kawamura T, Xia Y, Chang PL, Zhou S, Pham L, Modi N, Yao L, Perez-Agustin A, Pagans S, Boles TC, Lei M, Wang Y, Garcia-Bassets I, Chen Z. Targeted Phasing of 2-200 Kilobase DNA Fragments with a Short-Read Sequencer and a Single-Tube Linked-Read Library Method. bioRxiv 2023:2023.03.05.531179. [PMID: 36945366 PMCID: PMC10028795 DOI: 10.1101/2023.03.05.531179] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/11/2023]
Abstract
In the human genome, heterozygous sites are genomic positions with different alleles inherited from each parent. On average, there is a heterozygous site every 1-2 kilobases (kb). Resolving whether two alleles in neighboring heterozygous positions are physically linked (that is, phased) is possible with a short-read sequencer if the sequencing library captures long-range information. TELL-Seq is a library preparation method based on millions of barcoded micro-sized beads that enables instrument-free phasing of a whole human genome in a single PCR tube. TELL-Seq incorporates a unique molecular identifier (barcode) into the short reads generated from the same high-molecular-weight (HMW) DNA fragment (known as 'linked-reads'). However, genome-scale TELL-Seq is not cost-effective for applications focusing on a single locus or a few loci. Here, we present an optimized TELL-Seq protocol that enables the cost-effective phasing of enriched loci (targets) of varying sizes, purity levels, and heterozygosity. Targeted TELL-Seq maximizes linked-read efficiency and library yield while minimizing input requirements, fragment collisions on microbeads, and sequencing burden. To validate the targeted protocol, we phased seven 180-200 kb loci enriched by CRISPR/Cas9-mediated excision coupled with pulsed-field electrophoresis, four 20 kb loci enriched by CRISPR/Cas9-mediated protection from exonuclease digestion, and six 2-13 kb loci amplified by PCR. The selected targets have clinical and research relevance (BRCA1, BRCA2, MLH1, MSH2, MSH6, APC, PMS2, SCN5A-SCN10A, and PIK3CA). These analyses reveal that targeted TELL-Seq provides a reliable way of phasing allelic variants within targets (2-200 kb in length) with the low cost and high accuracy of short-read sequencing.
Affiliation(s)
| | - Madison Rzepka
- Universal Sequencing Technology Corp., Carlsbad, CA 92011, USA
| | | | - Yu Xia
- Universal Sequencing Technology Corp., Carlsbad, CA 92011, USA
| | - Peter L. Chang
- Universal Sequencing Technology Corp., Carlsbad, CA 92011, USA
| | | | - Long Pham
- Universal Sequencing Technology Corp., Carlsbad, CA 92011, USA
| | - Naisarg Modi
- Universal Sequencing Technology Corp., Carlsbad, CA 92011, USA
| | - Likun Yao
- Department of Medicine, University of California, San Diego, La Jolla, CA 92093 USA
| | - Adrian Perez-Agustin
- Department of Medical Sciences, School of Medicine, University of Girona, Girona, Spain
| | - Sara Pagans
- Department of Medical Sciences, School of Medicine, University of Girona, Girona, Spain
| | | | - Ming Lei
- Universal Sequencing Technology Corp., Canton, MA 02021, USA
| | - Yong Wang
- Universal Sequencing Technology Corp., Canton, MA 02021, USA
| | | | - Zhoutao Chen
- Universal Sequencing Technology Corp., Carlsbad, CA 92011, USA
|
44
|
Deorowicz S, Danek A, Li H. AGC: compact representation of assembled genomes with fast queries and updates. Bioinformatics 2023; 39:7067744. [PMID: 36864624 PMCID: PMC9994791 DOI: 10.1093/bioinformatics/btad097] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 06/23/2022] [Revised: 01/13/2023] [Indexed: 03/04/2023] Open
Abstract
MOTIVATION High-quality sequence assembly is the ultimate representation of complete genetic information of an individual. Several ongoing pangenome projects are producing collections of high-quality assemblies of various species. Each project has already generated assemblies of hundreds of gigabytes on disk, greatly impeding the distribution of and access to such rich datasets. RESULTS Here, we show how to reduce the size of the sequenced genomes by 2-3 orders of magnitude. Our tool compresses the genomes significantly better than the existing programs and is much faster. Moreover, its unique feature is the ability to access any contig (or its part) in a fraction of a second and easily append new samples to the compressed collections. Thanks to this, AGC could be useful not only for backup or transfer purposes but also for routine analysis of pangenome sequences in common pipelines. With the rapidly reduced cost and improved accuracy of sequencing technologies, we anticipate more comprehensive pangenome projects with much larger sample sizes. AGC is likely to become a foundation tool to store, distribute and access pangenome data. AVAILABILITY AND IMPLEMENTATION The source code of AGC is available at https://github.com/refresh-bio/agc. The package can be installed via Bioconda at https://anaconda.org/bioconda/agc. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Sebastian Deorowicz
- Department of Algorithmics and Software, Faculty of Automatic Control, Electronics and Computer Science, Silesian University of Technology, Akademicka 16, Gliwice 44-100, Poland
| | - Agnieszka Danek
- Department of Algorithmics and Software, Faculty of Automatic Control, Electronics and Computer Science, Silesian University of Technology, Akademicka 16, Gliwice 44-100, Poland
| | - Heng Li
- Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA 02215, USA
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA 02115, USA
|
45
|
Marco-Sola S, Eizenga JM, Guarracino A, Paten B, Garrison E, Moreto M. Optimal gap-affine alignment in O(s) space. Bioinformatics 2023; 39:7030690. [PMID: 36749013 PMCID: PMC9940620 DOI: 10.1093/bioinformatics/btad074] [Citation(s) in RCA: 15] [Impact Index Per Article: 7.5] [Received: 08/17/2022] [Revised: 01/02/2023] [Indexed: 02/08/2023] Open
Abstract
MOTIVATION Pairwise sequence alignment remains a fundamental problem in computational biology and bioinformatics. Recent advances in genomics and sequencing technologies demand faster and more scalable algorithms that can cope with the ever-increasing sequence lengths. Classical pairwise alignment algorithms based on dynamic programming are strongly limited by quadratic requirements in time and memory. The recently proposed wavefront alignment algorithm (WFA) introduced an efficient algorithm to perform exact gap-affine alignment in O(ns) time, where s is the optimal score and n is the sequence length. Notwithstanding these bounds, WFA's O(s²) memory requirements become computationally impractical for genome-scale alignments, leading to a need for further improvement. RESULTS In this article, we present the bidirectional WFA algorithm, the first gap-affine algorithm capable of computing optimal alignments in O(s) memory while retaining WFA's time complexity of O(ns). As a result, this work improves the lowest known memory bound O(n) to compute gap-affine alignments. In practice, our implementation never requires more than a few hundred MB when aligning noisy Oxford Nanopore Technologies reads up to 1 Mbp long, while maintaining competitive execution times. AVAILABILITY AND IMPLEMENTATION All code is publicly available at https://github.com/smarco/BiWFA-paper. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
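To make the quantity s in this abstract concrete, the sketch below computes an optimal gap-affine alignment score with the classical dynamic-programming recurrence (Gotoh-style), in which a gap of length k costs gap_open + k × gap_extend. This is explicitly not the WFA or BiWFA algorithm: it is the quadratic-time baseline that the abstract says those methods improve on, and it returns only the score, not the alignment. The function name and the penalty values (mismatch 4, gap open 6, gap extend 2) are assumptions chosen for illustration.

```python
import math


def gap_affine_score(a, b, mismatch=4, gap_open=6, gap_extend=2):
    """Optimal gap-affine alignment score (lower is better), classical DP.

    Illustrative baseline only: O(len(a) * len(b)) time, score-only,
    keeping two rows in memory. Penalty values are assumptions.
    """
    inf = math.inf
    n, m = len(a), len(b)
    # Row 0: aligning the empty prefix of `a` to b[:j] is one gap of length j.
    h_prev = [0] + [gap_open + j * gap_extend for j in range(1, m + 1)]
    d_prev = [inf] * (m + 1)  # best score ending in a gap that consumes `a`
    for i in range(1, n + 1):
        first = gap_open + i * gap_extend  # a[:i] against the empty prefix
        h_cur = [first] + [inf] * m
        d_cur = [first] + [inf] * m
        i_left = inf  # best score ending in a gap that consumes `b`
        for j in range(1, m + 1):
            # Either open a new gap from the best state or extend one.
            i_left = min(h_cur[j - 1] + gap_open + gap_extend,
                         i_left + gap_extend)
            d_cur[j] = min(h_prev[j] + gap_open + gap_extend,
                           d_prev[j] + gap_extend)
            sub = 0 if a[i - 1] == b[j - 1] else mismatch
            h_cur[j] = min(h_prev[j - 1] + sub, i_left, d_cur[j])
        h_prev, d_prev = h_cur, d_cur
    return h_prev[m]


print(gap_affine_score("ACGT", "AGT"))  # one deleted base: 6 + 2 → 8
```

Under these assumed penalties, identical sequences score 0 and a single one-base gap scores 8; wavefront methods compute the same optimal score while exploring only the cells reachable with score at most s.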
Affiliation(s)
- Santiago Marco-Sola
- Computer Sciences Department, Barcelona Supercomputing Center, Barcelona 08034, Spain
- Departament d'Arquitectura de Computadors i Sistemes Operatius, Universitat Autònoma de Barcelona, Barcelona 08193, Spain
| | - Jordan M Eizenga
- Genomics Institute, University of California Santa Cruz, Santa Cruz, CA 95064, USA
| | - Andrea Guarracino
- Genomics Research Centre, Human Technopole, Milan 20157, Italy.,Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN 38163, USA
| | - Benedict Paten
- Genomics Institute, University of California Santa Cruz, Santa Cruz, CA 95064, USA
| | - Erik Garrison
- Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN 38163, USA
| | - Miquel Moreto
- Computer Sciences Department, Barcelona Supercomputing Center, Barcelona 08034, Spain.,Departament d'Arquitectura de Computadors, Universitat Politècnica de Catalunya, Barcelona 08034, Spain
|
46
|
Mostajo-Radji MA. A Latin American perspective on neurodiplomacy. FRONTIERS IN MEDICAL TECHNOLOGY 2023; 4:1005043. [PMID: 36712171 PMCID: PMC9880232 DOI: 10.3389/fmedt.2022.1005043] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/27/2022] [Accepted: 12/08/2022] [Indexed: 01/15/2023]
Affiliation(s)
- Mohammed A. Mostajo-Radji
- UCSC Genomics Institute, University of California Santa Cruz, Santa Cruz, CA, United States
- Live Cell Biotechnology Discovery Lab, University of California Santa Cruz, Santa Cruz, CA, United States
|
47
|
Dylus D, Altenhoff A, Majidian S, Sedlazeck FJ, Dessimoz C. Read2Tree: scalable and accurate phylogenetic trees from raw reads. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2022:2022.04.18.488678. [PMID: 36561179 PMCID: PMC9774205 DOI: 10.1101/2022.04.18.488678] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/24/2022]
Abstract
The inference of phylogenetic trees is foundational to biology. However, state-of-the-art phylogenomics requires running complex pipelines, at significant computational and labour costs, with additional constraints in sequencing coverage, assembly and annotation quality. To overcome these challenges, we present Read2Tree, which directly processes raw sequencing reads into groups of corresponding genes. In a benchmark encompassing a broad variety of datasets, our assembly-free approach was 10-100x faster than conventional approaches, and in most cases more accurate, the exception being when sequencing coverage was high and reference species very distant. To illustrate the broad applicability of the tool, we reconstructed a yeast tree of life of 435 species spanning 590 million years of evolution. Applied to Coronaviridae samples, Read2Tree accurately classified highly diverse animal samples and near-identical SARS-CoV-2 sequences on a single tree, thereby exhibiting remarkable breadth and depth. The speed, accuracy, and versatility of Read2Tree enable comparative genomics at scale.
Affiliation(s)
- David Dylus
- Department of Computational Biology, University of Lausanne, 1015 Lausanne, Switzerland
- present address: F. Hoffmann-La Roche Ltd, Immunology, Infectious Disease, and Ophthalmology (I2O), Roche Pharmaceutical Research and Early Development (pRED), Basel, 4070, Switzerland
- SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
- Adrian Altenhoff
- SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
- Department of Computer Science, ETH, 8092 Zurich, Switzerland
- Sina Majidian
- Department of Computational Biology, University of Lausanne, 1015 Lausanne, Switzerland
- SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
- Fritz J Sedlazeck
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, 77030, USA
- Department of Computer Science, Rice University, Houston, TX, 77005, USA
- Christophe Dessimoz
- Department of Computational Biology, University of Lausanne, 1015 Lausanne, Switzerland
- SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
- Department of Computer Science, University College London, London WC1E 6BT, UK
- Centre for Life’s Origins and Evolution, Department of Genetics, Evolution and Environment, University College London, London WC1E, UK
|
48
|
Modzelewski AJ, Gan Chong J, Wang T, He L. Mammalian genome innovation through transposon domestication. Nat Cell Biol 2022; 24:1332-1340. [PMID: 36008480 PMCID: PMC9729749 DOI: 10.1038/s41556-022-00970-4] [Citation(s) in RCA: 41] [Impact Index Per Article: 13.7] [Received: 03/13/2022] [Accepted: 06/27/2022] [Indexed: 01/13/2023]
Abstract
Since the discovery of transposons, their sheer abundance in host genomes has puzzled many. While historically viewed as largely harmless 'parasitic' DNAs during evolution, transposons are not a mere record of ancient genome invasion. Instead, nearly every element of transposon biology has been integrated into host biology. Here we review how host genome sequences introduced by transposon activities provide raw material for genome innovation and document the distinct evolutionary path of each species.
Affiliation(s)
- Andrew J Modzelewski
- Division of Cellular and Developmental Biology, MCB Department, University of California, Berkeley, CA, USA
- Department of Biomedical Sciences, School of Veterinary Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Johnny Gan Chong
- Division of Cellular and Developmental Biology, MCB Department, University of California, Berkeley, CA, USA
- Ting Wang
- Department of Genetics, Edison Family Center for Genome Science and System Biology, McDonnell Genome Institute, Washington University School of Medicine, St. Louis, MO, USA
- Lin He
- Division of Cellular and Developmental Biology, MCB Department, University of California, Berkeley, CA, USA
|
49
|
Zhou Y, Yang L, Han X, Han J, Hu Y, Li F, Xia H, Peng L, Boschiero C, Rosen BD, Bickhart DM, Zhang S, Guo A, Van Tassell CP, Smith TPL, Yang L, Liu GE. Assembly of a pangenome for global cattle reveals missing sequences and novel structural variations, providing new insights into their diversity and evolutionary history. Genome Res 2022; 32:1585-1601. [PMID: 35977842 PMCID: PMC9435747 DOI: 10.1101/gr.276550.122] [Citation(s) in RCA: 30] [Impact Index Per Article: 10.0] [Received: 01/02/2022] [Accepted: 07/21/2022] [Indexed: 02/03/2023]
Abstract
A cattle pangenome representation was created based on the genome sequences of 898 cattle representing 57 breeds. The pangenome identified 83 Mb of sequence not found in the cattle reference genome, representing 3.1% novel sequence compared with the 2.71-Gb reference. A catalog of structural variants developed from this cattle population identified 3.3 million deletions, 0.12 million inversions, and 0.18 million duplications. Estimates of breed ancestry and hybridization between cattle breeds using insertion/deletions as markers were similar to those produced by single nucleotide polymorphism-based analysis. Hundreds of deletions were observed to have stratification based on subspecies and breed. For example, an insertion of a Bov-tA1 repeat element was identified in the first intron of the APPL2 gene and correlated with cattle breed geographic distribution. This insertion falls within a segment overlapping predicted enhancer and promoter regions of the gene, and could affect important traits such as immune response, olfactory functions, cell proliferation, and glucose metabolism in muscle. The results indicate that pangenomes are a valuable resource for studying diversity and evolutionary history, and help to delineate how domestication, trait-based breeding, and adaptive introgression have shaped the cattle genome.
Affiliation(s)
- Yang Zhou
- Key Laboratory of Agricultural Animal Genetics, Breeding and Reproduction of Ministry of Education, Huazhong Agricultural University, Wuhan 430070, China
- Lv Yang
- Key Laboratory of Agricultural Animal Genetics, Breeding and Reproduction of Ministry of Education, Huazhong Agricultural University, Wuhan 430070, China
- Xiaotao Han
- Key Laboratory of Agricultural Animal Genetics, Breeding and Reproduction of Ministry of Education, Huazhong Agricultural University, Wuhan 430070, China
- Jiazheng Han
- Key Laboratory of Agricultural Animal Genetics, Breeding and Reproduction of Ministry of Education, Huazhong Agricultural University, Wuhan 430070, China
- Yan Hu
- Key Laboratory of Agricultural Animal Genetics, Breeding and Reproduction of Ministry of Education, Huazhong Agricultural University, Wuhan 430070, China
- Fan Li
- Key Laboratory of Agricultural Animal Genetics, Breeding and Reproduction of Ministry of Education, Huazhong Agricultural University, Wuhan 430070, China
- Han Xia
- Key Laboratory of Agricultural Animal Genetics, Breeding and Reproduction of Ministry of Education, Huazhong Agricultural University, Wuhan 430070, China
- Lingwei Peng
- Key Laboratory of Agricultural Animal Genetics, Breeding and Reproduction of Ministry of Education, Huazhong Agricultural University, Wuhan 430070, China
- Clarissa Boschiero
- Animal Genomics and Improvement Laboratory, BARC, USDA-ARS, Beltsville, Maryland 20705, USA
- Benjamin D Rosen
- Animal Genomics and Improvement Laboratory, BARC, USDA-ARS, Beltsville, Maryland 20705, USA
- Derek M Bickhart
- Dairy Forage Research Center, ARS USDA, Madison, Wisconsin 53706, USA
- Shujun Zhang
- Key Laboratory of Agricultural Animal Genetics, Breeding and Reproduction of Ministry of Education, Huazhong Agricultural University, Wuhan 430070, China
- Aizhen Guo
- The State Key Laboratory of Agricultural Microbiology, Huazhong Agricultural University, Wuhan 430070, China
- Curtis P Van Tassell
- Animal Genomics and Improvement Laboratory, BARC, USDA-ARS, Beltsville, Maryland 20705, USA
- Timothy P L Smith
- U.S. Meat Animal Research Center, ARS USDA, Clay Center, Nebraska 68933, USA
- Liguo Yang
- Key Laboratory of Agricultural Animal Genetics, Breeding and Reproduction of Ministry of Education, Huazhong Agricultural University, Wuhan 430070, China
- George E Liu
- Animal Genomics and Improvement Laboratory, BARC, USDA-ARS, Beltsville, Maryland 20705, USA
|
50
|
Löytynoja A. Thousands of human mutation clusters are explained by short-range template switching. Genome Res 2022; 32:1437-1447. [PMID: 35760560 PMCID: PMC9435742 DOI: 10.1101/gr.276478.121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/09/2021] [Accepted: 06/21/2022] [Indexed: 02/03/2023]
Abstract
Variation within human genomes is unevenly distributed, and variants show spatial clustering. DNA replication-related template switching is a poorly known mutational mechanism capable of causing major chromosomal rearrangements as well as creating short inverted sequence copies that appear as local mutation clusters in sequence comparisons. In this study, haplotype-resolved genome assemblies representing 25 human populations and multinucleotide variants aggregated from 140,000 human sequencing experiments were reanalyzed. Local template switching could explain thousands of complex mutation clusters across the human genome, the loci segregating within and between populations. During the study, computational tools were developed for identification of template switch events using both short-read sequencing data and genotype data, and for genotyping candidate loci using short-read data. The characteristics of template-switch mutations complicate their detection, and widely used analysis pipelines for short-read sequencing data, normally capable of identifying single nucleotide changes, were found to miss template-switch mutations of tens of base pairs, potentially invalidating medical genetic studies searching for a causative allele behind genetic diseases. Combined with the massive sequencing data now available for humans, the novel tools described here enable building catalogs of affected loci and studying the cellular mechanisms behind template switching in both healthy organisms and disease.
Affiliation(s)
- Ari Löytynoja
- Institute of Biotechnology, University of Helsinki, FI-00014 Helsinki, Finland
|