1
Xu L, He W, Tai S, Huang X, Qin M, Liao X, Jing Y, Yang J, Fang X, Shi J, Jin N. VCF2Dis: an ultra-fast and efficient tool to calculate pairwise genetic distance and construct population phylogeny from VCF files. Gigascience 2025; 14:giaf032. [PMID: 40184433 PMCID: PMC11970368 DOI: 10.1093/gigascience/giaf032] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/13/2024] [Revised: 01/10/2025] [Accepted: 03/04/2025] [Indexed: 04/06/2025]
Abstract
BACKGROUND: Genetic distance metrics are crucial for understanding the evolutionary relationships and population structure of organisms. Progress in next-generation sequencing technology has given rise to genotyping data for thousands of individuals. The standard Variant Call Format (VCF) is widely used to store genomic variation information, but calculating genetic distance and constructing population phylogeny directly from large VCF files can be challenging. Moreover, existing tools that implement such functions remain limited and perform poorly on large-scale genotype data, especially in terms of memory efficiency.
FINDINGS: To address these challenges, we introduce VCF2Dis, an ultra-fast and efficient tool that calculates pairwise genetic distance directly from large VCF files and then constructs distance-based population phylogeny using the ape package. Benchmarking results demonstrate the tool's efficiency, with rapid processing times, minimal memory usage (e.g., 0.37 GB for the complete analysis of 2,504 samples with 81.2 million variants), and high accuracy, even when handling datasets with millions of variants from thousands of individuals. Its straightforward command-line interface, compatibility with downstream phylogenetic analysis tools (e.g., MEGA, Phylip, and FastTree), and support for multithreading make it a valuable tool for researchers studying population relationships. Owing to these advantages, VCF2Dis has already been used in many published genomic studies.
CONCLUSION: We present VCF2Dis, a straightforward and efficient tool for calculating genetic distance and constructing population phylogeny directly from large-scale genotype data. VCF2Dis has been widely applied, facilitating the exploration of population relationships in extensive genome sequencing studies.
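As an illustration of what "pairwise genetic distance directly from a VCF file" can mean, the following is a minimal Python sketch. It is not VCF2Dis code: the abstract does not state the distance formula, so the sketch assumes a simple allele-mismatch proportion averaged over sites where both samples are genotyped, and the function names (`parse_gt`, `site_distance`, `pairwise_distance`) are hypothetical. Records are processed one line at a time, so only the running per-pair totals are held in memory.

```python
from itertools import combinations


def parse_gt(field):
    """Return the two allele codes of a diploid GT entry, or None if missing."""
    gt = field.split(":")[0].replace("|", "/")
    alleles = gt.split("/")
    if len(alleles) != 2 or "." in alleles:
        return None
    return tuple(alleles)


def site_distance(a, b):
    """Assumed per-site metric: share of unmatched alleles (0, 0.5 or 1)."""
    remaining = list(b)
    shared = 0
    for allele in a:
        if allele in remaining:
            remaining.remove(allele)
            shared += 1
    return 1.0 - shared / 2.0


def pairwise_distance(vcf_lines):
    """Stream VCF lines and return (sample names, symmetric distance matrix)."""
    samples, totals, counts = [], {}, {}
    for line in vcf_lines:
        if line.startswith("##"):
            continue  # meta-information lines
        cols = line.rstrip("\n").split("\t")
        if line.startswith("#CHROM"):
            samples = cols[9:]  # sample columns start after FORMAT
            continue
        gts = [parse_gt(f) for f in cols[9:]]
        for i, j in combinations(range(len(samples)), 2):
            if gts[i] is None or gts[j] is None:
                continue  # skip sites with a missing genotype in either sample
            totals[(i, j)] = totals.get((i, j), 0.0) + site_distance(gts[i], gts[j])
            counts[(i, j)] = counts.get((i, j), 0) + 1
    n = len(samples)
    matrix = [[0.0] * n for _ in range(n)]
    for (i, j), total in totals.items():
        matrix[i][j] = matrix[j][i] = total / counts[(i, j)]
    return samples, matrix
```

A real tool would also handle multi-threading, haploid or polyploid calls, and compressed input; none of that is attempted here.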
Affiliation(s)
- Lian Xu
  - Institute for Translational Neuroscience of Affiliated Hospital 2 of Nantong University, Center for Neural Developmental and Degenerative Research of Nantong University, Key Laboratory of Neurodegenerative Diseases, Nantong, Jiangsu 226014, China
  - Key Laboratory of Neuroregeneration, Ministry of Education and Jiangsu Province, Co-innovation Center of Neuroregeneration, NMPA Key Laboratory for Research and Evaluation of Tissue Engineering Technology Products, Nantong University, Nantong, Jiangsu 226001, China
- Weiming He
  - BGI Research, Shenzhen 518083, China
  - BGI Research, Sanya 572025, China
- Xiaoli Huang
  - Institute for Translational Neuroscience of Affiliated Hospital 2 of Nantong University, Center for Neural Developmental and Degenerative Research of Nantong University, Key Laboratory of Neurodegenerative Diseases, Nantong, Jiangsu 226014, China
- Mumu Qin
  - BGI Research, Sanya 572025, China
- Xun Liao
  - BGI Research, Shenzhen 518083, China
- Yi Jing
  - BGI Research, Sanya 572025, China
- Jian Yang
  - Key Laboratory of Neuroregeneration, Ministry of Education and Jiangsu Province, Co-innovation Center of Neuroregeneration, NMPA Key Laboratory for Research and Evaluation of Tissue Engineering Technology Products, Nantong University, Nantong, Jiangsu 226001, China
- Xiaodong Fang
  - BGI Research, Shenzhen 518083, China
  - BGI Research, Sanya 572025, China
- Jianhua Shi
  - Institute for Translational Neuroscience of Affiliated Hospital 2 of Nantong University, Center for Neural Developmental and Degenerative Research of Nantong University, Key Laboratory of Neurodegenerative Diseases, Nantong, Jiangsu 226014, China
- Nana Jin
  - Institute for Translational Neuroscience of Affiliated Hospital 2 of Nantong University, Center for Neural Developmental and Degenerative Research of Nantong University, Key Laboratory of Neurodegenerative Diseases, Nantong, Jiangsu 226014, China
  - Key Laboratory of Neuroregeneration, Ministry of Education and Jiangsu Province, Co-innovation Center of Neuroregeneration, NMPA Key Laboratory for Research and Evaluation of Tissue Engineering Technology Products, Nantong University, Nantong, Jiangsu 226001, China
2
Veilleux CC, Garrett EC, Pajic P, Saitou M, Ochieng J, Dagsaan LD, Dominy NJ, Perry GH, Gokcumen O, Melin AD. Human subsistence and signatures of selection on chemosensory genes. Commun Biol 2023; 6:683. [PMID: 37400713 DOI: 10.1038/s42003-023-05047-y] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 05/23/2022] [Accepted: 06/15/2023] [Indexed: 07/05/2023]
Abstract
Chemosensation (olfaction, taste) is essential for detecting and assessing foods, such that dietary shifts elicit evolutionary changes in vertebrate chemosensory genes. The transition from hunting and gathering to agriculture dramatically altered how humans acquire food. Recent genetic and linguistic studies suggest agriculture may have precipitated olfactory degeneration. Here, we explore the effects of subsistence behaviors on olfactory (OR) and taste (TASR) receptor genes among rainforest foragers and neighboring agriculturalists in Africa and Southeast Asia. We analyze 378 functional OR and 26 functional TASR genes in 133 individuals across populations in Uganda (Twa, Sua, BaKiga) and the Philippines (Agta, Mamanwa, Manobo) with differing subsistence histories. We find no evidence of relaxed selection on chemosensory genes in agricultural populations. However, we identify subsistence-related signatures of local adaptation on chemosensory genes within each geographic region. Our results highlight the importance of culture, subsistence economy, and drift in human chemosensory perception.
Affiliation(s)
- Carrie C Veilleux
  - Department of Anatomy, Midwestern University, 19555 N 59th Ave, Glendale, AZ, 85308, USA
  - Department of Anthropology & Archaeology, University of Calgary, 2500 University Drive NW, Calgary, AB, T2N 1N4, Canada
- Eva C Garrett
  - Department of Anthropology & Archaeology, University of Calgary, 2500 University Drive NW, Calgary, AB, T2N 1N4, Canada
  - Department of Anthropology, Boston University, 232 Bay State Road, Boston, MA, 02215, USA
- Petar Pajic
  - Department of Biological Sciences, University at Buffalo, 109 Cooke Hall, Buffalo, NY, 14260, USA
- Marie Saitou
  - Department of Biological Sciences, University at Buffalo, 109 Cooke Hall, Buffalo, NY, 14260, USA
- Joseph Ochieng
  - Department of Anatomy, Makerere University College of Health Sciences, Kampala, Uganda
- Lilia D Dagsaan
  - National Commission for Indigenous Peoples, Botolan, Philippines
- Nathaniel J Dominy
  - Department of Anthropology, Dartmouth College, 6047 Silsby Hall, Hanover, NH, 03755, USA
- George H Perry
  - Departments of Anthropology and Biology, The Pennsylvania State University, 410 Carpenter Building, University Park, PA, 16802, USA
- Omer Gokcumen
  - Department of Biological Sciences, University at Buffalo, 109 Cooke Hall, Buffalo, NY, 14260, USA
- Amanda D Melin
  - Department of Anthropology & Archaeology, University of Calgary, 2500 University Drive NW, Calgary, AB, T2N 1N4, Canada
  - Department of Medical Genetics, University of Calgary, 3330 Hospital Drive NW, Calgary, AB, T2N 4N1, Canada
  - Alberta Children's Hospital Research Institute, 3330 Hospital Dr. NW, Calgary, AB, T2N 4N1, Canada
3
Saitou M, Masuda N, Gokcumen O. Similarity-Based Analysis of Allele Frequency Distribution among Multiple Populations Identifies Adaptive Genomic Structural Variants. Mol Biol Evol 2022; 39:msab313. [PMID: 34718708 PMCID: PMC8896759 DOI: 10.1093/molbev/msab313] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Indexed: 11/30/2022]
Abstract
Structural variants have a considerable impact on human genomic diversity. However, their evolutionary history remains mostly unexplored. Here, we developed a new method to identify potentially adaptive structural variants based on a similarity-based analysis that incorporates genotype frequency data from 26 populations simultaneously. Using this method, we analyzed 57,629 structural variants and identified 576 structural variants that show unusual population differentiation. Of these putatively adaptive structural variants, we further showed that 24 variants are multiallelic and overlap with coding sequences, and 20 variants are significantly associated with GWAS traits. Closer inspection of the haplotypic variation associated with these putatively adaptive and functional structural variants reveals deviations from neutral expectations due to: 1) population differentiation of rapidly evolving multiallelic variants, 2) incomplete sweeps, and 3) recent population-specific negative selection. Overall, our study provides new methodological insights, documents hundreds of putatively adaptive variants, and introduces evolutionary models that may better explain the complex evolution of structural variants.
Affiliation(s)
- Marie Saitou
  - Department of Biological Sciences, University at Buffalo, State University of New York, Buffalo, NY, USA
  - Section of Genetic Medicine, Department of Medicine, The University of Chicago, Chicago, IL, USA
- Naoki Masuda
  - Department of Mathematics, University at Buffalo, State University of New York, Buffalo, NY, USA
  - Computational and Data-Enabled Science and Engineering Program, University at Buffalo, State University of New York, Buffalo, NY, USA
- Omer Gokcumen
  - Department of Biological Sciences, University at Buffalo, State University of New York, Buffalo, NY, USA
4
Saitou M, Resendez S, Pradhan AJ, Wu F, Lie NC, Hall NJ, Zhu Q, Reinholdt L, Satta Y, Speidel L, Nakagome S, Hanchard NA, Churchill G, Lee C, Atilla-Gokcumen GE, Mu X, Gokcumen O. Sex-specific phenotypic effects and evolutionary history of an ancient polymorphic deletion of the human growth hormone receptor. SCIENCE ADVANCES 2021; 7:eabi4476. [PMID: 34559564 PMCID: PMC8462886 DOI: 10.1126/sciadv.abi4476] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 03/10/2021] [Accepted: 08/04/2021] [Indexed: 06/13/2023]
Abstract
The common deletion of the third exon of the growth hormone receptor gene (GHRd3) in humans is associated with birth weight, growth after birth, and time of puberty. However, its evolutionary history and the molecular mechanisms through which it affects phenotypes remain unresolved. We present evidence that this deletion was nearly fixed in the ancestral population of anatomically modern humans and Neanderthals but underwent a recent adaptive reduction in frequency in East Asia. We documented that GHRd3 is associated with protection from severe malnutrition. Using a novel mouse model, we found that, under calorie restriction, Ghrd3 leads to female-like gene expression in male livers and the disappearance of sexual dimorphism in weight. The sex- and diet-dependent effects of GHRd3 in our mouse model are consistent with a model in which the allele frequency of GHRd3 varies throughout human evolution as a response to fluctuations in resource availability.
Affiliation(s)
- Marie Saitou
  - Department of Biological Sciences, University at Buffalo, Buffalo, NY, USA
- Skyler Resendez
  - Department of Biological Sciences, University at Buffalo, Buffalo, NY, USA
- Fuguo Wu
  - Department of Ophthalmology, Ross Eye Institute, Jacobs School of Medicine and Biological Sciences, University at Buffalo, Buffalo, NY, USA
- Natasha C. Lie
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, USA
- Nancy J. Hall
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, USA
- Qihui Zhu
  - The Jackson Laboratory for Genomic Medicine, Farmington, CT, USA
- Yoko Satta
  - Department of Evolutionary Studies of Biosystems, SOKENDAI (Graduate University for Advanced Studies), Kanagawa Prefecture, Japan
- Leo Speidel
  - University College London, Genetics Institute, London, UK
  - The Francis Crick Institute, London, UK
- Neil A. Hanchard
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, USA
- Charles Lee
  - The Jackson Laboratory for Genomic Medicine, Farmington, CT, USA
  - Precision Medicine Center, The First Affiliated Hospital of Xi’an Jiaotong University, Shaanxi, People’s Republic of China
- Xiuqian Mu
  - Department of Ophthalmology, Ross Eye Institute, Jacobs School of Medicine and Biological Sciences, University at Buffalo, Buffalo, NY, USA
- Omer Gokcumen
  - Department of Biological Sciences, University at Buffalo, Buffalo, NY, USA
5
Course MM, Gudsnuk K, Smukowski SN, Winston K, Desai N, Ross JP, Sulovari A, Bourassa CV, Spiegelman D, Couthouis J, Yu CE, Tsuang DW, Jayadev S, Kay MA, Gitler AD, Dupre N, Eichler EE, Dion PA, Rouleau GA, Valdmanis PN. Evolution of a Human-Specific Tandem Repeat Associated with ALS. Am J Hum Genet 2020; 107:445-460. [PMID: 32750315 DOI: 10.1016/j.ajhg.2020.07.004] [Citation(s) in RCA: 36] [Impact Index Per Article: 7.2] [Received: 04/23/2020] [Accepted: 07/08/2020] [Indexed: 12/12/2022]
Abstract
Tandem repeats are proposed to contribute to human-specific traits, and more than 40 tandem repeat expansions are known to cause neurological disease. Here, we characterize a human-specific 69 bp variable number tandem repeat (VNTR) in the last intron of WDR7, which exhibits striking variability in both copy number and nucleotide composition, as revealed by long-read sequencing. In addition, greater repeat copy number is significantly enriched in three independent cohorts of individuals with sporadic amyotrophic lateral sclerosis (ALS). Each unit of the repeat forms a stem-loop structure with the potential to produce microRNAs, and the repeat RNA can aggregate when expressed in cells. We leveraged its remarkable sequence variability to align the repeat in 288 samples and uncover its mechanism of expansion. We found that the repeat expands in the 3'-5' direction, in groups of repeat units divisible by two. The expansion patterns we observed were consistent with duplication events, and a replication error called template switching. We also observed that the VNTR is expanded in both Denisovan and Neanderthal genomes but is fixed at one copy or fewer in non-human primates. Evaluating the repeat in 1000 Genomes Project samples reveals that some repeat segments are solely present or absent in certain geographic populations. The large size of the repeat unit in this VNTR, along with our multiplexed sequencing strategy, provides an unprecedented opportunity to study mechanisms of repeat expansion, and a framework for evaluating the roles of VNTRs in human evolution and disease.
6
Subramanian S, Ramasamy U, Chen D. VCF2PopTree: a client-side software to construct population phylogeny from genome-wide SNPs. PeerJ 2019; 7:e8213. [PMID: 31824783 PMCID: PMC6901002 DOI: 10.7717/peerj.8213] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.7] [Received: 06/21/2019] [Accepted: 11/14/2019] [Indexed: 02/02/2023]
Abstract
In the past decades, a number of software programs have been developed to infer phylogenetic relationships between populations. However, most of these programs use alignments of gene sequences to build phylogenies. Recently, many standalone or web applications have been developed to handle large-scale whole-genome data, but they are either computationally intensive, dependent on third-party software, or require significant time and resources on a web server. In the post-genomic era, researchers are able to obtain bioinformatically processed, high-quality, publication-ready whole-genome data for many individuals in a population from next-generation sequencing companies, owing to the reduction in the cost of sequencing and analysis. Such genotype data are typically presented in the Variant Call Format (VCF), and there is no simple software available that directly uses this data format to construct the phylogeny of populations in a short time. To address this limitation, we have developed user-friendly software, VCF2PopTree, that uses genome-wide SNPs to construct and display phylogenetic trees in seconds to minutes. For example, it reads a VCF file containing 4 million SNPs and draws a tree in less than 30 seconds. VCF2PopTree accepts genotype data from a local machine, constructs a tree using the UPGMA and Neighbour-Joining algorithms, and displays it in a web browser. It also produces a pairwise-diversity matrix in MEGA and PHYLIP file formats, as well as trees in the Newick format, which can be used directly by other popular phylogenetic software programs. The software, including the source code, a test VCF file, and documentation, is available at: https://github.com/sansubs/vcf2pop.
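The step from a pairwise distance matrix to a Newick tree can be sketched in a few lines of Python. This is a generic textbook UPGMA, not code taken from VCF2PopTree; the function name `upgma` and the four-decimal branch-length formatting are assumptions made for illustration.

```python
def upgma(names, dist):
    """Cluster a symmetric distance matrix with UPGMA; return a Newick string."""
    # Each cluster id maps to (Newick label, number of leaves, node height).
    clusters = {i: (name, 1, 0.0) for i, name in enumerate(names)}
    # Pairwise distances keyed by (smaller id, larger id).
    d = {(i, j): dist[i][j]
         for i in range(len(names)) for j in range(i + 1, len(names))}
    next_id = len(names)
    while len(clusters) > 1:
        # Merge the closest pair of clusters.
        (a, b), dab = min(d.items(), key=lambda kv: kv[1])
        label_a, size_a, height_a = clusters.pop(a)
        label_b, size_b, height_b = clusters.pop(b)
        height = dab / 2.0  # ultrametric tree: new node sits at half the distance
        label = "(%s:%.4f,%s:%.4f)" % (
            label_a, height - height_a, label_b, height - height_b)
        # Distance to every remaining cluster is the size-weighted average.
        new_d = {}
        for c in clusters:
            dac = d[(min(a, c), max(a, c))]
            dbc = d[(min(b, c), max(b, c))]
            new_d[(c, next_id)] = (size_a * dac + size_b * dbc) / (size_a + size_b)
        d = {k: v for k, v in d.items() if a not in k and b not in k}
        d.update(new_d)
        clusters[next_id] = (label, size_a + size_b, height)
        next_id += 1
    (label, _, _), = clusters.values()
    return label + ";"
```

UPGMA assumes a constant rate of change along all lineages; Neighbour-Joining, the other algorithm named in the abstract, relaxes that assumption and is not shown here.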
Affiliation(s)
- Sankar Subramanian
  - GeneCology Research Centre, The University of the Sunshine Coast, Sippy Downs, QLD, Australia
- Umayal Ramasamy
  - GeneCology Research Centre, The University of the Sunshine Coast, Sippy Downs, QLD, Australia
- David Chen
  - School of Information and Communication Technology, Griffith University, Nathan, QLD, Australia
7
Saitou M, Gokcumen O. Resolving the Insertion Sites of Polymorphic Duplications Reveals a HERC2 Haplotype under Selection. Genome Biol Evol 2019; 11:1679-1690. [PMID: 31124564 PMCID: PMC6587411 DOI: 10.1093/gbe/evz107] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Accepted: 05/19/2019] [Indexed: 12/18/2022]
Abstract
Polymorphic duplications in humans have been shown to contribute to phenotypic diversity. However, the evolutionary forces that maintain variable duplications across the human genome are largely unexplored. We developed a linkage-disequilibrium-based method to detect insertion sites of polymorphic duplications not represented in reference genomes. This method also allows resolution of haplotypes harboring the duplications. Using this approach, we conducted genome-wide analyses and identified the insertion sites of 22 common polymorphic duplications. We found that the majority of these duplications are intrachromosomal and only one of them is an interchromosomal insertion. Further characterization of these duplications revealed significant associations with blood and skin phenotypes. On the basis of population genetics analyses, we found that the duplication of a well-characterized pigmentation-related region, including the HERC2 gene, may be selected against in European populations. We further demonstrated that the haplotype harboring this duplication significantly affects the expression of the HERC2P9 gene in multiple tissues. Our study sheds light on the evolutionary impact of understudied polymorphic duplications in human populations and presents methodological insights for future studies.
Affiliation(s)
- Marie Saitou
  - Department of Biological Sciences, SUNY at Buffalo
8
Pajic P, Pavlidis P, Dean K, Neznanova L, Romano RA, Garneau D, Daugherity E, Globig A, Ruhl S, Gokcumen O. Independent amylase gene copy number bursts correlate with dietary preferences in mammals. eLife 2019; 8:e44628. [PMID: 31084707 PMCID: PMC6516957 DOI: 10.7554/elife.44628] [Citation(s) in RCA: 71] [Impact Index Per Article: 11.8] [Received: 12/21/2018] [Accepted: 04/07/2019] [Indexed: 12/28/2022]
Abstract
The amylase gene (AMY), which codes for a starch-digesting enzyme in animals, underwent several gene copy number gains in humans (Perry et al., 2007), dogs (Axelsson et al., 2013), and mice (Schibler et al., 1982), possibly along with increased starch consumption during the evolution of these species. Here, we present comprehensive evidence for AMY copy number expansions that independently occurred in several mammalian species which consume diets rich in starch. We also provide correlative evidence that AMY gene duplications may be an essential first step for amylase to be expressed in saliva. Our findings underscore the overall importance of gene copy number amplification as a flexible and fast evolutionary mechanism that can independently occur in different branches of the phylogeny.
Affiliation(s)
- Petar Pajic
  - Department of Biological Sciences, University at Buffalo, The State University of New York, New York, United States
  - Department of Oral Biology, School of Dental Medicine, University at Buffalo, The State University of New York, New York, United States
- Pavlos Pavlidis
  - Institute of Computer Science (ICS), Foundation for Research and Technology – Hellas, Heraklion, Greece
- Kirsten Dean
  - Department of Biological Sciences, University at Buffalo, The State University of New York, New York, United States
- Lubov Neznanova
  - Department of Oral Biology, School of Dental Medicine, University at Buffalo, The State University of New York, New York, United States
- Rose-Anne Romano
  - Department of Oral Biology, School of Dental Medicine, University at Buffalo, The State University of New York, New York, United States
- Danielle Garneau
  - Center for Earth and Environmental Science, Plattsburgh State University, New York, United States
- Erin Daugherity
  - Cornell Center for Animal Resources and Education, Cornell University, New York, United States
- Anja Globig
  - Friedrich-Loeffler-Institut, Federal Research Institute for Animal Health, Greifswald, Germany
- Stefan Ruhl
  - Department of Oral Biology, School of Dental Medicine, University at Buffalo, The State University of New York, New York, United States
- Omer Gokcumen
  - Department of Biological Sciences, University at Buffalo, The State University of New York, New York, United States
9
Complex Haplotypes of GSTM1 Gene Deletions Harbor Signatures of a Selective Sweep in East Asian Populations. G3-GENES GENOMES GENETICS 2018; 8:2953-2966. [PMID: 30061374 PMCID: PMC6118300 DOI: 10.1534/g3.118.200462] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Indexed: 01/10/2023]
Abstract
The deletion of the metabolizing Glutathione S-transferase Mu 1 (GSTM1) gene has been associated with multiple cancers, metabolic and autoimmune disorders, as well as drug response. It is unusually common, with allele frequency reaching up to 75% in some human populations. Such high allele frequency of a derived allele with apparent impact on an otherwise conserved gene is a rare phenomenon. To investigate the evolutionary history of this locus, we analyzed 310 genomes using population genetics tools. Our analysis revealed a surprising lack of linkage disequilibrium between the deletion and the flanking single nucleotide variants in this locus. Tests that measure extended homozygosity and rapid change in allele frequency revealed signatures of an incomplete sweep in the locus. Using empirical approaches, we identified the Tanuki haplogroup, which carries the GSTM1 deletion and is found in approximately 70% of East Asian chromosomes. This haplogroup has rapidly increased in frequency in East Asian populations, contributing to a high population differentiation among continental human groups. We showed that extended homozygosity and population differentiation for this haplogroup are incompatible with simulated neutral expectations in East Asian populations. In parallel, we found that the Tanuki haplogroup is significantly associated with the expression levels of other GSTM genes. Collectively, our results suggest that standing variation in this locus has likely undergone an incomplete sweep in East Asia with regulatory impact on multiple GSTM genes. Our study provides the necessary framework for further studies to elucidate the evolutionary reasons that maintain disease-susceptibility variants in the GSTM1 locus.