1
|
Freedman AH, Sackton TB. Building better genome annotations across the tree of life. Genome Res 2025; 35:1261-1276. [PMID: 40234028 PMCID: PMC12047660 DOI: 10.1101/gr.280377.124] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/18/2024] [Accepted: 03/03/2025] [Indexed: 04/17/2025]
Abstract
Recent technological advances in long-read DNA sequencing accompanied by reduction in costs have made the production of genome assemblies financially achievable and computationally feasible, such that genome assembly no longer represents the major hurdle to evolutionary analysis for most nonmodel organisms. Now, the more difficult challenge is to properly annotate a draft genome assembly once it has been constructed. The primary challenge to annotations is how to select from the myriad gene prediction tools that are currently available, determine what kinds of data are necessary to generate high-quality annotations, and evaluate the quality of the annotation. To determine which methods perform the best and to determine whether the inclusion of RNA-seq data is necessary to obtain a high-quality annotation, we generated annotations with 12 different methods for 21 different species spanning vertebrates, plants, and insects. We found that the annotation transfer method TOGA, BRAKER3, and the RNA-seq assembler StringTie were consistently top performers across a variety of metrics including BUSCO recovery, CDS length, and false-positive rate, with the exception that TOGA performed less well in some monocots with respect to BUSCO recovery. The choice of which of the top-performing methods will depend upon the feasibility of whole-genome alignment, availability of RNA-seq data, importance of capturing noncoding parts of the transcriptome, and, when whole-genome alignment is not feasible, the relative performance in BUSCO recovery between BRAKER3 and StringTie. When whole-genome alignment is not feasible, inclusion of RNA-seq data will lead to substantial improvements to genome annotations.
Collapse
Affiliation(s)
- Adam H Freedman
- Informatics Group, Faculty of Arts and Sciences, Harvard University, Cambridge, Massachusetts 02138, USA
| | - Timothy B Sackton
- Informatics Group, Faculty of Arts and Sciences, Harvard University, Cambridge, Massachusetts 02138, USA
| |
Collapse
|
2
|
Tao XY, Feng SL, Yuan L, Li YJ, Li XJ, Guan XY, Chen ZH, Xu SC. Harnessing transposable elements for plant functional genomics and genome engineering. TRENDS IN PLANT SCIENCE 2025:S1360-1385(25)00067-6. [PMID: 40240259 DOI: 10.1016/j.tplants.2025.03.007] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Received: 01/06/2025] [Revised: 03/04/2025] [Accepted: 03/17/2025] [Indexed: 04/18/2025]
Abstract
Transposable elements (TEs) constitute a large portion of many plant genomes and play important roles in regulating gene expression and in driving genome evolution and crop domestication. Despite advances in understanding the functions and mechanisms of TEs, a comprehensive review of their integrated knowledge and cutting-edge biotechnological applications of TEs is still needed. We provide a thorough overview that connects discoveries, mechanisms, and technologies associated with plant TEs. We discuss the identification and function of TEs driven by functional genomics, epigenetic regulation of TEs, and utilization of active TEs in plant functional genomics and genome engineering. In summary, expanding the knowledge and application of TEs will be beneficial to crop breeding and plant synthetic biology in the future.
Collapse
Affiliation(s)
| | | | - Lu Yuan
- Xianghu Laboratory, Hangzhou 311231, China
| | - Yan-Jun Li
- Xianghu Laboratory, Hangzhou 311231, China
| | - Xin-Jia Li
- Xianghu Laboratory, Hangzhou 311231, China
| | - Xue-Ying Guan
- College of Agriculture and Biotechnology, Zhejiang University, Hangzhou 310058, China
| | - Zhong-Hua Chen
- School of Science, Western Sydney University, Penrith, NSW, Australia; School of Agriculture, Food and Wine, The University of Adelaide, Glen Osmond, 5064 SA, Australia.
| | - Sheng-Chun Xu
- Xianghu Laboratory, Hangzhou 311231, China; Institute of Digital Agriculture, Zhejiang Academy of Agricultural Science, Hangzhou, China.
| |
Collapse
|
3
|
Benegas G, Ye C, Albors C, Li JC, Song YS. Genomic language models: opportunities and challenges. Trends Genet 2025; 41:286-302. [PMID: 39753409 DOI: 10.1016/j.tig.2024.11.013] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/16/2024] [Revised: 11/21/2024] [Accepted: 11/21/2024] [Indexed: 04/10/2025]
Abstract
Large language models (LLMs) are having transformative impacts across a wide range of scientific fields, particularly in the biomedical sciences. Just as the goal of natural language processing is to understand sequences of words, a major objective in biology is to understand biological sequences. Genomic language models (gLMs), which are LLMs trained on DNA sequences, have the potential to significantly advance our understanding of genomes and how DNA elements at various scales interact to give rise to complex functions. To showcase this potential, we highlight key applications of gLMs, including functional constraint prediction, sequence design, and transfer learning. Despite notable recent progress, however, developing effective and efficient gLMs presents numerous challenges, especially for species with large, complex genomes. Here, we discuss major considerations for developing and evaluating gLMs.
Collapse
Affiliation(s)
- Gonzalo Benegas
- Computer Science Division, University of California, Berkeley, CA, USA
| | - Chengzhong Ye
- Department of Statistics, University of California, Berkeley, CA, USA
| | - Carlos Albors
- Computer Science Division, University of California, Berkeley, CA, USA
| | - Jianan Canal Li
- Computer Science Division, University of California, Berkeley, CA, USA
| | - Yun S Song
- Computer Science Division, University of California, Berkeley, CA, USA; Department of Statistics, University of California, Berkeley, CA, USA; Center for Computational Biology, University of California, Berkeley, CA, USA.
| |
Collapse
|
4
|
Wei W, Gui S, Yang J, Garrison E, Yan J, Liu HJ. wgatools: an ultrafast toolkit for manipulating whole-genome alignments. Bioinformatics 2025; 41:btaf132. [PMID: 40152239 PMCID: PMC11978383 DOI: 10.1093/bioinformatics/btaf132] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/12/2024] [Revised: 02/17/2025] [Accepted: 03/26/2025] [Indexed: 03/29/2025] Open
Abstract
SUMMARY With the rapid development of long-read sequencing technologies, the era of individual complete genomes is approaching. We have developed wgatools, a cross-platform, ultrafast toolkit that supports a range of whole-genome alignment formats, offering practical tools for conversion, processing, evaluation, and visualization of alignments, thereby facilitating population-level genome analysis and advancing functional and evolutionary genomics. AVAILABILITY AND IMPLEMENTATION wgatools supports diverse formats and can process, filter, and statistically evaluate alignments, perform alignment-based variant calling, and visualize alignments both locally and genome-wide. Built with Rust for efficiency and safe memory usage, it ensures fast performance and can handle large datasets consisting of hundreds of genomes. wgatools is published as free software under the MIT open-source license, and its source code is freely available at https://github.com/wjwei-handsome/wgatools and https://zenodo.org/records/14882797.
Collapse
Affiliation(s)
- Wenjie Wei
- School of Life Sciences, Westlake University, Hangzhou 310030, China
- National Laboratory of Crop Genetic Improvement, Huazhong Agricultural University, Wuhan 430070, China
| | - Songtao Gui
- National Key Laboratory of Wheat Improvement, College of Life Sciences, Shandong Agricultural University, Tai’an 271018, China
| | - Jian Yang
- School of Life Sciences, Westlake University, Hangzhou 310030, China
- Westlake Laboratory of Life Sciences and Biomedicine, Hangzhou 310024, China
| | - Erik Garrison
- Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN 38163, United States
| | - Jianbing Yan
- National Laboratory of Crop Genetic Improvement, Huazhong Agricultural University, Wuhan 430070, China
| | - Hai-Jun Liu
- Yazhouwan National Laboratory, Sanya 572024, China
| |
Collapse
|
5
|
Roberts MD, Davis O, Josephs EB, Williamson RJ. K-mer-based Approaches to Bridging Pangenomics and Population Genetics. Mol Biol Evol 2025; 42:msaf047. [PMID: 40111256 PMCID: PMC11925024 DOI: 10.1093/molbev/msaf047] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/18/2024] [Revised: 01/10/2025] [Accepted: 02/04/2025] [Indexed: 03/12/2025] Open
Abstract
Many commonly studied species now have more than one chromosome-scale genome assembly, revealing a large amount of genetic diversity previously missed by approaches that map short reads to a single reference. However, many species still lack multiple reference genomes and correctly aligning references to build pangenomes can be challenging for many species, limiting our ability to study this missing genomic variation in population genetics. Here, we argue that k-mers are a very useful but underutilized tool for bridging the reference-focused paradigms of population genetics with the reference-free paradigms of pangenomics. We review current literature on the uses of k-mers for performing three core components of most population genetics analyses: identifying, measuring, and explaining patterns of genetic variation. We also demonstrate how different k-mer-based measures of genetic variation behave in population genetic simulations according to the choice of k, depth of sequencing coverage, and degree of data compression. Overall, we find that k-mer-based measures of genetic diversity scale consistently with pairwise nucleotide diversity (π) up to values of about π=0.025 (R2=0.97) for neutrally evolving populations. For populations with even more variation, using shorter k-mers will maintain the scalability up to at least π=0.1. Furthermore, in our simulated populations, k-mer dissimilarity values can be reliably approximated from counting bloom filters, highlighting a potential avenue to decreasing the memory burden of k-mer-based genomic dissimilarity analyses. For future studies, there is a great opportunity to further develop methods to identifying selected loci using k-mers.
Collapse
Affiliation(s)
- Miles D Roberts
- Genetics and Genome Sciences Program, Michigan State University, East Lansing, MI 48824, USA
| | - Olivia Davis
- Department of Computer Science and Software Engineering, Rose-Hulman Institute of Technology, Terre Haute, IN 47803, USA
| | - Emily B Josephs
- Department of Plant Biology, Michigan State University, East Lansing, MI 48824, USA
- Ecology, Evolution, and Behavior Program, Michigan State University, East Lansing, MI 48824, USA
- Plant Resilience Institute, Michigan State University, East Lansing, MI 48824, USA
| | - Robert J Williamson
- Department of Computer Science and Software Engineering, Rose-Hulman Institute of Technology, Terre Haute, IN 47803, USA
- Department of Biology and Biomedical Engineering, Rose-Hulman Institute of Technology, Terre Haute, IN 47803, USA
| |
Collapse
|
6
|
Li H, Durbin R. Genome assembly in the telomere-to-telomere era. Nat Rev Genet 2024; 25:658-670. [PMID: 38649458 DOI: 10.1038/s41576-024-00718-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 02/27/2024] [Indexed: 04/25/2024]
Abstract
Genome sequences largely determine the biology and encode the history of an organism, and de novo assembly - the process of reconstructing the genome sequence of an organism from sequencing reads - has been a central problem in bioinformatics for four decades. Until recently, genomes were typically assembled into fragments of a few megabases at best, but now technological advances in long-read sequencing enable the near-complete assembly of each chromosome - also known as telomere-to-telomere assembly - for many organisms. Here, we review recent progress on assembly algorithms and protocols, with a focus on how to derive near-telomere-to-telomere assemblies. We also discuss the additional developments that will be required to resolve remaining assembly gaps and to assemble non-diploid genomes.
Collapse
Affiliation(s)
- Heng Li
- Department of Data Science, Dana-Farber Cancer Institute, Boston, MA, USA.
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA.
| | - Richard Durbin
- Department of Genetics, Cambridge University, Cambridge, UK.
| |
Collapse
|
7
|
Zhai J, Gokaslan A, Schiff Y, Berthel A, Liu ZY, Lai WY, Miller ZR, Scheben A, Stitzer MC, Romay MC, Buckler ES, Kuleshov V. Cross-species modeling of plant genomes at single nucleotide resolution using a pre-trained DNA language model. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.06.04.596709. [PMID: 38895432 PMCID: PMC11185591 DOI: 10.1101/2024.06.04.596709] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/21/2024]
Abstract
Interpreting function and fitness effects in diverse plant genomes requires transferable models. Language models (LMs) pre-trained on large-scale biological sequences can learn evolutionary conservation and offer cross-species prediction better than supervised models through fine-tuning limited labeled data. We introduce PlantCaduceus, a plant DNA LM based on the Caduceus and Mamba architectures, pre-trained on a curated dataset of 16 Angiosperm genomes. Fine-tuning PlantCaduceus on limited labeled Arabidopsis data for four tasks, including predicting translation initiation/termination sites and splice donor and acceptor sites, demonstrated high transferability to 160 million year diverged maize, outperforming the best existing DNA LM by 1.45 to 7.23-fold. PlantCaduceus is competitive to state-of-the-art protein LMs in terms of deleterious mutation identification, and is threefold better than PhyloP. Additionally, PlantCaduceus successfully identifies well-known causal variants in both Arabidopsis and maize. Overall, PlantCaduceus is a versatile DNA LM that can accelerate plant genomics and crop breeding applications.
Collapse
Affiliation(s)
- Jingjing Zhai
- Institute for Genomic Diversity, Cornell University, Ithaca, NY USA 14853
| | - Aaron Gokaslan
- Department of Computer Science, Cornell University, Ithaca, NY, USA 14853
| | - Yair Schiff
- Department of Computer Science, Cornell University, Ithaca, NY, USA 14853
| | - Ana Berthel
- Institute for Genomic Diversity, Cornell University, Ithaca, NY USA 14853
| | - Zong-Yan Liu
- Section of Plant Breeding and Genetics, Cornell University, Ithaca, NY USA 14853
| | - Wei-Yun Lai
- Institute for Genomic Diversity, Cornell University, Ithaca, NY USA 14853
| | - Zachary R. Miller
- Institute for Genomic Diversity, Cornell University, Ithaca, NY USA 14853
| | - Armin Scheben
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, 1 Bungtown Road, Cold Spring Harbor, NY USA 11724
| | | | - M. Cinta Romay
- Institute for Genomic Diversity, Cornell University, Ithaca, NY USA 14853
| | - Edward S. Buckler
- Institute for Genomic Diversity, Cornell University, Ithaca, NY USA 14853
- Section of Plant Breeding and Genetics, Cornell University, Ithaca, NY USA 14853
- USDA-ARS; Ithaca, NY, USA 14853
| | - Volodymyr Kuleshov
- Department of Computer Science, Cornell University, Ithaca, NY, USA 14853
| |
Collapse
|
8
|
Schreiber M, Jayakodi M, Stein N, Mascher M. Plant pangenomes for crop improvement, biodiversity and evolution. Nat Rev Genet 2024; 25:563-577. [PMID: 38378816 PMCID: PMC7616794 DOI: 10.1038/s41576-024-00691-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 12/14/2023] [Indexed: 02/22/2024]
Abstract
Plant genome sequences catalogue genes and the genetic elements that regulate their expression. Such inventories further research aims as diverse as mapping the molecular basis of trait diversity in domesticated plants or inquiries into the origin of evolutionary innovations in flowering plants millions of years ago. The transformative technological progress of DNA sequencing in the past two decades has enabled researchers to sequence ever more genomes with greater ease. Pangenomes - complete sequences of multiple individuals of a species or higher taxonomic unit - have now entered the geneticists' toolkit. The genomes of crop plants and their wild relatives are being studied with translational applications in breeding in mind. But pangenomes are applicable also in ecological and evolutionary studies, as they help classify and monitor biodiversity across the tree of life, deepen our understanding of how plant species diverged and show how plants adapt to changing environments or new selection pressures exerted by human beings.
Collapse
Affiliation(s)
- Mona Schreiber
- Department of Biology, University of Marburg, Marburg, Germany
| | - Murukarthick Jayakodi
- Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, Seeland, Germany
| | - Nils Stein
- Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, Seeland, Germany
- Martin Luther University Halle-Wittenberg, Halle (Saale), Germany
| | - Martin Mascher
- Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, Seeland, Germany.
- German Centre for Integrative Biodiversity Research (iDiv) Halle-Jena-Leipzig, Leipzig, Germany.
| |
Collapse
|
9
|
Conover JL, Grover CE, Sharbrough J, Sloan DB, Peterson DG, Wendel JF. Little evidence for homoeologous gene conversion and homoeologous exchange events in Gossypium allopolyploids. AMERICAN JOURNAL OF BOTANY 2024; 111:e16386. [PMID: 39107998 DOI: 10.1002/ajb2.16386] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 05/27/2024] [Revised: 07/01/2024] [Accepted: 07/01/2024] [Indexed: 08/24/2024]
Abstract
PREMISE A complicating factor in analyzing allopolyploid genomes is the possibility of physical interactions between homoeologous chromosomes during meiosis, resulting in either crossover (homoeologous exchanges) or non-crossover products (homoeologous gene conversion). Homoeologous gene conversion was first described in cotton by comparing SNP patterns in sequences from two diploid progenitors with those from the allopolyploid subgenomes. These analyses, however, did not explicitly consider other evolutionary scenarios that may give rise to similar SNP patterns as homoeologous gene conversion, creating uncertainties about the reality of the inferred gene conversion events. METHODS Here, we use an expanded phylogenetic sampling of high-quality genome assemblies from seven allopolyploid Gossypium species (all derived from the same polyploidy event), four diploid species (two closely related to each subgenome), and a diploid outgroup to derive a robust method for identifying potential genomic regions of gene conversion and homoeologous exchange. RESULTS We found little evidence for homoeologous gene conversion in allopolyploid cottons, and that only two of the 40 best-supported events were shared by more than one species. We did, however, reveal a single, shared homoeologous exchange event at one end of chromosome 1, which occurred shortly after allopolyploidization but prior to divergence of the descendant species. CONCLUSIONS Overall, our analyses demonstrated that homoeologous gene conversion and homoeologous exchanges are uncommon in Gossypium, affecting between zero and 24 genes per subgenome (0.0-0.065%) across the seven species. More generally, we highlighted the potential problems of using simple four-taxon tests to investigate patterns of homoeologous gene conversion in established allopolyploids.
Collapse
Affiliation(s)
- Justin L Conover
- Department of Ecology, Evolution, and Organismal Biology, Iowa State University, Ames, 50010, IA, USA
- Ecology and Evolutionary Biology Department, University of Arizona, Tucson, 85718, AZ, USA
- Molecular and Cellular Biology Department, University of Arizona, Tucson, 85718, AZ, USA
| | - Corrinne E Grover
- Department of Ecology, Evolution, and Organismal Biology, Iowa State University, Ames, 50010, IA, USA
| | - Joel Sharbrough
- Biology Department, New Mexico Institute of Mining and Technology, Socorro, 87801, NM, USA
| | - Daniel B Sloan
- Biology Department, Colorado State University, Fort Collins, 80521, CO, USA
| | - Daniel G Peterson
- Institute for Genomics, Biocomputing & Biotechnology, Mississippi State University, Mississippi State, 39762, MS, USA
| | - Jonathan F Wendel
- Department of Ecology, Evolution, and Organismal Biology, Iowa State University, Ames, 50010, IA, USA
| |
Collapse
|
10
|
Banović Đeri B, Nešić S, Vićić I, Samardžić J, Nikolić D. Benchmarking of five NGS mapping tools for the reference alignment of bacterial outer membrane vesicles-associated small RNAs. Front Microbiol 2024; 15:1401985. [PMID: 39101033 PMCID: PMC11294920 DOI: 10.3389/fmicb.2024.1401985] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/16/2024] [Accepted: 07/01/2024] [Indexed: 08/06/2024] Open
Abstract
Advances in small RNAs (sRNAs)-related studies have posed a challenge for NGS-related bioinformatics, especially regarding the correct mapping of sRNAs. Depending on the algorithms and scoring matrices on which they are based, aligners are influenced by the characteristics of the dataset and the reference genome. These influences have been studied mainly in eukaryotes and to some extent in prokaryotes. However, in bacteria, the selection of aligners depending on sRNA-seq data associated with outer membrane vesicles (OMVs) and the features of the corresponding bacterial reference genome has not yet been investigated. We selected five aligners: BBmap, Bowtie2, BWA, Minimap2 and Segemehl, known for their generally good performance, to test them in mapping OMV-associated sRNAs from Aliivibrio fischeri to the bacterial reference genome. Significant differences in the performance of the five aligners were observed, resulting in differential recognition of OMV-associated sRNA biotypes in A. fischeri. Our results suggest that aligner(s) should not be arbitrarily selected for this task, which is often done, as this can be detrimental to the biological interpretation of NGS analysis results. Since each aligner has specific advantages and disadvantages, these need to be considered depending on the characteristics of the input OMV sRNAs dataset and the corresponding bacterial reference genome to improve the detection of existing, biologically important OMV sRNAs. Until we learn more about these dependencies, we recommend using at least two, preferably three, aligners that have good metrics for the given dataset/bacterial reference genome. The overlapping results should be considered trustworthy, yet their differences should not be dismissed lightly, but treated carefully in order not to overlook any biologically important OMV sRNA. This can be achieved by applying the intersect-then-combine approach. For the mapping of OMV-associated sRNAs of A. fischeri to the reference genome organized into two circular chromosomes and one circular plasmid, containing copies of sequences with rRNA- and tRNA-related features and no copies of sequences with protein-encoding features, if the aligners are used with their default parameters, we advise avoiding Segemehl, and recommend using the intersect-then-combine approach with BBmap, BWA and Minimap2 to improve the potential for discovery of biologically important OMV-associated sRNAs.
Collapse
Affiliation(s)
- Bojana Banović Đeri
- Group for Plant Molecular Biology, Institute of Molecular Genetics and Genetic Engineering, University of Belgrade, Belgrade, Serbia
| | - Sofija Nešić
- Group for Plant Molecular Biology, Institute of Molecular Genetics and Genetic Engineering, University of Belgrade, Belgrade, Serbia
| | - Ivan Vićić
- Department of Food Hygiene and Technology, Faculty of Veterinary Medicine, University of Belgrade, Belgrade, Serbia
| | - Jelena Samardžić
- Group for Plant Molecular Biology, Institute of Molecular Genetics and Genetic Engineering, University of Belgrade, Belgrade, Serbia
| | - Dragana Nikolić
- Group for Plant Molecular Biology, Institute of Molecular Genetics and Genetic Engineering, University of Belgrade, Belgrade, Serbia
| |
Collapse
|
11
|
Zhou H, Su X, Song B. ACMGA: a reference-free multiple-genome alignment pipeline for plant species. BMC Genomics 2024; 25:515. [PMID: 38796435 PMCID: PMC11127342 DOI: 10.1186/s12864-024-10430-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/17/2024] [Accepted: 05/20/2024] [Indexed: 05/28/2024] Open
Abstract
BACKGROUND The short-read whole-genome sequencing (WGS) approach has been widely applied to investigate the genomic variation in the natural populations of many plant species. With the rapid advancements in long-read sequencing and genome assembly technologies, high-quality genome sequences are available for a group of varieties for many plant species. These genome sequences are expected to help researchers comprehensively investigate any type of genomic variants that are missed by the WGS technology. However, multiple genome alignment (MGA) tools designed by the human genome research community might be unsuitable for plant genomes. RESULTS To fill this gap, we developed the AnchorWave-Cactus Multiple Genome Alignment (ACMGA) pipeline, which improved the alignment of repeat elements and could identify long (> 50 bp) deletions or insertions (INDELs). We conducted MGA using ACMGA and Cactus for 8 Arabidopsis (Arabidopsis thaliana) and 26 Maize (Zea mays) de novo assembled genome sequences and compared them with the previously published short-read variant calling results. MGA identified more single nucleotide variants (SNVs) and long INDELs than did previously published WGS variant callings. Additionally, ACMGA detected significantly more SNVs and long INDELs in repetitive regions and the whole genome than did Cactus. Compared with the results of Cactus, the results of ACMGA were more similar to the previously published variants called using short-read. These two MGA pipelines identified numerous multi-allelic variants that were missed by the WGS variant calling pipeline. CONCLUSIONS Aligning de novo assembled genome sequences could identify more SNVs and INDELs than mapping short-read. ACMGA combines the advantages of AnchorWave and Cactus and offers a practical solution for plant MGA by integrating global alignment, a 2-piece-affine-gap cost strategy, and the progressive MGA algorithm.
Collapse
Affiliation(s)
- Huafeng Zhou
- College of Computer Science and Technology, Qingdao University, Qingdao, Shandong, 266071, China
- National Key Laboratory of Wheat Improvement, Peking University Institute of Advanced Agricultural Sciences, Shandong Laboratory of Advanced Agriculture Sciences in Weifang, Weifang, Shandong, 261325, China
| | - Xiaoquan Su
- College of Computer Science and Technology, Qingdao University, Qingdao, Shandong, 266071, China.
| | - Baoxing Song
- National Key Laboratory of Wheat Improvement, Peking University Institute of Advanced Agricultural Sciences, Shandong Laboratory of Advanced Agriculture Sciences in Weifang, Weifang, Shandong, 261325, China.
- Key Laboratory of Maize Biology and Genetic Breeding in Arid Area of Northwest Region of the Ministry of Agriculture, College of Agronomy, Northwest A&F University, Yangling, Shaanxi, 712100, China.
| |
Collapse
|