1
Yu H, Wang H, Liang X, Liu J, Jiang C, Chi X, Zhi N, Su P, Zha L, Gui S. Telomere-to-telomere gap-free genome assembly provides genetic insight into the triterpenoid saponins biosynthesis in Platycodon grandiflorus. Horticulture Research 2025; 12:uhaf030. [PMID: 40224331 PMCID: PMC11992332 DOI: 10.1093/hr/uhaf030] [Received: 11/06/2024] [Accepted: 01/29/2025] [Indexed: 04/15/2025]
Abstract
Platycodon grandiflorus has been widely used in Asia as a medicinal herb and food because of its anti-inflammatory and hepatoprotective properties. P. grandiflorus has important clinical value because of the active triterpenoid saponins in its roots. However, the biosynthetic pathway of triterpenoid saponins in P. grandiflorus remains unclear, and the related genes remain unknown. Therefore, in this study, we assembled a high-quality and integrated telomere-to-telomere P. grandiflorus reference genome and combined time-specific transcriptome and metabolome profiling to identify the cytochrome P450s (CYPs) responsible for the hydroxylation processes involved in triterpenoid saponin biosynthesis. Nine chromosomes were assembled without gaps or mismatches, and nine centromeres and 18 telomere regions were identified. This genome eliminated redundant sequences from previous genome versions and incorporated structural variation information. Comparative analysis of the P. grandiflorus genome revealed that P. grandiflorus underwent a core eudicot γ-WGT event. We screened 211 CYPs and found that tandem and proximal duplications may be crucial for the expansion of CYP families. We outlined the proposed hydroxylation steps, likely catalyzed by the CYP716A/72A/749A families, in platycodin biosynthesis and identified three PgCYP716A, seven PgCYP72A, and seven PgCYP749A genes that showed a positive correlation with platycodin biosynthesis. By establishing a T2T assembly genome, transcriptome, and metabolome resource for P. grandiflorus, we provide a foundation for the complete elucidation of the platycodins biosynthetic pathway, which consequently leads to heterologous bioproduction, and serves as a fundamental genetic resource for molecular-assisted breeding and genetic improvement of P. grandiflorus.
Affiliation(s)
- Hanwen Yu
- College of Pharmacy, Anhui University of Chinese Medicine, Hefei 230012, China
- Haixia Wang
- College of Pharmacy, Anhui University of Chinese Medicine, Hefei 230012, China
- Xiao Liang
- College of Pharmacy, Anhui University of Chinese Medicine, Hefei 230012, China
- Juan Liu
- State Key Laboratory for Quality Ensurance and Sustainable Use of Dao-di Herbs, National Resource Center for Chinese Materia Medica, China Academy of Chinese Medical Sciences, Beijing 100700, China
- Chao Jiang
- State Key Laboratory for Quality Ensurance and Sustainable Use of Dao-di Herbs, National Resource Center for Chinese Materia Medica, China Academy of Chinese Medical Sciences, Beijing 100700, China
- Xiulian Chi
- State Key Laboratory for Quality Ensurance and Sustainable Use of Dao-di Herbs, National Resource Center for Chinese Materia Medica, China Academy of Chinese Medical Sciences, Beijing 100700, China
- Nannan Zhi
- College of Pharmacy, Anhui University of Chinese Medicine, Hefei 230012, China
- Ping Su
- State Key Laboratory for Quality Ensurance and Sustainable Use of Dao-di Herbs, National Resource Center for Chinese Materia Medica, China Academy of Chinese Medical Sciences, Beijing 100700, China
- Liangping Zha
- College of Pharmacy, Anhui University of Chinese Medicine, Hefei 230012, China
- Institute of Conservation and Development of Traditional Chinese Medicine Resources, Anhui Academy of Chinese Medicine, Hefei 230012, China
- MOE-Anhui Joint Collaborative Innovation Center for Quality Improvement of Anhui Genuine Chinese Medicinal Materials, Hefei 230012, China
- Center for Xin'an Medicine and Modernization of Traditional Chinese Medicine of IHM, Anhui University of Chinese Medicine, Hefei 230012, China
- Shuangying Gui
- College of Pharmacy, Anhui University of Chinese Medicine, Hefei 230012, China
- MOE-Anhui Joint Collaborative Innovation Center for Quality Improvement of Anhui Genuine Chinese Medicinal Materials, Hefei 230012, China
- Institute of Pharmaceutics, Anhui Academy of Chinese Medicine, Hefei 230012, China
- Anhui Province Key Laboratory of Pharmaceutical Preparation Technology and Application, Hefei 230012, China
2
Feng S, Wang Z, Lin K, Wang K, Zheng S, Wang Q, Lin L, Lu Y. Haplotype-resolved genomes of Trichophyton mentagrophytes and Trichophyton tonsurans. Sci Data 2025; 12:559. [PMID: 40210855 PMCID: PMC11985949 DOI: 10.1038/s41597-025-04835-x] [Received: 10/22/2024] [Accepted: 03/14/2025] [Indexed: 04/12/2025]
Abstract
Dermatophytes have posed a significant health concern due to their ability to parasitize human and animal skin, hair, and nails, causing a spectrum of dermatological conditions. However, the absence of high-quality genomes hinders our understanding of the dermatophytes. In this study, we utilized the circular consensus sequencing (CCS) technology to generate haplotype-resolved, nearly-complete genomes for two representative dermatophytes, Trichophyton mentagrophytes and Trichophyton tonsurans. Total sizes of the genomes ranged from 23.8 Mb to 25.2 Mb, with the contig N50 lengths of 6.47 Mb and 12.65 Mb, respectively. Each genome assembly was gapless and possessed three pseudochromosomes, with two achieving telomere-to-telomere (T2T) level. BUSCO analysis of the assemblies revealed approximately 99% of genome completeness. More than 7500 protein-coding genes were identified, and over 99% of the genes were well annotated through multiple gene function databases. Approximately 10% of the genomes were covered by repeats, particularly retrotransposons. Our findings provided valuable genomic resources of dermatophytes, paving the way for developing more effective medical interventions and public health strategies against Trichophyton infections.
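As a reading aid for the contig N50 figures quoted above: N50 is the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the assembly. The following is a minimal Python sketch of that definition; the function name `contig_n50` and the example lengths are illustrative assumptions, not values or tooling from the cited study.

```python
def contig_n50(lengths):
    """Return the N50 of a list of contig lengths.

    Contigs are visited from longest to shortest; the N50 is the length
    of the contig at which the running total first reaches half of the
    total assembly size.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("contig list is empty")


# Hypothetical contig lengths (arbitrary units), not taken from the paper.
example_lengths = [2, 3, 4, 5, 6]
example_n50 = contig_n50(example_lengths)
```

A single large contig dominates the statistic, which is why gapless assemblies with only a few pseudochromosomes report N50 values in the megabase range.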
Affiliation(s)
- Sijie Feng
- School of Medicine, Henan Polytechnic University, 454000, Jiaozuo, China
- School of Medicine, Zhejiang University, 310016, Hangzhou, China
- Zhenhui Wang
- School of Medicine, Henan Polytechnic University, 454000, Jiaozuo, China
- Kainan Lin
- School of Medicine, Zhejiang University, 310016, Hangzhou, China
- Kun Wang
- School of Medicine, Henan Polytechnic University, 454000, Jiaozuo, China
- Shuting Zheng
- School of Medicine, Henan Polytechnic University, 454000, Jiaozuo, China
- Qianqian Wang
- School of Medicine, Zhejiang University, 310016, Hangzhou, China
- Lianyu Lin
- State Key Laboratory of Ecological Pest Control for Fujian and Taiwan Crops, College of Life Science, Fujian Agriculture and Forestry University, 350002, Fuzhou, China
- Yunkun Lu
- School of Medicine, Zhejiang University, 310016, Hangzhou, China
3
Mouratidis I, Konnaris MA, Chantzi N, Chan CSY, Patsakis M, Provatas K, Montgomery A, Baltoumas FA, Sha CM, Mareboina M, Pavlopoulos GA, Chartoumpekis DV, Georgakopoulos-Soares I. Identification of the shortest species-specific oligonucleotide sequences. Genome Res 2025; 35:279-295. [PMID: 39746719 PMCID: PMC11874967 DOI: 10.1101/gr.280070.124] [Received: 10/07/2024] [Accepted: 11/27/2024] [Indexed: 01/04/2025]
Abstract
Despite the exponential increase in sequencing information driven by massively parallel DNA sequencing technologies, universal and succinct genomic fingerprints for each organism are still missing. Identifying the shortest species-specific nucleotide sequences offers insights into species evolution and holds potential practical applications in agriculture, wildlife conservation, and healthcare. We propose a new method for sequence analysis termed nucleic "quasi-primes," the shortest occurring sequences in each of 45,076 organismal reference genomes, present in one genome and absent from every other examined genome. In the human genome, we find that the genomic loci of nucleic quasi-primes are most enriched for genes associated with brain development and cognitive function. In a single-cell case study focusing on the human primary motor cortex, nucleic quasi-prime genes account for a significantly larger proportion of the variation based on average gene expression. Nonneuronal cell types, including astrocytes, endothelial cells, microglia perivascular-macrophages, oligodendrocytes, and vascular and leptomeningeal cells, exhibit significant activation of quasi-prime-containing gene associations related to cancer, whereas simultaneously suppressing quasi-prime-containing genes are associated with cognitive, mental, and developmental disorders. We also show that human disease-causing variants, eQTLs, mQTLs, and sQTLs are 4.43-fold, 4.34-fold, 4.29-fold, and 4.21-fold enriched at human quasi-prime loci, respectively. These findings indicate that nucleic quasi-primes are genomic loci linked to the evolution of species-specific traits, and in humans, they provide insights in the development of cognitive traits and human diseases, including neurodevelopmental disorders.
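The quasi-prime definition in this abstract (shortest sequences present in one genome and absent from every other examined genome) can be illustrated with a small set-based sketch in Python. This is not the authors' implementation: the function names, the `max_k` cutoff, and the toy sequences are assumptions, and the sketch ignores reverse complements and the scale of tens of thousands of real genomes.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shortest_unique_kmers(genomes, max_k=8):
    """Map each genome name to (k, unique k-mers) at the smallest k <= max_k
    for which it has k-mers that occur in no other genome."""
    result = {}
    for k in range(1, max_k + 1):
        per_genome = {name: kmers(seq, k) for name, seq in genomes.items()}
        for name, own in per_genome.items():
            if name in result:
                continue  # already resolved at a shorter k
            others = set().union(
                *(s for other, s in per_genome.items() if other != name)
            )
            unique = own - others
            if unique:
                result[name] = (k, sorted(unique))
        if len(result) == len(genomes):
            break
    return result


# Toy "genomes" for illustration only.
toy = {"a": "ACGT", "b": "ACGG"}
toy_result = shortest_unique_kmers(toy)
```

In the toy input, genome `a` is resolved at k = 1 because only it contains `T`, while genome `b` needs k = 2 before `GG` distinguishes it.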
Affiliation(s)
- Ioannis Mouratidis
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, Pennsylvania 17033, USA
- Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, Pennsylvania 16802, USA
- Maxwell A Konnaris
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, Pennsylvania 17033, USA
- Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, Pennsylvania 16802, USA
- Nikol Chantzi
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, Pennsylvania 17033, USA
- Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, Pennsylvania 16802, USA
- Candace S Y Chan
- Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, San Francisco, California 94143, USA
- Michail Patsakis
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, Pennsylvania 17033, USA
- National Technical University of Athens, School of Electrical and Computer Engineering, Athens 15772, Greece
- Kimonas Provatas
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, Pennsylvania 17033, USA
- National Technical University of Athens, School of Electrical and Computer Engineering, Athens 15772, Greece
- Austin Montgomery
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, Pennsylvania 17033, USA
- Fotis A Baltoumas
- Institute for Fundamental Biomedical Research, BSRC "Alexander Fleming," Vari 16672, Greece
- Congzhou M Sha
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, Pennsylvania 17033, USA
- Manvita Mareboina
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, Pennsylvania 17033, USA
- Georgios A Pavlopoulos
- Institute for Fundamental Biomedical Research, BSRC "Alexander Fleming," Vari 16672, Greece
- Center for New Biotechnologies and Precision Medicine, School of Medicine, National and Kapodistrian University of Athens, Athens 11527, Greece
- Dionysios V Chartoumpekis
- Service of Endocrinology, Diabetology and Metabolism, Lausanne University Hospital, 1005 Lausanne, Switzerland
- Ilias Georgakopoulos-Soares
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, Pennsylvania 17033, USA
4
Edwards RJ, Chen SH, Halliday B, Bragg JG. Small but Mitey: A Gapless Telomere-to-Telomere Assembly of an Unidentified Mite With a Streamlined Genome. Genome Biol Evol 2025; 17:evaf023. [PMID: 39943745 DOI: 10.1093/gbe/evaf023] [Accepted: 01/23/2025] [Indexed: 03/06/2025]
Abstract
A draft assembly of the rainforest tree Rhodamnia argentea Benth. (malletwood, Myrtaceae) revealed contaminating DNA sequences that most closely matched those from mites in the family Eriophyidae. Eriophyoid mites are plant parasites that often induce galls or other deformities on their host plants. They are notable for their small size (averaging 200 μm), distinctive four-legged body structure, and heavily streamlined genomes, which are among the smallest known of all arthropods. Contaminating mite sequences were assembled into a high-quality gapless telomere-to-telomere nuclear genome. The entire genome was assembled on two fully contiguous chromosomes, capped with a novel TTTGG or TTTGGTGTTGG telomere sequence, and exhibited clear signs of genome reduction (34.5 Mbp total length, 68.6% arachnid Benchmarking Universal Single-Copy Ortholog completeness). Phylogenomic analysis confirmed that this genome is that of a previously unsequenced eriophyoid mite. Despite its unknown identity, this complete nuclear genome provides a valuable resource to investigate invertebrate genome reduction.
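One simple way to check whether an assembled chromosome is "capped" by a telomere motif such as the TTTGG unit mentioned above is to count tandem copies of the motif at the 3' end and of its reverse complement at the 5' end. The Python sketch below illustrates that idea only; the function names and the toy sequence are assumptions, and it does not reproduce the authors' pipeline or tolerate the variant units (such as TTTGGTGTTGG) that real telomeres contain.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def tandem_copies_at_end(seq, motif):
    """Count exact, uninterrupted copies of motif at the 3' end of seq."""
    copies = 0
    while seq.endswith(motif * (copies + 1)):
        copies += 1
    return copies


def tandem_copies_at_start(seq, motif):
    """Count exact, uninterrupted copies of motif at the 5' end of seq."""
    copies = 0
    while seq.startswith(motif * (copies + 1)):
        copies += 1
    return copies


# Toy chromosome: 2 copies of the reverse-complement motif, a short body,
# then 3 copies of the forward motif. Illustrative only.
motif = "TTTGG"
toy_chromosome = reverse_complement(motif) * 2 + "ACGT" + motif * 3
start_copies = tandem_copies_at_start(toy_chromosome, reverse_complement(motif))
end_copies = tandem_copies_at_end(toy_chromosome, motif)
```

Exact matching is deliberately strict here; a practical check would usually allow a small number of mismatches per repeat unit.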
Affiliation(s)
- Richard J Edwards
- School of Biotechnology and Biomolecular Sciences, Evolution & Ecology Research Centre, University of New South Wales, Kensington, NSW 2052, Australia
- Minderoo OceanOmics Centre at UWA, Oceans Institute, University of Western Australia, Perth, WA 6009, Australia
- Stephanie H Chen
- School of Biotechnology and Biomolecular Sciences, Evolution & Ecology Research Centre, University of New South Wales, Kensington, NSW 2052, Australia
- Research Centre for Ecosystem Resilience, Botanic Gardens of Sydney, Sydney, NSW 2000, Australia
- Centre for Australian National Biodiversity Research (a joint venture between Parks Australia and CSIRO), Canberra, ACT 2601, Australia
- Bruce Halliday
- Australian National Insect Collection, CSIRO, Canberra, ACT 2601, Australia
- Jason G Bragg
- Research Centre for Ecosystem Resilience, Botanic Gardens of Sydney, Sydney, NSW 2000, Australia
5
Mouratidis I, Baltoumas FA, Chantzi N, Patsakis M, Chan CS, Montgomery A, Konnaris MA, Aplakidou E, Georgakopoulos GC, Das A, Chartoumpekis DV, Kovac J, Pavlopoulos GA, Georgakopoulos-Soares I. kmerDB: A database encompassing the set of genomic and proteomic sequence information for each species. Comput Struct Biotechnol J 2024; 23:1919-1928. [PMID: 38711760 PMCID: PMC11070822 DOI: 10.1016/j.csbj.2024.04.050] [Received: 12/12/2023] [Revised: 04/17/2024] [Accepted: 04/18/2024] [Indexed: 05/08/2024]
Abstract
The decrease in sequencing expenses has facilitated the creation of reference genomes and proteomes for an expanding array of organisms. Nevertheless, no established repository that details organism-specific genomic and proteomic sequences of specific lengths, referred to as kmers, exists to our knowledge. In this article, we present kmerDB, a database accessible through an interactive web interface that provides kmer-based information from genomic and proteomic sequences in a systematic way. kmerDB currently contains 202,340,859,107 base pairs and 19,304,903,356 amino acids, spanning 54,039 and 21,865 reference genomes and proteomes, respectively, as well as 6,905,362 and 149,305,183 genomic and proteomic species-specific sequences, termed quasi-primes. Additionally, we provide access to 5,186,757 nucleic and 214,904,089 peptide sequences absent from every genome and proteome, termed primes. kmerDB features a user-friendly interface offering various search options and filters for easy parsing and searching. The service is available at: www.kmerdb.com.
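The "primes" described here are k-mers that occur in none of the examined sequences. For small k this can be sketched in Python by enumerating every possible k-mer over the alphabet and subtracting the observed ones. The function names and toy inputs below are assumptions for illustration; this is not kmerDB's code, and brute-force enumeration is only practical for short k.

```python
from itertools import product


def observed_kmers(sequences, k):
    """Collect every length-k substring seen in any of the sequences."""
    seen = set()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            seen.add(seq[i:i + k])
    return seen


def absent_kmers(sequences, k, alphabet="ACGT"):
    """Return the sorted k-mers over alphabet that occur in no sequence."""
    seen = observed_kmers(sequences, k)
    candidates = ("".join(letters) for letters in product(alphabet, repeat=k))
    return sorted(kmer for kmer in candidates if kmer not in seen)


# Toy sequences for illustration only.
toy_sequences = ["ACGT", "TGCA"]
toy_absent = absent_kmers(toy_sequences, 2)
```

The same functions work for peptide sequences if `alphabet` is set to the twenty amino-acid letters, although the candidate space then grows much faster with k.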
Affiliation(s)
- Ioannis Mouratidis
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, PA, USA
- Fotis A. Baltoumas
- Institute for Fundamental Biomedical Research, BSRC "Alexander Fleming", Vari, 16672, Greece
- Nikol Chantzi
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Michail Patsakis
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Candace S.Y. Chan
- Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, San Francisco, CA, USA
- Austin Montgomery
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Maxwell A. Konnaris
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, PA, USA
- Department of Statistics, The Pennsylvania State University, University Park, PA, USA
- Eleni Aplakidou
- Institute for Fundamental Biomedical Research, BSRC "Alexander Fleming", Vari, 16672, Greece
- Department of Basic Sciences, School of Medicine, University of Crete, Heraklion, Greece
- George C. Georgakopoulos
- National Technical University of Athens, School of Electrical and Computer Engineering, Athens, Greece
- Anshuman Das
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Dionysios V. Chartoumpekis
- Service of Endocrinology, Diabetology and Metabolism, Lausanne University Hospital, Lausanne, Switzerland
- Jasna Kovac
- Department of Food Science, The Pennsylvania State University, University Park, PA 16802, USA
- Georgios A. Pavlopoulos
- Institute for Fundamental Biomedical Research, BSRC "Alexander Fleming", Vari, 16672, Greece
- Center for New Biotechnologies and Precision Medicine, School of Medicine, National and Kapodistrian University of Athens, Athens, 11527, Greece
- Ilias Georgakopoulos-Soares
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
6
Tao Y, Xian W, Bao Z, Rabanal FA, Movilli A, Lanz C, Shirsekar G, Weigel D. Atlas of telomeric repeat diversity in Arabidopsis thaliana. Genome Biol 2024; 25:244. [PMID: 39285474 PMCID: PMC11406999 DOI: 10.1186/s13059-024-03388-3] [Received: 01/29/2024] [Accepted: 09/03/2024] [Indexed: 09/19/2024]
Abstract
BACKGROUND: Telomeric repeat arrays at the ends of chromosomes are highly dynamic in composition, but their repetitive nature and technological limitations have made it difficult to assess their true variation in genome diversity surveys.
RESULTS: We have comprehensively characterized the sequence variation immediately adjacent to the canonical telomeric repeat arrays at the very ends of chromosomes in 74 genetically diverse Arabidopsis thaliana accessions. We first describe several types of distinct telomeric repeat units and then identify evolutionary processes such as local homogenization and higher-order repeat formation that shape diversity of chromosome ends. By comparing largely isogenic samples, we also determine repeat number variation of the degenerate and variant telomeric repeat array at both the germline and somatic levels. Finally, our analysis of haplotype structure uncovers chromosome end-specific patterns in the distribution of variant telomeric repeats, and their linkage to the more proximal non-coding region.
CONCLUSIONS: Our findings illustrate the spectrum of telomeric repeat variation at multiple levels in A. thaliana (in germline and soma, across all chromosome ends, and across genetic groups), thereby expanding our knowledge of the evolution of chromosome ends.
Affiliation(s)
- Yueqi Tao
- Department of Molecular Biology, Max Planck Institute for Biology Tübingen, Tübingen, 72076, Germany
- Wenfei Xian
- Department of Molecular Biology, Max Planck Institute for Biology Tübingen, Tübingen, 72076, Germany
- Zhigui Bao
- Department of Molecular Biology, Max Planck Institute for Biology Tübingen, Tübingen, 72076, Germany
- Fernando A Rabanal
- Department of Molecular Biology, Max Planck Institute for Biology Tübingen, Tübingen, 72076, Germany
- Andrea Movilli
- Department of Molecular Biology, Max Planck Institute for Biology Tübingen, Tübingen, 72076, Germany
- Christa Lanz
- Department of Molecular Biology, Max Planck Institute for Biology Tübingen, Tübingen, 72076, Germany
- Gautam Shirsekar
- Department of Molecular Biology, Max Planck Institute for Biology Tübingen, Tübingen, 72076, Germany
- Detlef Weigel
- Department of Molecular Biology, Max Planck Institute for Biology Tübingen, Tübingen, 72076, Germany
7
Khalaf A, Francis O, Blaxter ML. Genome evolution in intracellular parasites: Microsporidia and Apicomplexa. J Eukaryot Microbiol 2024; 71:e13033. [PMID: 38785208 DOI: 10.1111/jeu.13033] [Received: 02/14/2024] [Revised: 03/29/2024] [Accepted: 05/02/2024] [Indexed: 05/25/2024]
Abstract
Microsporidia and Apicomplexa are eukaryotic, single-celled, intracellular parasites with huge public health and economic importance. Typically, these parasites are studied separately, emphasizing their uniqueness and diversity. In this review, we explore the huge amount of genomic data that has recently become available for the two groups. We compare and contrast their genome evolution and discuss how their transitions to intracellular life may have shaped it. In particular, we explore genome reduction and compaction, genome expansion and ploidy, gene shuffling and rearrangements, and the evolution of centromeres and telomeres.
Affiliation(s)
- Amjad Khalaf
- Tree of Life, Wellcome Sanger Institute, Cambridge, UK
- Ore Francis
- Tree of Life, Wellcome Sanger Institute, Cambridge, UK
8
Malusare A, Kothandaraman H, Tamboli D, Lanman NA, Aggarwal V. Understanding the Natural Language of DNA using Encoder-Decoder Foundation Models with Byte-level Precision. arXiv 2024; arXiv:2311.02333v3. [PMID: 38410643 PMCID: PMC10896356] [Indexed: 02/28/2024]
Abstract
This paper presents the Ensemble Nucleotide Byte-level Encoder-Decoder (ENBED) foundation model, analyzing DNA sequences at byte-level precision with an encoder-decoder Transformer architecture. ENBED uses a sub-quadratic implementation of attention to develop an efficient model capable of sequence-to-sequence transformations, generalizing previous genomic models with encoder-only or decoder-only architectures. We use Masked Language Modeling to pre-train the foundation model using reference genome sequences and apply it in the following downstream tasks: (1) identification of enhancers, promoters and splice sites, (2) recognition of sequences containing base call mismatches and insertion/deletion errors, an advantage over tokenization schemes involving multiple base pairs, which lose the ability to analyze with byte-level precision, (3) identification of biological function annotations of genomic sequences, and (4) generating mutations of the Influenza virus using the encoder-decoder architecture and validating them against real-world observations. In each of these tasks, we demonstrate significant improvement as compared to the existing state-of-the-art results.
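The contrast drawn above between byte-level tokens and tokenization schemes that group multiple base pairs can be made concrete with a toy comparison in Python. The sketch below is not ENBED's tokenizer: the function names, the non-overlapping 3-mer scheme, and the example sequences are assumptions chosen only to show how a single inserted base affects each representation.

```python
def byte_tokens(seq):
    """One token per nucleotide: the ASCII byte value of each character."""
    return list(seq.encode("ascii"))


def kmer_tokens(seq, k=3):
    """Non-overlapping k-mer tokens; the last token may be shorter than k."""
    return [seq[i:i + k] for i in range(0, len(seq), k)]


# Illustrative sequences: the second has one extra G inserted after "ACG".
reference = "ACGTACGTA"
with_insertion = "ACGGTACGTA"

reference_bytes = byte_tokens(reference)
insertion_bytes = byte_tokens(with_insertion)
reference_kmers = kmer_tokens(reference)
insertion_kmers = kmer_tokens(with_insertion)
```

With byte-level tokens the insertion adds one token and leaves the vocabulary unchanged, whereas with the 3-mer scheme every token downstream of the insertion changes because the reading frame shifts.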
Affiliation(s)
- Aditya Malusare
- School of Industrial Engineering, Purdue University, USA
- Institute for Cancer Research, Purdue University, USA
- Dipesh Tamboli
- Elmore Family School of Electrical and Computer Engineering, Purdue University, USA
- Nadia A. Lanman
- Institute for Cancer Research, Purdue University, USA
- Department of Comparative Pathobiology, Purdue University, USA
- Vaneet Aggarwal
- School of Industrial Engineering, Purdue University, USA
- Institute for Cancer Research, Purdue University, USA
- Elmore Family School of Electrical and Computer Engineering, Purdue University, USA
9
Malusare A, Kothandaraman H, Tamboli D, Lanman NA, Aggarwal V. Understanding the natural language of DNA using encoder-decoder foundation models with byte-level precision. Bioinformatics Advances 2024; 4:vbae117. [PMID: 39176288 PMCID: PMC11341122 DOI: 10.1093/bioadv/vbae117] [Received: 04/10/2024] [Revised: 07/06/2024] [Accepted: 08/10/2024] [Indexed: 08/24/2024]
Abstract
Summary: This article presents the Ensemble Nucleotide Byte-level Encoder-Decoder (ENBED) foundation model, analyzing DNA sequences at byte-level precision with an encoder-decoder Transformer architecture. ENBED uses a subquadratic implementation of attention to develop an efficient model capable of sequence-to-sequence transformations, generalizing previous genomic models with encoder-only or decoder-only architectures. We use Masked Language Modeling to pretrain the foundation model using reference genome sequences and apply it in the following downstream tasks: (i) identification of enhancers, promoters, and splice sites, (ii) recognition of sequences containing base call mismatches and insertion/deletion errors, an advantage over tokenization schemes involving multiple base pairs, which lose the ability to analyze with byte-level precision, (iii) identification of biological function annotations of genomic sequences, and (iv) generating mutations of the Influenza virus using the encoder-decoder architecture and validating them against real-world observations. In each of these tasks, we demonstrate significant improvement as compared to the existing state-of-the-art results.
Availability and implementation: The source code used to develop and fine-tune the foundation model has been released on GitHub (https://github.itap.purdue.edu/Clan-labs/ENBED).
Affiliation(s)
- Aditya Malusare
- School of Industrial Engineering, Purdue University, West Lafayette, IN 47907, United States
- Institute for Cancer Research, Purdue University, West Lafayette, IN 47907, United States
- Harish Kothandaraman
- Institute for Cancer Research, Purdue University, West Lafayette, IN 47907, United States
- Dipesh Tamboli
- Elmore Family School of Electrical and Computer Engineering, Purdue University, West Lafayette, IN 47907, United States
- Nadia A Lanman
- Institute for Cancer Research, Purdue University, West Lafayette, IN 47907, United States
- Department of Comparative Pathobiology, Purdue University, West Lafayette, IN 47907, United States
- Vaneet Aggarwal
- School of Industrial Engineering, Purdue University, West Lafayette, IN 47907, United States
- Institute for Cancer Research, Purdue University, West Lafayette, IN 47907, United States
- Elmore Family School of Electrical and Computer Engineering, Purdue University, West Lafayette, IN 47907, United States
10
Stoianova D, Grozeva S, Golub NV, Anokhin BA, Kuznetsova VG. The First FISH-Confirmed Non-Canonical Telomeric Motif in Heteroptera: Cimex lectularius Linnaeus, 1758 and C. hemipterus (Fabricius, 1803) (Hemiptera, Cimicidae) Have a 10 bp Motif (TTAGGGATGG)n. Genes (Basel) 2024; 15:1026. [PMID: 39202386 PMCID: PMC11354137 DOI: 10.3390/genes15081026] [Received: 07/09/2024] [Revised: 07/26/2024] [Accepted: 08/01/2024] [Indexed: 09/03/2024]
Abstract
Fluorescence in situ hybridization (FISH) with two different probes, the canonical insect telomeric sequence (TTAGG)n and the sequence (TTAGGGATGG)n, was performed on meiotic chromosomes of two members of the true bug family Cimicidae (Cimicomorpha), the common bed bug Cimex lectularius Linnaeus, 1758 and the tropical bed bug C. hemipterus (Fabricius, 1803), whose telomeric motifs were not known. In both species, there were no hybridization signals with the first probe, but strong signals at chromosomal ends were observed with the second probe, indicating the presence of a telomeric motif (TTAGGGATGG)n. This study represents the first FISH confirmation of the presence of a non-canonical telomeric motif not only for the infraorder Cimicomorpha but also for the suborder Heteroptera (Hemiptera) as a whole. The present finding is of key significance for unraveling the evolutionary shifts in the telomeric sequences in this suborder.
Affiliation(s)
- Desislava Stoianova
- Institute of Biodiversity and Ecosystem Research, Bulgarian Academy of Sciences, 1000 Sofia, Bulgaria
- Snejana Grozeva
- Institute of Biodiversity and Ecosystem Research, Bulgarian Academy of Sciences, 1000 Sofia, Bulgaria
- Natalia V. Golub
- Zoological Institute, Russian Academy of Sciences, 199034 St. Petersburg, Russia
- Boris A. Anokhin
- Zoological Institute, Russian Academy of Sciences, 199034 St. Petersburg, Russia
- Valentina G. Kuznetsova
- Zoological Institute, Russian Academy of Sciences, 199034 St. Petersburg, Russia
11
Xu Y, Wang C, Li Z, Zheng X, Kang Z, Lu P, Zhang J, Cao P, Chen Q, Liu X. A chromosome-level haplotype-resolved genome assembly of oriental tobacco budworm (Helicoverpa assulta). Sci Data 2024; 11:461. [PMID: 38710675 DOI: 10.1038/s41597-024-03264-6] [Received: 12/25/2023] [Accepted: 04/15/2024] [Indexed: 05/08/2024]
Abstract
Oriental tobacco budworm (Helicoverpa assulta) and cotton bollworm (Helicoverpa armigera) are two closely related species within the genus Helicoverpa. They have similar appearances and consistent damage patterns, often leading to confusion. However, the cotton bollworm is a typical polyphagous insect, while the oriental tobacco budworm belongs to the oligophagous insects. In this study, we used Nanopore, PacBio, and Illumina platforms to sequence the genome of H. assulta and used Hifiasm to create a haplotype-resolved draft genome. The Hi-C technique helped anchor 33 primary contigs to 32 chromosomes, including two sex chromosomes, Z and W. The final primary haploid genome assembly was approximately 415.19 Mb in length. BUSCO analysis revealed a high degree of completeness, with 99.0% gene coverage in this genome assembly. The repeat sequences constituted 38.39% of the genome assembly, and we annotated 17093 protein-coding genes. The high-quality genome assembly of the oriental tobacco budworm serves as a valuable genetic resource that enhances our comprehension of how they select hosts in a complex odour environment. It will also aid in developing an effective control policy.
Affiliation(s)
- Yalong Xu
- China Tobacco Gene Research Center, Zhengzhou Tobacco Research Institute of CNTC, Zhengzhou, 450001, China
- Beijing Life Science Academy (BLSA), Beijing, 102209, China
- Chen Wang
- China Tobacco Gene Research Center, Zhengzhou Tobacco Research Institute of CNTC, Zhengzhou, 450001, China
- Beijing Life Science Academy (BLSA), Beijing, 102209, China
- Zefeng Li
- China Tobacco Gene Research Center, Zhengzhou Tobacco Research Institute of CNTC, Zhengzhou, 450001, China
- Beijing Life Science Academy (BLSA), Beijing, 102209, China
- Xueao Zheng
- China Tobacco Gene Research Center, Zhengzhou Tobacco Research Institute of CNTC, Zhengzhou, 450001, China
- Beijing Life Science Academy (BLSA), Beijing, 102209, China
- Zhengzhong Kang
- China Tobacco Gene Research Center, Zhengzhou Tobacco Research Institute of CNTC, Zhengzhou, 450001, China
- Beijing Life Science Academy (BLSA), Beijing, 102209, China
- Peng Lu
- China Tobacco Gene Research Center, Zhengzhou Tobacco Research Institute of CNTC, Zhengzhou, 450001, China
- Beijing Life Science Academy (BLSA), Beijing, 102209, China
- Jianfeng Zhang
- China Tobacco Gene Research Center, Zhengzhou Tobacco Research Institute of CNTC, Zhengzhou, 450001, China
- Beijing Life Science Academy (BLSA), Beijing, 102209, China
- Peijian Cao
- China Tobacco Gene Research Center, Zhengzhou Tobacco Research Institute of CNTC, Zhengzhou, 450001, China
- Beijing Life Science Academy (BLSA), Beijing, 102209, China
- Qiansi Chen
- China Tobacco Gene Research Center, Zhengzhou Tobacco Research Institute of CNTC, Zhengzhou, 450001, China
- Beijing Life Science Academy (BLSA), Beijing, 102209, China
- Xiaoguang Liu
- Institution Henan International Laboratory for Green Pest Control, Henan Engineering Laboratory of Pest Biological Control, College of Plant Protection, Henan Agricultural University, Zhengzhou, 450000, China
|
12
|
Shearman JR, Naktang C, Sonthirod C, Kongkachana W, U-Thoomporn S, Jomchai N, Maknual C, Yamprasai S, Wanthongchai P, Pootakham W, Tangphatsornruang S. De novo assembly and analysis of Sonneratia ovata genome and population analysis. Genomics 2024; 116:110837. [PMID: 38548034; DOI: 10.1016/j.ygeno.2024.110837] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/13/2023] [Revised: 02/22/2024] [Accepted: 03/24/2024] [Indexed: 04/01/2024]
Abstract
Mangroves are an important part of coastal and estuarine ecosystems where they serve as nurseries for marine species and prevent coastal erosion. Here we report the genome of Sonneratia ovata, which is a true mangrove that grows in estuarine environments and can tolerate moderate salt exposure. We sequenced the S. ovata genome and assembled it into chromosome-level scaffolds through the use of Hi-C. The genome is 212.3 Mb and contains 12 chromosomes that range in size from 12.2 to 23.2 Mb. Annotation identified 29,829 genes with a BUSCO completeness of 95.9%. We identified salt genes and found copy number expansion of salt genes such as ADP-ribosylation factor 1, and elongation factor 1-alpha. Population analysis identified a low level of genetic variation and a lack of population structure within S. ovata.
Affiliation(s)
- Jeremy R Shearman
- National Center for Genetic Engineering and Biotechnology (BIOTEC), National Science and Technology Development Agency (NSTDA), Pathum Thani 12120, Thailand
- Chaiwat Naktang
- National Center for Genetic Engineering and Biotechnology (BIOTEC), National Science and Technology Development Agency (NSTDA), Pathum Thani 12120, Thailand
- Chutima Sonthirod
- National Center for Genetic Engineering and Biotechnology (BIOTEC), National Science and Technology Development Agency (NSTDA), Pathum Thani 12120, Thailand
- Wasitthee Kongkachana
- National Center for Genetic Engineering and Biotechnology (BIOTEC), National Science and Technology Development Agency (NSTDA), Pathum Thani 12120, Thailand
- Sonicha U-Thoomporn
- National Center for Genetic Engineering and Biotechnology (BIOTEC), National Science and Technology Development Agency (NSTDA), Pathum Thani 12120, Thailand
- Nukoon Jomchai
- National Center for Genetic Engineering and Biotechnology (BIOTEC), National Science and Technology Development Agency (NSTDA), Pathum Thani 12120, Thailand
- Chatree Maknual
- Department of Marine and Coastal Resources, 120 The Government Complex, Chaengwatthana Rd., Thung Song Hong, Bangkok 10210, Thailand
- Suchart Yamprasai
- Department of Marine and Coastal Resources, 120 The Government Complex, Chaengwatthana Rd., Thung Song Hong, Bangkok 10210, Thailand
- Poonsri Wanthongchai
- Department of Marine and Coastal Resources, 120 The Government Complex, Chaengwatthana Rd., Thung Song Hong, Bangkok 10210, Thailand
- Wirulda Pootakham
- National Center for Genetic Engineering and Biotechnology (BIOTEC), National Science and Technology Development Agency (NSTDA), Pathum Thani 12120, Thailand
- Sithichoke Tangphatsornruang
- National Center for Genetic Engineering and Biotechnology (BIOTEC), National Science and Technology Development Agency (NSTDA), Pathum Thani 12120, Thailand
|
13
|
Rigden DJ, Fernández XM. The 2024 Nucleic Acids Research database issue and the online molecular biology database collection. Nucleic Acids Res 2024; 52:D1-D9. [PMID: 38035367; PMCID: PMC10767945; DOI: 10.1093/nar/gkad1173] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/22/2023] [Accepted: 11/23/2023] [Indexed: 12/02/2023]
Abstract
The 2024 Nucleic Acids Research database issue contains 180 papers from across biology and neighbouring disciplines. There are 90 papers reporting on new databases and 83 updates from resources previously published in the Issue. Updates from databases most recently published elsewhere account for a further seven. Nucleic acid databases include the new NAKB for structural information and updates from Genbank, ENA, GEO, Tarbase and JASPAR. The Issue's Breakthrough Article concerns NMPFamsDB for novel prokaryotic protein families and the AlphaFold Protein Structure Database has an important update. Metabolism is covered by updates from Reactome, Wikipathways and Metabolights. Microbes are covered by RefSeq, UNITE, SPIRE and P10K; viruses by ViralZone and PhageScope. Medically-oriented databases include the familiar COSMIC, Drugbank and TTD. Genomics-related resources include Ensembl, UCSC Genome Browser and Monarch. New arrivals cover plant imaging (OPIA and PlantPAD) and crop plants (SoyMD, TCOD and CropGS-Hub). The entire Database Issue is freely available online on the Nucleic Acids Research website (https://academic.oup.com/nar). Over the last year the NAR online Molecular Biology Database Collection has been updated, reviewing 1060 entries, adding 97 new resources and eliminating 388 discontinued URLs bringing the current total to 1959 databases. It is available at http://www.oxfordjournals.org/nar/database/c/.
Affiliation(s)
- Daniel J Rigden
- Institute of Systems, Molecular and Integrative Biology, University of Liverpool, Crown Street, Liverpool L69 7ZB, UK
|
14
|
Li B. Unwrap RAP1's Mystery at Kinetoplastid Telomeres. Biomolecules 2024; 14:67. [PMID: 38254667; PMCID: PMC10813129; DOI: 10.3390/biom14010067] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/06/2023] [Revised: 12/27/2023] [Accepted: 12/27/2023] [Indexed: 01/24/2024]
Abstract
Although located at the chromosome end, telomeres are an essential chromosome component that helps maintain genome integrity and chromosome stability from protozoa to mammals. The role of telomere proteins in chromosome end protection is conserved, where they suppress various DNA damage response machineries and block nucleolytic degradation of the natural chromosome ends, although the detailed underlying mechanisms are not identical. In addition, the specialized telomere structure exerts a repressive epigenetic effect on expression of genes located at subtelomeres in a number of eukaryotic organisms. This so-called telomeric silencing also affects virulence of a number of microbial pathogens that undergo antigenic variation/phenotypic switching. Telomere proteins, particularly the RAP1 homologs, have been shown to be a key player for telomeric silencing. RAP1 homologs also suppress the expression of Telomere Repeat-containing RNA (TERRA), which is linked to their roles in telomere stability maintenance. The functions of RAP1s in suppressing telomere recombination are largely conserved from kinetoplastids to mammals. However, the underlying mechanisms of RAP1-mediated telomeric silencing have many species-specific features. In this review, I will focus on Trypanosoma brucei RAP1's functions in suppressing telomeric/subtelomeric DNA recombination and in the regulation of monoallelic expression of subtelomere-located major surface antigen genes. Common and unique mechanisms will be compared among RAP1 homologs, and their implications will be discussed.
Affiliation(s)
- Bibo Li
- Center for Gene Regulation in Health and Disease, Department of Biological, Geological, and Environmental Sciences, College of Arts and Sciences, Cleveland State University, 2121 Euclid Avenue, Cleveland, OH 44115, USA
- Case Comprehensive Cancer Center, Case Western Reserve University, 10900 Euclid Avenue, Cleveland, OH 44106, USA
- Department of Inflammation and Immunity, Lerner Research Institute, Cleveland Clinic, 9500 Euclid Avenue, Cleveland, OH 44195, USA
- Center for RNA Science and Therapeutics, Case Western Reserve University, 10900 Euclid Avenue, Cleveland, OH 44106, USA
|
15
|
Li B. Telomere maintenance in African trypanosomes. Front Mol Biosci 2023; 10:1302557. [PMID: 38074093; PMCID: PMC10704157; DOI: 10.3389/fmolb.2023.1302557] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 09/26/2023] [Accepted: 11/15/2023] [Indexed: 02/12/2024]
Abstract
Telomere maintenance is essential for genome integrity and chromosome stability in eukaryotic cells harboring linear chromosomes, as telomeres form a specialized structure that masks the natural chromosome ends from DNA damage repair machineries and prevents nucleolytic degradation of the telomeric DNA. In Trypanosoma brucei and several other microbial pathogens, virulence genes involved in antigenic variation, a key pathogenesis mechanism essential for host immune evasion and long-term infections, are located at subtelomeres, and expression and switching of these major surface antigens are regulated by telomere proteins and the telomere structure. Therefore, understanding telomere maintenance mechanisms and how these pathogens achieve a balance between stability and plasticity at telomeres and subtelomeres will help develop better means to eradicate human diseases caused by these pathogens. Telomere replication faces several challenges, and the "end replication problem" is a key obstacle that can cause progressive telomere shortening in proliferating cells. To overcome this challenge, most eukaryotes use telomerase to extend the G-rich telomere strand. In addition, a number of telomere proteins use sophisticated mechanisms to coordinate telomerase-mediated de novo telomere G-strand synthesis with telomere C-strand fill-in, a process that has been extensively studied in mammalian cells. However, we recently discovered that trypanosomes lack many of the telomere proteins identified in their mammalian hosts that are critical for telomere end processing. Rather, T. brucei uses a unique DNA polymerase, PolIE, which belongs to the DNA polymerase A family (E. coli DNA PolI family), to coordinate telomere G- and C-strand syntheses. In this review, I will first briefly summarize the current understanding of telomere end processing in mammals. Subsequently, I will describe PolIE-mediated coordination of telomere G- and C-strand synthesis in T. brucei and the implications of this recent discovery.
Affiliation(s)
- Bibo Li
- Center for Gene Regulation in Health and Disease, Department of Biological, Geological, and Environmental Sciences, College of Arts and Sciences, Cleveland State University, Cleveland, OH, United States
- Case Comprehensive Cancer Center, Case Western Reserve University, Cleveland, OH, United States
- Department of Inflammation and Immunity, Lerner Research Institute, Cleveland Clinic, Cleveland, OH, United States
- Center for RNA Science and Therapeutics, Case Western Reserve University, Cleveland, OH, United States
|
16
|
Gokhman VE. Chromosome study of the Hymenoptera (Insecta): from cytogenetics to cytogenomics. COMPARATIVE CYTOGENETICS 2023; 17:239-250. [PMID: 37953851; PMCID: PMC10632776; DOI: 10.3897/compcytogen.17.112332] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/06/2023] [Accepted: 10/19/2023] [Indexed: 11/14/2023]
Abstract
A brief overview of the current stage of the chromosome study of the insect order Hymenoptera is given. It is demonstrated that, in addition to routine staining and other traditional techniques of chromosome research, karyotypes of an increasing number of hymenopterans are being studied using molecular methods, e.g., staining with base-specific fluorochromes and fluorescence in situ hybridization (FISH), including microdissection and chromosome painting. Due to the advent of whole genome sequencing and other molecular techniques, together with the "big data" approach to the chromosomal data, the current stage of the chromosome research on Hymenoptera represents a transition from Hymenoptera cytogenetics to cytogenomics.
Affiliation(s)
- Vladimir E. Gokhman
- Botanical Garden, Moscow State University, Moscow 119234, Russia
|