1
Mouratidis I, Baltoumas FA, Chantzi N, Patsakis M, Chan CS, Montgomery A, Konnaris MA, Aplakidou E, Georgakopoulos GC, Das A, Chartoumpekis DV, Kovac J, Pavlopoulos GA, Georgakopoulos-Soares I. kmerDB: A database encompassing the set of genomic and proteomic sequence information for each species. Comput Struct Biotechnol J 2024; 23:1919-1928. [PMID: 38711760 PMCID: PMC11070822 DOI: 10.1016/j.csbj.2024.04.050] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/12/2023] [Revised: 04/17/2024] [Accepted: 04/18/2024] [Indexed: 05/08/2024] Open
Abstract
The decrease in sequencing expenses has facilitated the creation of reference genomes and proteomes for an expanding array of organisms. Nevertheless, no established repository that details organism-specific genomic and proteomic sequences of specific lengths, referred to as kmers, exists to our knowledge. In this article, we present kmerDB, a database accessible through an interactive web interface that provides kmer-based information from genomic and proteomic sequences in a systematic way. kmerDB currently contains 202,340,859,107 base pairs and 19,304,903,356 amino acids, spanning 54,039 and 21,865 reference genomes and proteomes, respectively, as well as 6,905,362 and 149,305,183 genomic and proteomic species-specific sequences, termed quasi-primes. Additionally, we provide access to 5,186,757 nucleic and 214,904,089 peptide sequences absent from every genome and proteome, termed primes. kmerDB features a user-friendly interface offering various search options and filters for easy parsing and searching. The service is available at: www.kmerdb.com.
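As a rough illustration of the terminology in this abstract (not kmerDB's actual pipeline), the sketch below enumerates kmers from toy sequences, then derives quasi-primes (kmers found in exactly one species) and primes (kmers absent from every species). The function names, the toy sequences, and the choice of k = 2 are assumptions made for the example, and reverse complements are ignored for simplicity.

```python
from itertools import product


def kmers(seq: str, k: int) -> set[str]:
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def quasi_primes(species_seqs: dict[str, str], k: int) -> dict[str, set[str]]:
    """Map each species to the kmers found in it and in no other species."""
    per_species = {name: kmers(seq, k) for name, seq in species_seqs.items()}
    result = {}
    for name, own in per_species.items():
        # Union of the kmer sets of every other species.
        others = set().union(*(s for n, s in per_species.items() if n != name))
        result[name] = own - others
    return result


def primes(species_seqs: dict[str, str], k: int, alphabet: str = "ACGT") -> set[str]:
    """Return the kmers over the alphabet that occur in none of the sequences."""
    seen = set().union(*(kmers(seq, k) for seq in species_seqs.values()))
    every_kmer = {"".join(p) for p in product(alphabet, repeat=k)}
    return every_kmer - seen


# Hypothetical toy "genomes"; real inputs would be whole reference assemblies.
toy = {"species_a": "ACGTAC", "species_b": "TTGACG"}
print(quasi_primes(toy, 2))
print(primes(toy, 2))
```

Holding full kmer sets in memory is only workable at toy scale; for the genome counts quoted above, a disk-backed or streaming kmer counter would be the more realistic choice.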
Affiliation(s)
- Ioannis Mouratidis
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
  - Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, PA, USA
- Fotis A. Baltoumas
  - Institute for Fundamental Biomedical Research, BSRC "Alexander Fleming", Vari, 16672, Greece
- Nikol Chantzi
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Michail Patsakis
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Candace S.Y. Chan
  - Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, San Francisco, CA, USA
- Austin Montgomery
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Maxwell A. Konnaris
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
  - Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, PA, USA
  - Department of Statistics, The Pennsylvania State University, University Park, PA, USA
- Eleni Aplakidou
  - Institute for Fundamental Biomedical Research, BSRC "Alexander Fleming", Vari, 16672, Greece
  - Department of Basic Sciences, School of Medicine, University of Crete, Heraklion, Greece
- George C. Georgakopoulos
  - National Technical University of Athens, School of Electrical and Computer Engineering, Athens, Greece
- Anshuman Das
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Dionysios V. Chartoumpekis
  - Service of Endocrinology, Diabetology and Metabolism, Lausanne University Hospital, Lausanne, Switzerland
- Jasna Kovac
  - Department of Food Science, The Pennsylvania State University, University Park, PA 16802, USA
- Georgios A. Pavlopoulos
  - Institute for Fundamental Biomedical Research, BSRC "Alexander Fleming", Vari, 16672, Greece
  - Center for New Biotechnologies and Precision Medicine, School of Medicine, National and Kapodistrian University of Athens, Athens, 11527, Greece
- Ilias Georgakopoulos-Soares
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
2
Khalaf A, Francis O, Blaxter ML. Genome evolution in intracellular parasites: Microsporidia and Apicomplexa. J Eukaryot Microbiol 2024:e13033. [PMID: 38785208 DOI: 10.1111/jeu.13033] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/14/2024] [Revised: 03/29/2024] [Accepted: 05/02/2024] [Indexed: 05/25/2024]
Abstract
Microsporidia and Apicomplexa are eukaryotic, single-celled, intracellular parasites with huge public health and economic importance. Typically, these parasites are studied separately, emphasizing their uniqueness and diversity. In this review, we explore the huge amount of genomic data that has recently become available for the two groups. We compare and contrast their genome evolution and discuss how their transitions to intracellular life may have shaped it. In particular, we explore genome reduction and compaction, genome expansion and ploidy, gene shuffling and rearrangements, and the evolution of centromeres and telomeres.
Affiliation(s)
- Amjad Khalaf
  - Tree of Life, Wellcome Sanger Institute, Cambridge, UK
- Ore Francis
  - Tree of Life, Wellcome Sanger Institute, Cambridge, UK
3
Xu Y, Wang C, Li Z, Zheng X, Kang Z, Lu P, Zhang J, Cao P, Chen Q, Liu X. A chromosome-level haplotype-resolved genome assembly of oriental tobacco budworm (Helicoverpa assulta). Sci Data 2024; 11:461. [PMID: 38710675 DOI: 10.1038/s41597-024-03264-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/25/2023] [Accepted: 04/15/2024] [Indexed: 05/08/2024] Open
Abstract
Oriental tobacco budworm (Helicoverpa assulta) and cotton bollworm (Helicoverpa armigera) are two closely related species within the genus Helicoverpa. They have similar appearances and consistent damage patterns, often leading to confusion. However, the cotton bollworm is a typical polyphagous insect, while the oriental tobacco budworm belongs to the oligophagous insects. In this study, we used Nanopore, PacBio, and Illumina platforms to sequence the genome of H. assulta and used Hifiasm to create a haplotype-resolved draft genome. The Hi-C technique helped anchor 33 primary contigs to 32 chromosomes, including two sex chromosomes, Z and W. The final primary haploid genome assembly was approximately 415.19 Mb in length. BUSCO analysis revealed a high degree of completeness, with 99.0% gene coverage in this genome assembly. The repeat sequences constituted 38.39% of the genome assembly, and we annotated 17093 protein-coding genes. The high-quality genome assembly of the oriental tobacco budworm serves as a valuable genetic resource that enhances our comprehension of how they select hosts in a complex odour environment. It will also aid in developing an effective control policy.
Affiliation(s)
- Yalong Xu
  - China Tobacco Gene Research Center, Zhengzhou Tobacco Research Institute of CNTC, Zhengzhou, 450001, China
  - Beijing Life Science Academy (BLSA), Beijing, 102209, China
- Chen Wang
  - China Tobacco Gene Research Center, Zhengzhou Tobacco Research Institute of CNTC, Zhengzhou, 450001, China
  - Beijing Life Science Academy (BLSA), Beijing, 102209, China
- Zefeng Li
  - China Tobacco Gene Research Center, Zhengzhou Tobacco Research Institute of CNTC, Zhengzhou, 450001, China
  - Beijing Life Science Academy (BLSA), Beijing, 102209, China
- Xueao Zheng
  - China Tobacco Gene Research Center, Zhengzhou Tobacco Research Institute of CNTC, Zhengzhou, 450001, China
  - Beijing Life Science Academy (BLSA), Beijing, 102209, China
- Zhengzhong Kang
  - China Tobacco Gene Research Center, Zhengzhou Tobacco Research Institute of CNTC, Zhengzhou, 450001, China
  - Beijing Life Science Academy (BLSA), Beijing, 102209, China
- Peng Lu
  - China Tobacco Gene Research Center, Zhengzhou Tobacco Research Institute of CNTC, Zhengzhou, 450001, China
  - Beijing Life Science Academy (BLSA), Beijing, 102209, China
- Jianfeng Zhang
  - China Tobacco Gene Research Center, Zhengzhou Tobacco Research Institute of CNTC, Zhengzhou, 450001, China
  - Beijing Life Science Academy (BLSA), Beijing, 102209, China
- Peijian Cao
  - China Tobacco Gene Research Center, Zhengzhou Tobacco Research Institute of CNTC, Zhengzhou, 450001, China
  - Beijing Life Science Academy (BLSA), Beijing, 102209, China
- Qiansi Chen
  - China Tobacco Gene Research Center, Zhengzhou Tobacco Research Institute of CNTC, Zhengzhou, 450001, China
  - Beijing Life Science Academy (BLSA), Beijing, 102209, China
- Xiaoguang Liu
  - Institution Henan International Laboratory for Green Pest Control, Henan Engineering Laboratory of Pest Biological Control, College of Plant Protection, Henan Agricultural University, Zhengzhou, 450000, China
4
Shearman JR, Naktang C, Sonthirod C, Kongkachana W, U-Thoomporn S, Jomchai N, Maknual C, Yamprasai S, Wanthongchai P, Pootakham W, Tangphatsornruang S. De novo assembly and analysis of Sonneratia ovata genome and population analysis. Genomics 2024; 116:110837. [PMID: 38548034 DOI: 10.1016/j.ygeno.2024.110837] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/13/2023] [Revised: 02/22/2024] [Accepted: 03/24/2024] [Indexed: 04/01/2024]
Abstract
Mangroves are an important part of coastal and estuarine ecosystems where they serve as nurseries for marine species and prevent coastal erosion. Here we report the genome of Sonneratia ovata, which is a true mangrove that grows in estuarine environments and can tolerate moderate salt exposure. We sequenced the S. ovata genome and assembled it into chromosome-level scaffolds through the use of Hi-C. The genome is 212.3 Mb and contains 12 chromosomes that range in size from 12.2 to 23.2 Mb. Annotation identified 29,829 genes with a BUSCO completeness of 95.9%. We identified salt genes and found copy number expansion of salt genes such as ADP-ribosylation factor 1, and elongation factor 1-alpha. Population analysis identified a low level of genetic variation and a lack of population structure within S. ovata.
Affiliation(s)
- Jeremy R Shearman
  - National Center for Genetic Engineering and Biotechnology (BIOTEC), National Science and Technology Development Agency (NSTDA), Pathum Thani 12120, Thailand
- Chaiwat Naktang
  - National Center for Genetic Engineering and Biotechnology (BIOTEC), National Science and Technology Development Agency (NSTDA), Pathum Thani 12120, Thailand
- Chutima Sonthirod
  - National Center for Genetic Engineering and Biotechnology (BIOTEC), National Science and Technology Development Agency (NSTDA), Pathum Thani 12120, Thailand
- Wasitthee Kongkachana
  - National Center for Genetic Engineering and Biotechnology (BIOTEC), National Science and Technology Development Agency (NSTDA), Pathum Thani 12120, Thailand
- Sonicha U-Thoomporn
  - National Center for Genetic Engineering and Biotechnology (BIOTEC), National Science and Technology Development Agency (NSTDA), Pathum Thani 12120, Thailand
- Nukoon Jomchai
  - National Center for Genetic Engineering and Biotechnology (BIOTEC), National Science and Technology Development Agency (NSTDA), Pathum Thani 12120, Thailand
- Chatree Maknual
  - Department of Marine and Coastal Resources, 120 The Government Complex, Chaengwatthana Rd., Thung Song Hong, Bangkok 10210, Thailand
- Suchart Yamprasai
  - Department of Marine and Coastal Resources, 120 The Government Complex, Chaengwatthana Rd., Thung Song Hong, Bangkok 10210, Thailand
- Poonsri Wanthongchai
  - Department of Marine and Coastal Resources, 120 The Government Complex, Chaengwatthana Rd., Thung Song Hong, Bangkok 10210, Thailand
- Wirulda Pootakham
  - National Center for Genetic Engineering and Biotechnology (BIOTEC), National Science and Technology Development Agency (NSTDA), Pathum Thani 12120, Thailand
- Sithichoke Tangphatsornruang
  - National Center for Genetic Engineering and Biotechnology (BIOTEC), National Science and Technology Development Agency (NSTDA), Pathum Thani 12120, Thailand
5
Malusare A, Kothandaraman H, Tamboli D, Lanman NA, Aggarwal V. Understanding the Natural Language of DNA using Encoder-Decoder Foundation Models with Byte-level Precision. arXiv 2024:arXiv:2311.02333v2. [PMID: 38410643 PMCID: PMC10896356] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/28/2024]
Abstract
This paper presents the Ensemble Nucleotide Byte-level Encoder-Decoder (ENBED) foundation model, analyzing DNA sequences at byte-level precision with an encoder-decoder Transformer architecture. ENBED uses a sub-quadratic implementation of attention to develop an efficient model capable of sequence-to-sequence transformations, generalizing previous genomic models with encoder-only or decoder-only architectures. We use Masked Language Modeling to pre-train the foundation model using reference genome sequences and apply it in the following downstream tasks: (1) identification of enhancers, promoters and splice sites, (2) recognition of sequences containing base call mismatches and insertion/deletion errors, an advantage over tokenization schemes involving multiple base pairs, which lose the ability to analyze with byte-level precision, (3) identification of biological function annotations of genomic sequences, and (4) generating mutations of the Influenza virus using the encoder-decoder architecture and validating them against real-world observations. In each of these tasks, we demonstrate significant improvement as compared to the existing state-of-the-art results.
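The contrast drawn in task (2) between byte-level and multi-base tokenization can be illustrated with a toy comparison. This is not the ENBED tokenizer; the function names and the example sequence are hypothetical. With non-overlapping 3-base tokens, a single inserted base shifts the reading frame so that downstream tokens no longer match the reference, whereas with one token per nucleotide every reference token survives on either side of the insertion.

```python
def chunk_tokens(seq: str, k: int) -> list[str]:
    """Split seq into non-overlapping k-base tokens (the last may be shorter)."""
    return [seq[i:i + k] for i in range(0, len(seq), k)]


def byte_tokens(seq: str) -> list[int]:
    """Encode each nucleotide as its single ASCII byte value."""
    return list(seq.encode("ascii"))


def shared_ends(a: list, b: list) -> int:
    """Count tokens shared as a common prefix plus a common suffix of a and b."""
    shortest = min(len(a), len(b))
    prefix = 0
    while prefix < shortest and a[prefix] == b[prefix]:
        prefix += 1
    suffix = 0
    # Cap the suffix so that prefix and suffix never overlap.
    while suffix < shortest - prefix and a[-1 - suffix] == b[-1 - suffix]:
        suffix += 1
    return prefix + suffix


reference = "ACGTTGCAAGCT"
with_insertion = "ACGATTGCAAGCT"  # one extra 'A' inserted after the first three bases

# Reference tokens preserved around the insertion under each scheme.
print(shared_ends(chunk_tokens(reference, 3), chunk_tokens(with_insertion, 3)))
print(shared_ends(byte_tokens(reference), byte_tokens(with_insertion)))
```

Here only 1 of the 4 reference 3-base tokens is preserved, against all 12 single-nucleotide tokens, which is the sense in which coarser tokens obscure where a single-base error occurred.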
6
Rigden DJ, Fernández XM. The 2024 Nucleic Acids Research database issue and the online molecular biology database collection. Nucleic Acids Res 2024; 52:D1-D9. [PMID: 38035367 PMCID: PMC10767945 DOI: 10.1093/nar/gkad1173] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/22/2023] [Accepted: 11/23/2023] [Indexed: 12/02/2023] Open
Abstract
The 2024 Nucleic Acids Research database issue contains 180 papers from across biology and neighbouring disciplines. There are 90 papers reporting on new databases and 83 updates from resources previously published in the Issue. Updates from databases most recently published elsewhere account for a further seven. Nucleic acid databases include the new NAKB for structural information and updates from Genbank, ENA, GEO, Tarbase and JASPAR. The Issue's Breakthrough Article concerns NMPFamsDB for novel prokaryotic protein families and the AlphaFold Protein Structure Database has an important update. Metabolism is covered by updates from Reactome, Wikipathways and Metabolights. Microbes are covered by RefSeq, UNITE, SPIRE and P10K; viruses by ViralZone and PhageScope. Medically-oriented databases include the familiar COSMIC, Drugbank and TTD. Genomics-related resources include Ensembl, UCSC Genome Browser and Monarch. New arrivals cover plant imaging (OPIA and PlantPAD) and crop plants (SoyMD, TCOD and CropGS-Hub). The entire Database Issue is freely available online on the Nucleic Acids Research website (https://academic.oup.com/nar). Over the last year the NAR online Molecular Biology Database Collection has been updated, reviewing 1060 entries, adding 97 new resources and eliminating 388 discontinued URLs bringing the current total to 1959 databases. It is available at http://www.oxfordjournals.org/nar/database/c/.
Affiliation(s)
- Daniel J Rigden
  - Institute of Systems, Molecular and Integrative Biology, University of Liverpool, Crown Street, Liverpool L69 7ZB, UK
7
Li B. Unwrap RAP1's Mystery at Kinetoplastid Telomeres. Biomolecules 2024; 14:67. [PMID: 38254667 PMCID: PMC10813129 DOI: 10.3390/biom14010067] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/06/2023] [Revised: 12/27/2023] [Accepted: 12/27/2023] [Indexed: 01/24/2024] Open
Abstract
Although located at the chromosome end, telomeres are an essential chromosome component that helps maintain genome integrity and chromosome stability from protozoa to mammals. The role of telomere proteins in chromosome end protection is conserved, where they suppress various DNA damage response machineries and block nucleolytic degradation of the natural chromosome ends, although the detailed underlying mechanisms are not identical. In addition, the specialized telomere structure exerts a repressive epigenetic effect on expression of genes located at subtelomeres in a number of eukaryotic organisms. This so-called telomeric silencing also affects virulence of a number of microbial pathogens that undergo antigenic variation/phenotypic switching. Telomere proteins, particularly the RAP1 homologs, have been shown to be key players in telomeric silencing. RAP1 homologs also suppress the expression of Telomere Repeat-containing RNA (TERRA), which is linked to their roles in telomere stability maintenance. The functions of RAP1s in suppressing telomere recombination are largely conserved from kinetoplastids to mammals. However, the underlying mechanisms of RAP1-mediated telomeric silencing have many species-specific features. In this review, I will focus on Trypanosoma brucei RAP1's functions in suppressing telomeric/subtelomeric DNA recombination and in the regulation of monoallelic expression of subtelomere-located major surface antigen genes. Common and unique mechanisms will be compared among RAP1 homologs, and their implications will be discussed.
Affiliation(s)
- Bibo Li
  - Center for Gene Regulation in Health and Disease, Department of Biological, Geological, and Environmental Sciences, College of Arts and Sciences, Cleveland State University, 2121 Euclid Avenue, Cleveland, OH 44115, USA
  - Case Comprehensive Cancer Center, Case Western Reserve University, 10900 Euclid Avenue, Cleveland, OH 44106, USA
  - Department of Inflammation and Immunity, Lerner Research Institute, Cleveland Clinic, 9500 Euclid Avenue, Cleveland, OH 44195, USA
  - Center for RNA Science and Therapeutics, Case Western Reserve University, 10900 Euclid Avenue, Cleveland, OH 44106, USA
8
Li B. Telomere maintenance in African trypanosomes. Front Mol Biosci 2023; 10:1302557. [PMID: 38074093 PMCID: PMC10704157 DOI: 10.3389/fmolb.2023.1302557] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 09/26/2023] [Accepted: 11/15/2023] [Indexed: 02/12/2024] Open
Abstract
Telomere maintenance is essential for genome integrity and chromosome stability in eukaryotic cells harboring linear chromosomes, as the telomere forms a specialized structure to mask the natural chromosome ends from DNA damage repair machineries and to prevent nucleolytic degradation of the telomeric DNA. In Trypanosoma brucei and several other microbial pathogens, virulence genes involved in antigenic variation, a key pathogenesis mechanism essential for host immune evasion and long-term infections, are located at subtelomeres, and expression and switching of these major surface antigens are regulated by telomere proteins and the telomere structure. Therefore, understanding telomere maintenance mechanisms and how these pathogens achieve a balance between stability and plasticity at the telomere/subtelomere will help develop better means to eradicate human diseases caused by these pathogens. Telomere replication faces several challenges, and the "end replication problem" is a key obstacle that can cause progressive telomere shortening in proliferating cells. To overcome this challenge, most eukaryotes use telomerase to extend the G-rich telomere strand. In addition, a number of telomere proteins use sophisticated mechanisms to coordinate the telomerase-mediated de novo telomere G-strand synthesis and the telomere C-strand fill-in, which has been extensively studied in mammalian cells. However, we recently discovered that trypanosomes lack many telomere proteins identified in their mammalian host that are critical for telomere end processing. Rather, T. brucei uses a unique DNA polymerase, PolIE, which belongs to the DNA polymerase A family (E. coli DNA PolI family), to coordinate the telomere G- and C-strand syntheses. In this review, I will first briefly summarize current understanding of telomere end processing in mammals. Subsequently, I will describe PolIE-mediated coordination of telomere G- and C-strand synthesis in T. brucei and the implications of this recent discovery.
Affiliation(s)
- Bibo Li
  - Center for Gene Regulation in Health and Disease, Department of Biological, Geological, and Environmental Sciences, College of Arts and Sciences, Cleveland State University, Cleveland, OH, United States
  - Case Comprehensive Cancer Center, Case Western Reserve University, Cleveland, OH, United States
  - Department of Inflammation and Immunity, Lerner Research Institute, Cleveland Clinic, Cleveland, OH, United States
  - Center for RNA Science and Therapeutics, Case Western Reserve University, Cleveland, OH, United States
9
Gokhman VE. Chromosome study of the Hymenoptera (Insecta): from cytogenetics to cytogenomics. Comparative Cytogenetics 2023; 17:239-250. [PMID: 37953851 PMCID: PMC10632776 DOI: 10.3897/compcytogen.17.112332] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/06/2023] [Accepted: 10/19/2023] [Indexed: 11/14/2023]
Abstract
A brief overview of the current stage of the chromosome study of the insect order Hymenoptera is given. It is demonstrated that, in addition to routine staining and other traditional techniques of chromosome research, karyotypes of an increasing number of hymenopterans are being studied using molecular methods, e.g., staining with base-specific fluorochromes and fluorescence in situ hybridization (FISH), including microdissection and chromosome painting. Due to the advent of whole genome sequencing and other molecular techniques, together with the "big data" approach to the chromosomal data, the current stage of the chromosome research on Hymenoptera represents a transition from Hymenoptera cytogenetics to cytogenomics.
Affiliation(s)
- Vladimir E. Gokhman
  - Botanical Garden, Moscow State University, Moscow 119234, Russia