1
Takeda A, Nonaka D, Imazu Y, Fukunaga T, Hamada M. REPrise: de novo interspersed repeat detection using inexact seeding. Mob DNA 2025; 16:16. [PMID: 40181468 PMCID: PMC11966803 DOI: 10.1186/s13100-025-00353-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/06/2024] [Accepted: 03/17/2025] [Indexed: 04/05/2025] Open
Abstract
BACKGROUND Interspersed repeats occupy a large part of many eukaryotic genomes, and thus their accurate annotation is essential for various genome analyses. Database-free de novo repeat detection approaches are powerful for annotating genomes that lack well-curated repeat databases. However, existing tools do not yet have sufficient repeat detection performance. RESULTS In this study, we developed REPrise, a de novo interspersed repeat detection software program based on a seed-and-extension method. Although the algorithm of REPrise is similar to that of RepeatScout, which is currently the de facto standard tool, we incorporated three unique techniques into REPrise: inexact seeding, affine gap scoring and loose masking. Analyses of rice and simulation genome datasets showed that REPrise outperformed RepeatScout in terms of sensitivity, especially when the repeat sequences contained many mutations. Furthermore, when applied to the complete human genome dataset T2T-CHM13, REPrise demonstrated the potential to detect novel repeat sequence families. CONCLUSION REPrise can detect interspersed repeats with high sensitivity even in long genomes. Our software enhances repeat annotation in diverse genomic studies, contributing to a deeper understanding of genomic structures.
Affiliation(s)
- Atsushi Takeda
  - Department of Electrical Engineering and Bioscience, Graduate School of Advanced Science and Engineering, Waseda University, Tokyo, 1698555, Japan
  - Computational Bio Big-Data Open Innovation Laboratory, AIST-Waseda University, Tokyo, 1698555, Japan
- Daisuke Nonaka
  - Department of Computer Science, Graduate School of Information Science and Technology, the University of Tokyo, Tokyo, 1130032, Japan
- Yuta Imazu
  - Department of Electrical Engineering and Bioscience, School of Advanced Science and Engineering, Waseda University, Tokyo, 1698555, Japan
- Tsukasa Fukunaga
  - Department of Computer Science, Graduate School of Information Science and Technology, the University of Tokyo, Tokyo, 1130032, Japan
  - Waseda Institute for Advanced Study, Waseda University, Tokyo, 1690051, Japan
- Michiaki Hamada
  - Department of Electrical Engineering and Bioscience, Graduate School of Advanced Science and Engineering, Waseda University, Tokyo, 1698555, Japan
  - Computational Bio Big-Data Open Innovation Laboratory, AIST-Waseda University, Tokyo, 1698555, Japan
  - Graduate School of Medicine, Nippon Medical School, Tokyo, 1138602, Japan
2
Orozco-Arias S, Sierra P, Durbin R, González J. MCHelper automatically curates transposable element libraries across eukaryotic species. Genome Res 2024; 34:2256-2268. [PMID: 39653419 PMCID: PMC11694758 DOI: 10.1101/gr.278821.123] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/05/2023] [Accepted: 09/18/2024] [Indexed: 12/25/2024]
Abstract
The number of species with high-quality genome sequences continues to increase, in part due to the scaling up of multiple large-scale biodiversity sequencing projects. While the need to annotate genic sequences in these genomes is widely acknowledged, the parallel need to annotate transposable element (TE) sequences that have been shown to alter genome architecture, rewire gene regulatory networks, and contribute to the evolution of host traits is becoming ever more evident. However, accurate genome-wide annotation of TE sequences is still technically challenging. Several de novo TE identification tools are now available, but manual curation of the libraries produced by these tools is needed to generate high-quality genome annotations. Manual curation is time-consuming, and thus impractical for large-scale genomic studies, and lacks reproducibility. In this work, we present the Manual Curator Helper tool MCHelper, which automates the TE library curation process. By leveraging MCHelper's fully automated mode with the outputs from three de novo TE identification tools, RepeatModeler2, EDTA, and REPET, in the fruit fly, rice, hooded crow, zebrafish, maize, and human, we show a substantial improvement in the quality of the TE libraries and genome annotations. MCHelper libraries are less redundant, with up to 65% reduction in the number of consensus sequences, have up to 11.4% fewer false positive sequences, and up to ∼48% fewer "unclassified/unknown" TE consensus sequences. Genome-wide TE annotations are also improved, including larger unfragmented insertions. Moreover, MCHelper is an easy-to-install and easy-to-use tool.
Affiliation(s)
- Pío Sierra
  - Department of Genetics, University of Cambridge, Cambridge CB2 3EH, United Kingdom
- Richard Durbin
  - Department of Genetics, University of Cambridge, Cambridge CB2 3EH, United Kingdom
- Josefa González
  - Institute of Evolutionary Biology, CSIC, UPF, 08003 Barcelona, Spain
  - Institut Botànic de Barcelona (IBB), CSIC-CMCNB, 08038 Barcelona, Spain
3
Choi BY, Kim J, Park H, Kim J, Han S, Jo IH, Shim D. De Novo Genome Assembly and Phylogenetic Analysis of Cirsium nipponicum. Genes (Basel) 2024; 15:1269. [PMID: 39457393 PMCID: PMC11507141 DOI: 10.3390/genes15101269] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/04/2024] [Revised: 09/20/2024] [Accepted: 09/25/2024] [Indexed: 10/28/2024] Open
Abstract
Background: Cirsium nipponicum, a pharmaceutically valuable plant from the Asteraceae family, has been utilized for over 2000 years. Unlike other thistles, it is native to East Asia and found exclusively on Ulleung Island on the Korean Peninsula. Despite its significance, the genome information of C. nipponicum has remained unclear. Methods: In this study, we assembled the genome of C. nipponicum using both short reads from Illumina sequencing and long reads from Nanopore sequencing. Results: The assembled genome is 929.4 Mb in size with an N50 length of 0.7 Mb, covering 95.1% of BUSCO core groups listed in eudicots_odb10. Repeat sequences accounted for 70.94% of the assembled genome. We curated 31,263 protein-coding genes, of which 28,752 were functionally annotated using public databases. Phylogenetic analysis of 11 plant species using single-copy orthologs revealed that C. nipponicum diverged from Cynara cardunculus approximately 15.9 million years ago. Gene family evolutionary analysis revealed significant expansion and contraction in genes involved in abscisic acid biosynthesis, late endosome to vacuole transport, response to nitrate, and abaxial cell fate specification. Conclusions: This study provides a reference genome of C. nipponicum, enhancing our understanding of its genetic background and facilitating an exploration of genetic resources for beneficial phytochemicals.
Affiliation(s)
- Bae Young Choi
  - School of Liberal Arts and Sciences, Korea National University of Transportation, Chungju 27469, Republic of Korea
- Jaewook Kim
  - Department of Biology Education, Korea National University of Education, Cheongju 28173, Republic of Korea
- Hyeonseon Park
  - Department of Biological Sciences, Chungnam National University, Daejeon 34134, Republic of Korea
- Jincheol Kim
  - Department of Crop Science and Biotechnology, Dankook University, Cheonan 31116, Republic of Korea
- Seahee Han
  - Division of Botany, Honam National Institute of Biological Resources, Mokpo 58762, Republic of Korea
- Ick-Hyun Jo
  - Department of Crop Science and Biotechnology, Dankook University, Cheonan 31116, Republic of Korea
- Donghwan Shim
  - Department of Biological Sciences, Chungnam National University, Daejeon 34134, Republic of Korea
  - Center for Genome Engineering, Institute for Basic Science, Daejeon 34126, Republic of Korea
4
Hu K, Ni P, Xu M, Zou Y, Chang J, Gao X, Li Y, Ruan J, Hu B, Wang J. HiTE: a fast and accurate dynamic boundary adjustment approach for full-length transposable element detection and annotation. Nat Commun 2024; 15:5573. [PMID: 38956036 PMCID: PMC11219922 DOI: 10.1038/s41467-024-49912-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/09/2023] [Accepted: 06/25/2024] [Indexed: 07/04/2024] Open
Abstract
Recent advancements in genome assembly have greatly improved the prospects for comprehensive annotation of Transposable Elements (TEs). However, existing methods for TE annotation using genome assemblies suffer from limited accuracy and robustness, requiring extensive manual editing. In addition, the currently available gold-standard TE databases are not comprehensive, even for extensively studied species, highlighting the critical need for an automated TE detection method to supplement existing repositories. In this study, we introduce HiTE, a fast and accurate dynamic boundary adjustment approach designed to detect full-length TEs. The experimental results demonstrate that HiTE outperforms RepeatModeler2, the state-of-the-art tool, across various species. Furthermore, HiTE has identified numerous novel transposons with well-defined structures containing protein-coding domains, some of which are directly inserted within crucial genes, leading to direct alterations in gene expression. A Nextflow version of HiTE is also available, with enhanced parallelism, reproducibility, and portability.
Affiliation(s)
- Kang Hu
  - School of Computer Science and Engineering, Central South University, Changsha, 410083, China
  - Xiangjiang Laboratory, Changsha, 410205, China
  - Hunan Provincial Key Lab on Bioinformatics, Central South University, Changsha, 410083, China
- Peng Ni
  - School of Computer Science and Engineering, Central South University, Changsha, 410083, China
  - Xiangjiang Laboratory, Changsha, 410205, China
  - Hunan Provincial Key Lab on Bioinformatics, Central South University, Changsha, 410083, China
- Minghua Xu
  - School of Computer Science and Engineering, Central South University, Changsha, 410083, China
  - Hunan Provincial Key Lab on Bioinformatics, Central South University, Changsha, 410083, China
- You Zou
  - School of Computer Science and Engineering, Central South University, Changsha, 410083, China
  - Hunan Provincial Key Lab on Bioinformatics, Central South University, Changsha, 410083, China
- Jianye Chang
  - Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, 518000, China
- Xin Gao
  - Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia
  - Center of Excellence on Smart Health, King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia
- Yaohang Li
  - Department of Computer Science, Old Dominion University, Norfolk, VA, 23529, USA
- Jue Ruan
  - Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, 518000, China
- Bin Hu
  - Key Laboratory of Brain Health Intelligent Evaluation and Intervention, Ministry of Education (Beijing Institute of Technology), Beijing, P. R. China
  - School of Medical Technology, Beijing Institute of Technology, Beijing, P. R. China
- Jianxin Wang
  - School of Computer Science and Engineering, Central South University, Changsha, 410083, China
  - Xiangjiang Laboratory, Changsha, 410205, China
  - Hunan Provincial Key Lab on Bioinformatics, Central South University, Changsha, 410083, China
5
Baril T, Galbraith J, Hayward A. Earl Grey: A Fully Automated User-Friendly Transposable Element Annotation and Analysis Pipeline. Mol Biol Evol 2024; 41:msae068. [PMID: 38577785 PMCID: PMC11003543 DOI: 10.1093/molbev/msae068] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/24/2023] [Revised: 02/20/2024] [Accepted: 03/22/2024] [Indexed: 04/06/2024] Open
Abstract
Transposable elements (TEs) are major components of eukaryotic genomes and are implicated in a range of evolutionary processes. Yet, TE annotation and characterization remain challenging, particularly for nonspecialists, since existing pipelines are typically complicated to install, run, and extract data from. Current methods of automated TE annotation are also subject to issues that reduce overall quality, particularly (i) fragmented and overlapping TE annotations, leading to erroneous estimates of TE count and coverage, and (ii) repeat models represented by short sections of total TE length, with poor capture of 5' and 3' ends. To address these issues, we present Earl Grey, a fully automated TE annotation pipeline designed for user-friendly curation and annotation of TEs in eukaryotic genome assemblies. Using nine simulated genomes and an annotation of Drosophila melanogaster, we show that Earl Grey outperforms current widely used TE annotation methodologies in ameliorating the issues mentioned above while scoring highly in benchmarking for TE annotation and classification and being robust across genomic contexts. Earl Grey provides a comprehensive and fully automated TE annotation toolkit that provides researchers with paper-ready summary figures and outputs in standard formats compatible with other bioinformatics tools. Earl Grey has a modular format, with great scope for the inclusion of additional modules focused on further quality control and tailored analyses in future releases.
Affiliation(s)
- Tobias Baril
  - Centre for Ecology and Conservation, University of Exeter, Penryn Campus, Cornwall TR10 9FE, UK
  - Laboratory of Evolutionary Genetics, Institute of Biology, University of Neuchâtel, 2000 Neuchâtel, Switzerland
- James Galbraith
  - Centre for Ecology and Conservation, University of Exeter, Penryn Campus, Cornwall TR10 9FE, UK
  - Institute of Evolutionary Biology, University of Edinburgh, Edinburgh EH9 3FL, UK
- Alex Hayward
  - Centre for Ecology and Conservation, University of Exeter, Penryn Campus, Cornwall TR10 9FE, UK
6
Guan DL, Chen YZ, Qin YC, Li XD, Deng WA. Chromosomal-Level Reference Genome for the Chinese Endemic Pygmy Grasshopper, Zhengitettix transpicula, Sheds Light on Tetrigidae Evolution and Advancing Conservation Efforts. Insects 2024; 15:223. [PMID: 38667352 PMCID: PMC11049975 DOI: 10.3390/insects15040223] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/22/2024] [Revised: 03/15/2024] [Accepted: 03/16/2024] [Indexed: 04/28/2024]
Abstract
The pygmy grasshopper, Zhengitettix transpicula, is a Chinese endemic species with an exceedingly limited distribution and fragile population structure, rendering it vulnerable to extinction. We present a high-continuity, chromosome-scale reference genome assembly to elucidate this species' distinctive biology and inform conservation. Employing an integrated sequencing approach, we achieved a 970.40 Mb assembly with 96.32% coverage across seven pseudo-chromosomes and impressive continuity (N50 > 220 Mb). The genome annotation reaches 99.2% BUSCO completeness, supporting its quality. Comparative analyses with 14 Orthoptera genomes facilitated phylogenomics and revealed 549 significantly expanded gene families in Z. transpicula associated with metabolism, stress response, and development. However, genomic analysis exposed remarkably low heterozygosity (0.02%), implying a severe genetic bottleneck from small, fragmented populations, characteristic of species vulnerable to extinction from environmental disruptions. Elucidating the genetic basis of population dynamics and specialization provides an imperative guideline for habitat conservation and restoration of this rare organism. Moreover, divergent evolution analysis of the CYP305m2 gene regulating locust aggregation highlighted potential structural and hence functional variations between Acrididae and Tetrigidae. Our chromosomal genomic characterization of Z. transpicula advances Orthopteran resources, establishing a framework for evolutionary developmental explorations and applied conservation genomics, reversing the trajectory of this unique grasshopper lineage towards oblivion.
Affiliation(s)
- De-Long Guan
  - Key Laboratory of Ecology of Rare and Endangered Species and Environmental Protection, Guangxi Normal University, Ministry of Education, Guilin 541006, China
  - Guangxi Key Laboratory of Sericulture Ecology and Applied Intelligent Technology, School of Chemistry and Bioengineering, Hechi University, Hechi 546300, China
- Ya-Zhen Chen
  - Guangxi Key Laboratory of Sericulture Ecology and Applied Intelligent Technology, School of Chemistry and Bioengineering, Hechi University, Hechi 546300, China
- Ying-Can Qin
  - Key Laboratory of Ecology of Rare and Endangered Species and Environmental Protection, Guangxi Normal University, Ministry of Education, Guilin 541006, China
  - Guangxi Key Laboratory of Sericulture Ecology and Applied Intelligent Technology, School of Chemistry and Bioengineering, Hechi University, Hechi 546300, China
- Xiao-Dong Li
  - Key Laboratory of Ecology of Rare and Endangered Species and Environmental Protection, Guangxi Normal University, Ministry of Education, Guilin 541006, China
  - Guangxi Key Laboratory of Sericulture Ecology and Applied Intelligent Technology, School of Chemistry and Bioengineering, Hechi University, Hechi 546300, China
- Wei-An Deng
  - Key Laboratory of Ecology of Rare and Endangered Species and Environmental Protection, Guangxi Normal University, Ministry of Education, Guilin 541006, China
  - Guangxi Key Laboratory of Sericulture Ecology and Applied Intelligent Technology, School of Chemistry and Bioengineering, Hechi University, Hechi 546300, China
7
Liu X, Zhao L, Majid M, Huang Y. Orthoptera-TElib: a library of Orthoptera transposable elements for TE annotation. Mob DNA 2024; 15:5. [PMID: 38486291 PMCID: PMC10941475 DOI: 10.1186/s13100-024-00316-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/08/2023] [Accepted: 03/08/2024] [Indexed: 03/17/2024] Open
Abstract
Transposable elements (TEs) are a major component of eukaryotic genomes and are present in almost all eukaryotic organisms. TEs are highly dynamic between and within species, which significantly affects the general applicability of TE databases. Orthoptera is the only known group in the class Insecta with a significantly enlarged genome (0.93-21.48 Gb). When such large genomes are analyzed using the existing public TE databases, the efficiency of TE annotation is not satisfactory. To address this limitation, it is imperative to continually update the available TE resource libraries, and an Orthoptera-specific library is needed as more insect genomes become publicly available. Here, we used the complete genome data of 12 Orthoptera species to annotate TEs de novo, then manually re-annotated the unclassified TEs to construct a non-redundant Orthoptera-specific TE library: Orthoptera-TElib. Orthoptera-TElib contains 24,021 TE entries, including the re-annotated results of 13,964 unknown TEs. The naming of TE entries in Orthoptera-TElib follows the same convention as RepeatMasker and Dfam and is encoded in the three-level form "level1/level2-level3". Orthoptera-TElib can be used directly as an input reference database and is compatible with mainstream repetitive sequence analysis software such as RepeatMasker and dnaPipeTE. When analyzing TEs of Orthoptera species, Orthoptera-TElib yields better TE annotation than Dfam and Repbase, regardless of whether low-coverage sequencing or genome assembly data are used. The largest improvement is for Angaracris rhodopa, whose annotated TE content increased from 7.89% to 53.28% of the genome. Finally, Orthoptera-TElib is stored in SQLite3 for the convenience of data updates and user access.
Affiliation(s)
- Xuanzeng Liu
  - College of Life Sciences, Shaanxi Normal University, Xi'an, China
- Lina Zhao
  - College of Life Sciences, Shaanxi Normal University, Xi'an, China
- Muhammad Majid
  - College of Life Sciences, Shaanxi Normal University, Xi'an, China
- Yuan Huang
  - College of Life Sciences, Shaanxi Normal University, Xi'an, China
8
Al-Jawabreh R, Lastik D, McKenzie D, Reynolds K, Suleiman M, Mousley A, Atkinson L, Hunt V. Advancing Strongyloides omics data: bridging the gap with Caenorhabditis elegans. Philos Trans R Soc Lond B Biol Sci 2024; 379:20220437. [PMID: 38008117 PMCID: PMC10676819 DOI: 10.1098/rstb.2022.0437] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/27/2023] [Accepted: 08/31/2023] [Indexed: 11/28/2023] Open
Abstract
Among nematodes, the free-living model organism Caenorhabditis elegans boasts the most advanced portfolio of high-quality omics data. The resources available for parasitic nematodes, including Strongyloides spp., however, are lagging behind. While C. elegans remains the most tractable nematode and has significantly advanced our understanding of many facets of nematode biology, C. elegans is not suitable as a surrogate system for the study of parasitism and it is important that we improve the omics resources available for parasitic nematode species. Here, we review the omics data available for Strongyloides spp. and compare the available resources to those for C. elegans and other parasitic nematodes. The advancements in C. elegans omics offer a blueprint for improving omics-led research in Strongyloides. We suggest areas of priority for future research that will pave the way for expansions in omics resources and technologies. This article is part of the Theo Murphy meeting issue 'Strongyloides: omics to worm-free populations'.
Affiliation(s)
- Reem Al-Jawabreh
  - Department of Life Sciences, University of Bath, Bath, BA2 7AY, UK
- Dominika Lastik
  - Department of Life Sciences, University of Bath, Bath, BA2 7AY, UK
- Kieran Reynolds
  - Department of Life Sciences, University of Bath, Bath, BA2 7AY, UK
- Mona Suleiman
  - Department of Life Sciences, University of Bath, Bath, BA2 7AY, UK
- Vicky Hunt
  - Department of Life Sciences, University of Bath, Bath, BA2 7AY, UK
9
Feldmeyer B, Bornberg-Bauer E, Dohmen E, Fouks B, Heckenhauer J, Huylmans AK, Jones ARC, Stolle E, Harrison MC. Comparative Evolutionary Genomics in Insects. Methods Mol Biol 2024; 2802:473-514. [PMID: 38819569 DOI: 10.1007/978-1-0716-3838-5_16] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/01/2024]
Abstract
Genome sequencing quality, in terms of both read length and accuracy, is constantly improving. By combining long-read sequencing technologies with various scaffolding techniques, chromosome-level genome assemblies are now achievable at an affordable price for non-model organisms. Insects represent an exciting taxon for studying the genomic underpinnings of evolutionary innovations, due to ancient origins, immense species-richness, and broad phenotypic diversity. Here we summarize some of the most important methods for carrying out a comparative genomics study on insects. We describe available tools and offer concrete tips on all stages of such an endeavor from DNA extraction through genome sequencing, annotation, and several evolutionary analyses. Along the way we describe important insect-specific aspects, such as DNA extraction difficulties or gene families that are particularly difficult to annotate, and offer solutions. We describe results from several examples of comparative genomics analyses on insects to illustrate the fascinating questions that can now be addressed in this new age of genomics research.
Affiliation(s)
- Barbara Feldmeyer
  - Senckenberg Biodiversity and Climate Research Centre (SBiK-F), Molecular Ecology, Frankfurt, Germany
- Erich Bornberg-Bauer
  - Institute for Evolution and Biodiversity, University of Münster, Münster, Germany
  - Department of Protein Evolution, Max Planck Institute for Developmental Biology, Tübingen, Germany
- Elias Dohmen
  - Institute for Evolution and Biodiversity, University of Münster, Münster, Germany
- Bertrand Fouks
  - Institute for Evolution and Biodiversity, University of Münster, Münster, Germany
- Jacqueline Heckenhauer
  - LOEWE Centre for Translational Biodiversity Genomics (LOEWE-TBG), Frankfurt, Germany
  - Department of Terrestrial Zoology, Senckenberg Research Institute and Natural History Museum Frankfurt, Frankfurt, Germany
- Ann Kathrin Huylmans
  - Institute of Organismic and Molecular Evolution, Johannes Gutenberg University, Mainz, Germany
- Alun R C Jones
  - Institute for Evolution and Biodiversity, University of Münster, Münster, Germany
- Eckart Stolle
  - Museum Koenig, Leibniz Institute for the Analysis of Biodiversity Change (LIB), Bonn, Germany
- Mark C Harrison
  - Institute for Evolution and Biodiversity, University of Münster, Münster, Germany
10
Liao X, Zhu W, Zhou J, Li H, Xu X, Zhang B, Gao X. Repetitive DNA sequence detection and its role in the human genome. Commun Biol 2023; 6:954. [PMID: 37726397 PMCID: PMC10509279 DOI: 10.1038/s42003-023-05322-y] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 03/29/2023] [Accepted: 09/04/2023] [Indexed: 09/21/2023] Open
Abstract
Repetitive DNA sequences play critical roles in driving evolution, inducing variation, and regulating gene expression. In this review, we summarize the definition, arrangement, and structural characteristics of repeats. We also introduce the diverse biological functions of repeats and review existing methods for automatic repeat detection, classification, and masking. Finally, we analyze the type, structure, and regulation of repeats in the human genome and their role in the induction of complex diseases. We believe that this review will facilitate a comprehensive understanding of repeats and provide guidance for repeat annotation and in-depth exploration of their association with human diseases.
Affiliation(s)
- Xingyu Liao
  - Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal, 23955, Saudi Arabia
- Wufei Zhu
  - Department of Endocrinology, Yichang Central People's Hospital, The First College of Clinical Medical Science, China Three Gorges University, 443000, Yichang, P.R. China
- Juexiao Zhou
  - Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal, 23955, Saudi Arabia
- Haoyang Li
  - Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal, 23955, Saudi Arabia
- Xiaopeng Xu
  - Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal, 23955, Saudi Arabia
- Bin Zhang
  - Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal, 23955, Saudi Arabia
- Xin Gao
  - Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal, 23955, Saudi Arabia
11
Gao Y, Liao HB, Liu TH, Wu JM, Wang ZF, Cao HL. Draft genome and transcriptome of Nepenthes mirabilis, a carnivorous plant in China. BMC Genom Data 2023; 24:21. [PMID: 37060047 PMCID: PMC10103442 DOI: 10.1186/s12863-023-01126-5] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 10/11/2022] [Accepted: 04/06/2023] [Indexed: 04/16/2023] Open
Abstract
OBJECTIVES Nepenthes belongs to the monotypic family Nepenthaceae, one of the largest carnivorous plant families. Nepenthes species show impressive adaptive radiation and suffer from being overexploited in nature. Nepenthes mirabilis is the most widely distributed species and the only Nepenthes species that is naturally distributed within China. Herein, we report the genome and transcriptome assemblies of N. mirabilis. The assemblies will be useful resources for comparative genomics and for understanding the adaptation and conservation of carnivorous species. DATA DESCRIPTION This work produced ~139.5 Gb of N. mirabilis whole-genome sequencing reads from leaf tissues, and ~21.7 Gb and ~27.9 Gb of raw RNA-seq reads from its leaves and flowers, respectively. Transcriptome assembly yielded 339,802 transcripts, in which 79,758 open reading frames (ORFs) were identified. Functional analysis indicated that these ORFs were mainly associated with proteolysis and DNA integration. The assembled genome was 691,409,685 bp with 159,555 contigs/scaffolds and an N50 of 10,307 bp. The BUSCO assessment of the assembled genome and transcriptome indicated 91.1% and 93.7% completeness, respectively. A total of 42,961 genes were predicted in the genome, coding for 45,461 proteins. The predicted genes were annotated using multiple databases, facilitating future functional analyses. This is the first genome report on the Nepenthaceae family.
Affiliation(s)
- Yuan Gao
- Zhongshan Management Centre of the Natural Protected Area, Zhongshan, China
- Hao-Bin Liao
- Zhongshan Management Centre of the Natural Protected Area, Zhongshan, China
- Ting-Hong Liu
- Guangdong Provincial Key Laboratory of Applied Botany, Key Laboratory of Vegetation Restoration and Management of Degraded Ecosystems, South China Botanical Garden, Chinese Academy of Sciences, Guangzhou, China
- Jia-Ming Wu
- Zhongshan Management Centre of the Natural Protected Area, Zhongshan, China
- Zheng-Feng Wang
- Guangdong Provincial Key Laboratory of Applied Botany, Key Laboratory of Vegetation Restoration and Management of Degraded Ecosystems, South China Botanical Garden, Chinese Academy of Sciences, Guangzhou, China
- Hong-Lin Cao
- Guangdong Provincial Key Laboratory of Applied Botany, Key Laboratory of Vegetation Restoration and Management of Degraded Ecosystems, South China Botanical Garden, Chinese Academy of Sciences, Guangzhou, China
12
Orozco-Arias S, Humberto Lopez-Murillo L, Candamil-Cortés MS, Arias M, Jaimes PA, Rossi Paschoal A, Tabares-Soto R, Isaza G, Guyot R. Inpactor2: a software based on deep learning to identify and classify LTR-retrotransposons in plant genomes. Brief Bioinform 2022; 24:6887110. [PMID: 36502372 PMCID: PMC9851300 DOI: 10.1093/bib/bbac511] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Received: 08/01/2022] [Revised: 10/13/2022] [Accepted: 10/26/2022] [Indexed: 12/14/2022] Open
Abstract
LTR-retrotransposons are the most abundant repeat sequences in plant genomes and play an important role in evolution and biodiversity. Their characterization is of great importance to understand their dynamics. However, the identification and classification of these elements remains a challenge today. Moreover, current software can be relatively slow (from hours to days), sometimes involves a lot of manual work and does not reach satisfactory levels of precision and sensitivity. Here we present Inpactor2, an accurate and fast application that creates LTR-retrotransposon reference libraries in a very short time. Inpactor2 takes an assembled genome as input and follows a hybrid approach (deep learning and structure-based) to detect elements, filter partial sequences and finally classify intact sequences into superfamilies and, as very few tools do, into lineages. This tool takes advantage of multi-core and GPU architectures to decrease execution times. Using the rice genome, Inpactor2 showed a run time of 5 minutes (faster than other tools) and has the best accuracy and F1-Score of the tools tested here, also having the second best accuracy and specificity only surpassed by EDTA, but achieving 28% higher sensitivity. For large genomes, Inpactor2 is up to seven times faster than other available bioinformatics tools.
Affiliation(s)
- Simon Orozco-Arias
- Corresponding author. Computer Science Department, Universidad Autónoma de Manizales, Antigua Estación del Ferrocarril, Manizales, Colombia. Tel.: +57(606)8727272 - 8727709 Ext 102
- Maradey Arias
- Department of Computer Science, Universidad Autónoma de Manizales, 170001, Caldas, Colombia
- Paula A Jaimes
- Department of Computer Science, Universidad Autónoma de Manizales, 170001, Caldas, Colombia
- Alexandre Rossi Paschoal
- Corresponding author. Department of Computer Science, Bioinformatics and Pattern Recognition Group, Graduation Program in Bioinformatics, Federal University of Technology - Paraná, UTFPR, Cornélio Procópio, Paraná, 86300-000, Brazil. Tel.: +433133-3790
- Reinel Tabares-Soto
- Department of Electronics and Automation, Universidad Autónoma de Manizales, 170001, Caldas, Colombia
- Gustavo Isaza
- Corresponding author. Systems and Informatics Department, Center for Technology Development - Bioprocess and Agro-industry Plant, Universidad de Caldas, St 65 #26-10, Manizales, Colombia. Tel.: +57(606)8781500 ext 13146
- Romain Guyot
- Corresponding author. IRD, 911 Av. Agropolis, 34394 Montpellier, France. Tel.: +334674160000
13
Kim J, Park MJ, Shim D, Ryoo R. De novo genome assembly of the bioluminescent mushroom Omphalotus guepiniiformis reveals an Omphalotus-specific lineage of the luciferase gene block. Genomics 2022; 114:110514. [PMID: 36332840 DOI: 10.1016/j.ygeno.2022.110514] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 07/07/2022] [Revised: 10/04/2022] [Accepted: 10/23/2022] [Indexed: 11/05/2022]
Abstract
Omphalotus guepiniiformis, a bioluminescent mushroom species, is a source of a potentially valuable anticancer chemical. To provide genome information, we de novo assembled a high-quality O. guepiniiformis genome using two next-generation sequencing techniques, PacBio and Illumina sequencing. Our draft O. guepiniiformis genome comprises 42.5 Mbp of sequence in only 80 contigs, with an N50 sequence length of over 1 Mbp. There were 15,554 predicted coding genes, and 7693 genes were functionally annotated with Gene Ontology terms. We performed a genomic study focusing on mushroom bioluminescent pathway cluster genes by comparing 17 luminescent and 23 non-luminescent Agaricales species belonging to 23 genera. Synteny analysis of genomic regions near the luminescent pathway cluster genes inferred that the Omphalotus lineage was genus-specific. In summary, our de novo assembled O. guepiniiformis genome provides significant biological insights into this organism, including the evolution of the luciferase gene block, and forms the basis for future analyses.
Affiliation(s)
- Jaewook Kim
- Department of Biological Sciences, Chungnam National University, 34134 Daejeon, Republic of Korea
- Mi-Jeong Park
- Forest Microbiology Division, Department of Forest Bio-Resources, National Institute of Forest Science, 16631 Suwon, Republic of Korea
- Donghwan Shim
- Department of Biological Sciences, Chungnam National University, 34134 Daejeon, Republic of Korea
- Rhim Ryoo
- Forest Microbiology Division, Department of Forest Bio-Resources, National Institute of Forest Science, 16631 Suwon, Republic of Korea
14
Lexa M, Cechova M, Nguyen SH, Jedlicka P, Tokan V, Kubat Z, Hobza R, Kejnovsky E. HiC-TE: a computational pipeline for Hi-C data analysis to study the role of repeat family interactions in the genome 3D organization. Bioinformatics 2022; 38:4030-4032. [PMID: 35781332 DOI: 10.1093/bioinformatics/btac442] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/15/2022] [Revised: 06/14/2022] [Accepted: 06/30/2022] [Indexed: 12/24/2022] Open
Abstract
MOTIVATION The role of repetitive DNA in the 3D organization of the interphase nucleus is a subject of intensive study. In studies of 3D nucleus organization, mutual contacts of various loci can be identified by Hi-C sequencing. Typical analyses use binning of read pairs by location to reduce noise. We use binning by repeat families instead to make similar conclusions about repeat regions. RESULTS To achieve this, we combined Hi-C data, reference genome data and tools for repeat analysis into a Nextflow pipeline identifying and quantifying the contacts of specific repeat families. As an output, our pipeline produces heatmaps showing contact frequency and circular diagrams visualizing repeat contact localization. Using our pipeline with tomato data, we revealed the preferential homotypic interactions of ribosomal DNA, centromeric satellites and some LTR retrotransposon families and, as expected, little contact between organellar and nuclear DNA elements. While the pipeline can be applied to any eukaryotic genome, results in plants provide better coverage, since the built-in TE-greedy-nester software only detects tandems and LTR retrotransposons. Other repeats can be fed via GFF3 files. This pipeline represents a novel and reproducible way to analyze the role of repetitive elements in the 3D organization of genomes. AVAILABILITY AND IMPLEMENTATION https://gitlab.fi.muni.cz/lexa/hic-te/. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Matej Lexa
- Faculty of Informatics, Masaryk University, 60200 Brno, Czech Republic; Department of Plant Developmental Genetics, Institute of Biophysics of the Czech Academy of Sciences, 61200 Brno, Czech Republic
- Monika Cechova
- Faculty of Informatics, Masaryk University, 60200 Brno, Czech Republic
- Son Hoang Nguyen
- Faculty of Informatics, Masaryk University, 60200 Brno, Czech Republic
- Pavel Jedlicka
- Department of Plant Developmental Genetics, Institute of Biophysics of the Czech Academy of Sciences, 61200 Brno, Czech Republic
- Viktor Tokan
- Department of Plant Developmental Genetics, Institute of Biophysics of the Czech Academy of Sciences, 61200 Brno, Czech Republic
- Zdenek Kubat
- Department of Plant Developmental Genetics, Institute of Biophysics of the Czech Academy of Sciences, 61200 Brno, Czech Republic
- Roman Hobza
- Department of Plant Developmental Genetics, Institute of Biophysics of the Czech Academy of Sciences, 61200 Brno, Czech Republic
- Eduard Kejnovsky
- Department of Plant Developmental Genetics, Institute of Biophysics of the Czech Academy of Sciences, 61200 Brno, Czech Republic
15
Mobilome of Apicomplexa Parasites. Genes (Basel) 2022; 13:genes13050887. [PMID: 35627271 PMCID: PMC9141347 DOI: 10.3390/genes13050887] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/06/2022] [Revised: 05/02/2022] [Accepted: 05/14/2022] [Indexed: 02/04/2023] Open
Abstract
Transposable elements (TEs) are mobile genetic elements found in the majority of eukaryotic genomes. Genomic studies of protozoan parasites from the phylum Apicomplexa have only reported a handful of TEs in some species and a complete absence in others. Here, we studied sixty-four Apicomplexa genomes available in public databases, using a ‘de novo’ approach to build candidate TE models and multiple strategies from known TE sequence databases, pattern recognition of TEs, and protein domain databases, to identify possible TEs. We offer an insight into the distribution and the type of TEs that are present in these genomes, aiming to shed some light on the process of gains and losses of TEs in this phylum. We found that TEs comprise a very small portion in these genomes compared to other organisms, and in many cases, there are no apparent traces of TEs. We were able to build and classify 151 models from the TE consensus sequences obtained with RepeatModeler, 96 LTR TEs with LTRpred, and 44 LINE TEs with MGEScan. We found LTR Gypsy-like TEs in Eimeria, Gregarines, Haemoproteus, and Plasmodium genera. Additionally, we described LINE-like TEs in some species from the genera Babesia and Theileria. Finally, we confirmed the absence of TEs in the genus Cryptosporidium. Interestingly, Apicomplexa seem to be devoid of Class II transposons.