1. Giraud D, Choisne N, Summo M, Sidibe-Bocs S, Vassilieff H, Costantino G, Droc G, Teycheney PY, Maumus F, Ollitrault P, Luro F. Construction of a comprehensive library of repeated sequences for the annotation of Citrus genomes. BMC Genom Data 2025; 26:30. [PMID: 40247189 PMCID: PMC12007355 DOI: 10.1186/s12863-025-01321-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/03/2025] [Accepted: 04/09/2025] [Indexed: 04/19/2025] [Open Access]
Abstract
BACKGROUND: The comprehensive annotation of repeated sequences in genomes is an essential prerequisite for studying the dynamics of these sequences over time and their involvement in gene regulation. Currently, the diversity of repeated sequences in Citrus genomes is only partially characterized because the annotations have been performed using heterogeneous bioinformatics tools, each with its specificity and dedicated only to the annotation of transposable elements.
RESULTS: We combined complementary repeat-finding programs, including REPET, CAULIFINDER, and TAREAN, to enable the identification of all types of repetitive sequences found in plant genomes, including transposable elements, endogenous caulimovirids, and satellite DNAs. A fine-grained annotation method was first developed to create a consensus sequence library of repeated sequences identified in the genome assemblies of C. medica, C. micrantha, C. reticulata, and C. maxima, the four ancestral parental species involved in the formation of economically valuable cultivated Citrus varieties. A second, faster annotation method was developed to enrich the dataset by adding new repeated sequences retrieved from genome assemblies of other Citrus species and closely related species belonging to the Aurantioideae subfamily. The final reference library contains 3,091 consensus sequences, of which 94.5% are transposable elements. The diversity of endogenous caulimovirids was characterized for the first time within the genus Citrus, contributing 160 consensus sequences to the final reference library. Finally, 10 satellite DNAs were also identified.
CONCLUSION: Combining multiple repeat detection methods enables the comprehensive annotation of all repeated sequences in Citrus genomes. Using the final reference library reported in this work will improve our understanding of the dynamics of repeated sequences during Citrus speciation, particularly following the genome duplication and hybridization events that led to modern cultivars. The exploration of repeat insertion positions along chromosomes using the developed web interface, RepeatLoc Citrus, will also make it possible to further investigate the role of transposable elements and endogenous caulimovirids in genome structure and gene regulation in Citrus species.
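The library composition reported in this abstract can be cross-checked with simple arithmetic. The sketch below assumes (this is not stated explicitly in the abstract) that every consensus sequence belongs to exactly one of the three named classes; the variable names are illustrative only.

```python
# Counts stated in the abstract.
total_consensus = 3091
caulimovirid_consensus = 160
satellite_consensus = 10

# ASSUMPTION: the three classes (transposable elements, endogenous
# caulimovirids, satellite DNAs) partition the library with no overlap.
te_consensus = total_consensus - caulimovirid_consensus - satellite_consensus
te_percent = 100 * te_consensus / total_consensus

print(te_consensus, round(te_percent, 1))
```

Under that assumption, the remaining sequences account for the 94.5% transposable-element share quoted in the abstract, which suggests the three reported figures are mutually consistent.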
Affiliation(s)
- Delphine Giraud
  - UR AGAP Corse, INRAE, Institut Agro, CIRAD, University of Montpellier, San Giuliano, F-20230, France
- Nathalie Choisne
  - URGI, INRAE, Université Paris-Saclay, Versailles, F-78026, France
- Marilyne Summo
  - UMR AGAP, CIRAD, Institut Agro, INRAE, University of Montpellier, Montpellier, F-34060, France
  - UMR AGAP, CIRAD, Montpellier, F-34398, France
- Stéphanie Sidibe-Bocs
  - UMR AGAP, CIRAD, Institut Agro, INRAE, University of Montpellier, Montpellier, F-34060, France
  - UMR AGAP, CIRAD, Montpellier, F-34398, France
- Gilles Costantino
  - UR AGAP Corse, INRAE, Institut Agro, CIRAD, University of Montpellier, San Giuliano, F-20230, France
- Gaetan Droc
  - UMR AGAP, CIRAD, Institut Agro, INRAE, University of Montpellier, Montpellier, F-34060, France
  - UMR AGAP, CIRAD, Montpellier, F-34398, France
- Pierre-Yves Teycheney
  - CIRAD, UMR PVBMT, Saint Pierre, La Réunion, F-97410, France
  - UMR PVBMT, Université de la Réunion, Saint-Pierre de La Réunion, F-97410, France
- Florian Maumus
  - URGI, INRAE, Université Paris-Saclay, Versailles, F-78026, France
- Patrick Ollitrault
  - UMR AGAP, CIRAD, Institut Agro, INRAE, University of Montpellier, Montpellier, F-34060, France
  - UMR AGAP, CIRAD, Montpellier, F-34398, France
- François Luro
  - UR AGAP Corse, INRAE, Institut Agro, CIRAD, University of Montpellier, San Giuliano, F-20230, France
2. Hurgobin B. Annotation of Protein-Coding Genes in Plant Genomes. Methods Mol Biol 2022; 2443:309-326. [PMID: 35037214 DOI: 10.1007/978-1-0716-2067-0_17] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/14/2023]
Abstract
Advances in next-generation sequencing technologies and lower sequencing costs are paving the way to more plant genome sequencing, assembly, and annotation projects. While genome assembly is the first step toward elucidating the genome structure of a species, it is the annotation of the protein-coding genes that provides meaningful information to biologists. However, genome annotation is not a trivial task. Therefore, the aim of this chapter is to provide a detailed view of this important process, including tools and commands that can be used to carry it out.
Affiliation(s)
- Bhavna Hurgobin
  - La Trobe Institute for Agriculture and Food, Department of Animal, Plant and Soil Sciences, School of Life Sciences, AgriBio Building, La Trobe University, Bundoora, VIC, Australia
  - Australian Research Council Research Hub for Medicinal Agriculture, AgriBio Building, La Trobe University, Bundoora, VIC, Australia
3. Liao X, Li M, Hu K, Wu FX, Gao X, Wang J. A sensitive repeat identification framework based on short and long reads. Nucleic Acids Res 2021; 49:e100. [PMID: 34214175 PMCID: PMC8464074 DOI: 10.1093/nar/gkab563] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 03/07/2020] [Revised: 06/08/2021] [Accepted: 06/18/2021] [Indexed: 12/11/2022] [Open Access]
Abstract
Numerous studies have shown that repetitive regions in genomes play indispensable roles in the evolution, inheritance and variation of living organisms. However, most existing methods cannot achieve satisfactory performance on identifying repeats in terms of both accuracy and size, since NGS reads are too short to identify long repeats whereas SMS (Single Molecule Sequencing) long reads have high error rates. In this study, we present a novel identification framework, LongRepMarker, based on the global de novo assembly and k-mer based multiple sequence alignment for precisely marking long repeats in genomes. The major characteristics of LongRepMarker are as follows: (i) by introducing barcode linked reads and SMS long reads to assist the assembly of all short paired-end reads, it can identify the repeats to a greater extent; (ii) by finding the overlap sequences between assemblies or chromosomes, it locates the repeats faster and more accurately; (iii) by using the multi-alignment unique k-mers rather than the high frequency k-mers to identify repeats in overlap sequences, it can obtain the repeats more comprehensively and stably; (iv) by applying the parallel alignment model based on the multi-alignment unique k-mers, the efficiency of data processing can be greatly optimized; and (v) by taking the corresponding identification strategies, structural variations that occur between repeats can be identified. Comprehensive experimental results show that LongRepMarker can achieve more satisfactory results than the existing de novo detection methods (https://github.com/BioinformaticsCSU/LongRepMarker).
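For orientation, the "high frequency k-mers" idea that this abstract contrasts with its own approach can be sketched in a few lines. This is not LongRepMarker's algorithm or code; it is only a minimal illustration of flagging k-mers that occur at least a chosen number of times, and the function name, `k`, and `min_count` are hypothetical choices made for the example.

```python
from collections import Counter


def high_frequency_kmers(sequence, k, min_count):
    """Return the k-mers that occur at least min_count times in sequence."""
    # Slide a window of width k over the sequence and tally each k-mer.
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    # Keep only the k-mers whose count reaches the (assumed) threshold.
    return {kmer: n for kmer, n in counts.items() if n >= min_count}


repeats = high_frequency_kmers("ATATATGCC", k=2, min_count=2)
print(repeats)
```

In this toy sequence only the k-mers belonging to the `AT` repeat reach the threshold, while the unique tail does not.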
Affiliation(s)
- Xingyu Liao
  - Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha 410083, P.R. China
  - Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955, Saudi Arabia
- Min Li
  - Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha 410083, P.R. China
- Kang Hu
  - Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha 410083, P.R. China
- Fang-Xiang Wu
  - Department of Mechanical Engineering and Division of Biomedical Engineering, University of Saskatchewan, Saskatoon, SK S7N5A9, Canada
- Xin Gao
  - Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955, Saudi Arabia
- Jianxin Wang
  - Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha 410083, P.R. China
4. Orozco-Arias S, Candamil-Cortés MS, Jaimes PA, Piña JS, Tabares-Soto R, Guyot R, Isaza G. K-mer-based machine learning method to classify LTR-retrotransposons in plant genomes. PeerJ 2021; 9:e11456. [PMID: 34055489 PMCID: PMC8140598 DOI: 10.7717/peerj.11456] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Received: 02/17/2021] [Accepted: 04/24/2021] [Indexed: 12/15/2022] [Open Access]
Abstract
Every day more plant genomes are available in public databases, and additional massive sequencing projects (i.e., projects that aim to sequence thousands of individuals) are formulated and released. Nevertheless, there are not enough automatic tools to analyze this large amount of genomic information. LTR retrotransposons are the most frequent repetitive sequences in plant genomes; however, their detection and classification are commonly performed using semi-automatic and time-consuming programs. Despite the availability of several bioinformatic tools that follow different approaches to detect and classify them, none of these tools can individually obtain accurate results. Here, we used Machine Learning algorithms based on k-mer counts to distinguish LTR retrotransposons from other genomic sequences and to classify them into lineages/families with an F1-Score of 95%, contributing to the development of an alignment-free, automatic method to analyze these sequences.
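The k-mer count features mentioned in this abstract can be illustrated with a short sketch. It shows only how a DNA sequence might be turned into a fixed-length count vector suitable as classifier input; the function name, the value of `k`, and the lexicographic feature ordering are assumptions made for illustration, and the authors' actual pipeline and learning algorithms are not reproduced here.

```python
from collections import Counter
from itertools import product


def kmer_counts(sequence, k=3):
    """Return counts for all 4**k DNA k-mers in a fixed lexicographic order."""
    sequence = sequence.upper()
    # Tally every overlapping window of width k.
    observed = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    # Emit one feature per possible k-mer over A, C, G, T; windows containing
    # other symbols (e.g. N) are simply not represented in the vector.
    return [observed["".join(p)] for p in product("ACGT", repeat=k)]


features = kmer_counts("ACGTACGT", k=2)
print(len(features), sum(features))
```

Because every sequence maps to a vector of the same length regardless of its own length, such features can be fed to a classifier without any alignment step, which is the property the abstract emphasizes.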
Affiliation(s)
- Simon Orozco-Arias
  - Department of Computer Science, Universidad Autónoma de Manizales, Manizales, Caldas, Colombia
  - Department of Systems and Informatics, Universidad de Caldas, Manizales, Caldas, Colombia
- Paula A Jaimes
  - Department of Computer Science, Universidad Autónoma de Manizales, Manizales, Caldas, Colombia
- Johan S Piña
  - Department of Computer Science, Universidad Autónoma de Manizales, Manizales, Caldas, Colombia
- Reinel Tabares-Soto
  - Department of Electronics and Automation, Universidad Autónoma de Manizales, Manizales, Caldas, Colombia
- Romain Guyot
  - Department of Electronics and Automation, Universidad Autónoma de Manizales, Manizales, Caldas, Colombia
  - Institut de Recherche pour le Développement, CIRAD, Univ. Montpellier, Montpellier, France
- Gustavo Isaza
  - Department of Systems and Informatics, Universidad de Caldas, Manizales, Caldas, Colombia
5. Nesbit JB, Schein CH, Braun BA, Gipson SAY, Cheng H, Hurlburt BK, Maleki SJ. Epitopes with similar physicochemical properties contribute to cross reactivity between peanut and tree nuts. Mol Immunol 2020; 122:223-231. [PMID: 32442779 DOI: 10.1016/j.molimm.2020.03.017] [Citation(s) in RCA: 20] [Impact Index Per Article: 4.0] [Received: 10/16/2019] [Revised: 02/11/2020] [Accepted: 03/26/2020] [Indexed: 12/28/2022]
Abstract
Many individuals with peanut (PN) allergy have severe reactions to tree nuts (TN) such as walnuts or cashews. Although allergenic proteins in TN and PN have overall low identity, they share discrete sequences similar in physicochemical properties (PCP) to known IgE epitopes. Here, PCP-consensus peptides (cp, 13 aa and 31 aa) were identified from an alignment of epitope-rich regions of the walnut vicilin Jug r 2 leader sequence (J2LS) and cross-reactive epitopes in the 2S albumins of peanut, and were then synthesized. A peptide similarity search in the Structural Database of Allergenic Proteins (SDAP) revealed a network of peptides similar (low property distance, PD) to the 13 aa cp (13cp) in many different plant allergens. Peptides similar to the 13cp in PN and TN allergens bound IgE from sera of patients allergic to PN and TN in peptide microarray analysis. The 13cp was used to produce a rabbit consensus peptide antibody (cpAB) that detected proteins containing repeats similar to the 13cp in western blots of various nut extracts, in which reactive proteins were identified by mass spectrometry. The cpAB bound more specifically to allergens and nut extracts containing multiple repeats similar to the 13cp, such as almond (Pru du 6), peanut (Ara h 2) and walnut (Jug r 2). IgE binding to various nut extracts is inhibited by recombinant J2LS sequence and synthetic 31cp. Thus, several repeated sequences similar to the 13cp are bound by IgE. Multiple similar repeats in several allergens could account for reaction severity and clinically relevant cross-reactivity to PN and TN. These findings may help improve detection, diagnostic, and therapeutic tools.
Affiliation(s)
- Jacqueline B Nesbit
  - Dept of Agriculture-Agricultural Research Service-Southern Regional Research Center (USDA-ARS-SRRC), New Orleans, LA, United States
- Catherine H Schein
  - Department of Biochemistry and Molecular Biology, Institute for Human Infection and Immunity, University of Texas Medical Branch at Galveston (UTMB), TX, United States
- Benjamin A Braun
  - Department of Computer Science, Stanford University, United States
- Stephen A Y Gipson
  - Dept of Agriculture-Agricultural Research Service-Southern Regional Research Center (USDA-ARS-SRRC), New Orleans, LA, United States
- Hsiaopo Cheng
  - Dept of Agriculture-Agricultural Research Service-Southern Regional Research Center (USDA-ARS-SRRC), New Orleans, LA, United States
- Barry K Hurlburt
  - Dept of Agriculture-Agricultural Research Service-Southern Regional Research Center (USDA-ARS-SRRC), New Orleans, LA, United States
- Soheila J Maleki
  - Dept of Agriculture-Agricultural Research Service-Southern Regional Research Center (USDA-ARS-SRRC), New Orleans, LA, United States
6. Orozco-Arias S, Isaza G, Guyot R. Retrotransposons in Plant Genomes: Structure, Identification, and Classification through Bioinformatics and Machine Learning. Int J Mol Sci 2019; 20:E3837. [PMID: 31390781 PMCID: PMC6696364 DOI: 10.3390/ijms20153837] [Citation(s) in RCA: 47] [Impact Index Per Article: 7.8] [Received: 06/21/2019] [Revised: 07/31/2019] [Accepted: 08/02/2019] [Indexed: 01/26/2023] [Open Access]
Abstract
Transposable elements (TEs) are genomic units able to move within the genome of virtually all organisms. Due to their natural repetitive numbers and their high structural diversity, the identification and classification of TEs remain a challenge in sequenced genomes. Although TEs were initially regarded as "junk DNA", it has been demonstrated that they play key roles in chromosome structures, gene expression, and regulation, as well as adaptation and evolution. A highly reliable annotation of these elements is, therefore, crucial to better understand genome functions and their evolution. To date, much bioinformatics software has been developed to address TE detection and classification processes, but many problematic aspects remain, such as the reliability, precision, and speed of the analyses. Machine learning and deep learning are algorithms that can make automatic predictions and decisions in a wide variety of scientific applications. They have been tested in bioinformatics and, more specifically, for TE classification, with encouraging results. In this review, we will discuss important aspects of TEs, such as their structure, importance in the evolution and architecture of the host, and their current classifications and nomenclatures. We will also address current methods and their limitations in identifying and classifying TEs.
Affiliation(s)
- Simon Orozco-Arias
  - Department of Computer Science, Universidad Autónoma de Manizales, Manizales 170001, Colombia
  - Department of Systems and Informatics, Universidad de Caldas, Manizales 170001, Colombia
- Gustavo Isaza
  - Department of Systems and Informatics, Universidad de Caldas, Manizales 170001, Colombia
- Romain Guyot
  - Department of Electronics and Automatization, Universidad Autónoma de Manizales, Manizales 170001, Colombia
  - Institut de Recherche pour le Développement, CIRAD, University Montpellier, 34000 Montpellier, France
7. Pérez-Wohlfeil E, Diaz-Del-Pino S, Trelles O. Ultra-fast genome comparison for large-scale genomic experiments. Sci Rep 2019; 9:10274. [PMID: 31312019 PMCID: PMC6635410 DOI: 10.1038/s41598-019-46773-w] [Citation(s) in RCA: 22] [Impact Index Per Article: 3.7] [Received: 02/11/2019] [Accepted: 06/07/2019] [Indexed: 01/23/2023] [Open Access]
Abstract
In the last decade, a technological shift in the bioinformatics field has occurred: larger genomes can now be sequenced quickly and cost-effectively, resulting in the computational need to efficiently compare large and abundant sequences. Furthermore, detecting conserved similarities across large collections of genomes remains a problem. The size of chromosomes, along with the substantial amount of noise and number of repeats found in DNA sequences (particularly in mammals and plants), leads to a scenario where executing and waiting for complete outputs is both time- and resource-consuming. Filtering steps, manual examination and annotation, very long execution times and a high demand for computational resources represent a few of the many difficulties faced in large genome comparisons. In this work, we provide a method designed for comparisons of considerable amounts of very long sequences that employs a heuristic algorithm capable of separating noise and repeats from conserved fragments in pairwise genomic comparisons. We provide a software implementation that computes in linear time using one core as a minimum and a small, constant memory footprint. The method produces both a previsualization of the comparison and a collection of indices to drastically reduce computational complexity when performing exhaustive comparisons. Last, the method scores the comparison to automate classification of sequences and produces a list of detected synteny blocks to enable new evolutionary studies.
Affiliation(s)
- Esteban Pérez-Wohlfeil
  - Computer Architecture Department, University of Málaga - Instituto de Investigación Biomédica de Málaga-IBIMA, Málaga, Spain
- Sergio Diaz-Del-Pino
  - Computer Architecture Department, University of Málaga - Instituto de Investigación Biomédica de Málaga-IBIMA, Málaga, Spain
- Oswaldo Trelles
  - Computer Architecture Department, University of Málaga - Instituto de Investigación Biomédica de Málaga-IBIMA, Málaga, Spain