1
Ouedraogo WYDD, Ouangraoua A. SimSpliceEvol2: alternative splicing-aware simulation of biological sequence evolution and transcript phylogenies. BMC Bioinformatics 2024; 25:235. [PMID: 38992593 PMCID: PMC11238459 DOI: 10.1186/s12859-024-05853-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/26/2024] [Accepted: 07/02/2024] [Indexed: 07/13/2024]
Abstract
BACKGROUND: SimSpliceEvol is a tool for simulating the evolution of eukaryotic gene sequences that integrates exon-intron structure evolution as well as the evolution of the sets of transcripts produced from genes. It takes a guide gene tree as input and generates a gene sequence with its transcripts for each node of the tree, from the root to the leaves. However, the sets of transcripts simulated at different nodes of the guide gene tree lack evolutionary connections. Consequently, SimSpliceEvol is not suitable for evaluating methods for transcript phylogeny inference or gene phylogeny inference that rely on transcript conservation.
RESULTS: Here, we introduce SimSpliceEvol2, which, unlike the first version, incorporates an explicit model of transcript evolution for simulating alternative transcripts along the branches of a guide gene tree, as well as the resulting transcript phylogenies. We offer a comprehensive software package with a graphical user interface and an updated version of the web server, giving easy access to the tool.
CONCLUSION: SimSpliceEvol2 generates synthetic datasets that are useful for evaluating methods and tools for spliced RNA sequence analysis, such as spliced alignment methods, methods for identifying conserved transcripts, and transcript phylogeny reconstruction methods. The web server is accessible at https://simspliceevol.cobius.usherbrooke.ca, where the standalone software can also be downloaded. Comprehensive documentation for the software is available at the same address. The source code, which requires all prerequisites to be installed before it can be run, is provided for developers at https://github.com/UdeS-CoBIUS/SimSpliceEvol.
Affiliation(s)
- Wend Yam D D Ouedraogo
  - Department of Computer Science, Université de Sherbrooke, 2500 Bd de l'université, Sherbrooke, QC, J1K2R1, Canada
- Aida Ouangraoua
  - Department of Computer Science, Université de Sherbrooke, 2500 Bd de l'université, Sherbrooke, QC, J1K2R1, Canada
2
Ouedraogo WYDD, Ouangraoua A. Orthology and Paralogy Relationships at Transcript Level. J Comput Biol 2024; 31:277-293. [PMID: 38621191 DOI: 10.1089/cmb.2023.0400] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/17/2024]
Abstract
Eukaryotic genes undergo a mechanism called alternative processing, resulting in transcriptome diversity by allowing the production of multiple distinct transcripts from a gene. More than half of human genes are affected, and the resulting transcripts are highly conserved among orthologous genes of distinct species. In this work, we present the definition of orthology and paralogy between transcripts of homologous genes, together with an algorithm to compute clusters of conserved orthologous and paralogous transcripts. Gene-level homology relationships are utilized to define various types of homology relationships between transcripts originating from the same ancestral transcript. A Reciprocal Best Hits approach is employed to infer clusters of isoorthologous and recent paralogous transcripts. We applied this method to transcripts from simulated gene families as well as real gene families from the Ensembl-Compara database. The results are consistent with those from previous studies that compared orthologous gene transcripts. Furthermore, our findings provide evidence that searching for conserved transcripts between homologous genes, beyond the scope of orthologous genes, is likely to yield valuable information.
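The Reciprocal Best Hits step mentioned in this abstract can be illustrated with a minimal Python sketch. It assumes that a similarity score has already been computed for every pair of transcripts taken from two homologous genes. The function name, the score dictionary, and the transcript identifiers are hypothetical and are not taken from the paper, whose algorithm additionally uses gene-level homology to type the relationships.

```python
def reciprocal_best_hits(scores):
    """Return sorted pairs (a, b) such that b is the best hit of a and a is the best hit of b.

    scores: dict mapping (a, b) -> similarity, with a from gene A and b from gene B.
    Ties are broken in favour of the pair seen first (an arbitrary choice for this sketch).
    """
    best_for_a = {}  # transcript of gene A -> (best transcript of gene B, score)
    best_for_b = {}  # transcript of gene B -> (best transcript of gene A, score)
    for (a, b), s in scores.items():
        if a not in best_for_a or s > best_for_a[a][1]:
            best_for_a[a] = (b, s)
        if b not in best_for_b or s > best_for_b[b][1]:
            best_for_b[b] = (a, s)
    # Keep only the pairs whose best-hit relation holds in both directions.
    return sorted(
        (a, b) for a, (b, _) in best_for_a.items() if best_for_b[b][0] == a
    )


# Hypothetical similarity scores between transcripts of two homologous genes.
example_scores = {
    ("A.t1", "B.t1"): 0.95, ("A.t1", "B.t2"): 0.60,
    ("A.t2", "B.t1"): 0.70, ("A.t2", "B.t2"): 0.40,
}
example_pairs = reciprocal_best_hits(example_scores)
```

In this toy input only `A.t1` and `B.t1` pick each other, so `A.t2` and `B.t2` stay unpaired; how such leftover transcripts are classified is a matter for the paper's definitions, not for this sketch.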
Affiliation(s)
- Aida Ouangraoua
  - Department of Computer Science, Université de Sherbrooke, Sherbrooke, Quebec, Canada
3
Jammali S, Djossou A, Ouédraogo WYDD, Nevers Y, Chegrane I, Ouangraoua A. From pairwise to multiple spliced alignment. Bioinformatics Advances 2022; 2:vbab044. [PMID: 36699392 PMCID: PMC9710695 DOI: 10.1093/bioadv/vbab044] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/24/2021] [Revised: 11/25/2021] [Indexed: 01/28/2023]
Abstract
Motivation: Alternative splicing is a ubiquitous process in eukaryotes that allows distinct transcripts to be produced from the same gene. Yet, the study of transcript evolution within a gene family is still in its infancy. One prerequisite for this study is the availability of methods to compare sets of transcripts while accounting for their splicing structure. In this context, we generalize the concept of pairwise spliced alignments (PSpAs) to multiple spliced alignments (MSpAs). MSpAs have several important purposes in addition to empowering the study of the evolution of transcripts. For instance, they are key to improving the prediction of gene models, which is important for addressing the growing problem of genome annotation. Despite their importance, a formal definition of the concept and methods to compute MSpAs are still lacking.
Results: We introduce the MSpA problem and the SplicedFamAlignMulti (SFAM) method to compute the MSpA of a gene family. Like most multiple sequence alignment (MSA) methods, which are generally greedy heuristics that assemble pairwise alignments, SFAM combines all PSpAs of coding DNA sequences and gene sequences of a gene family into an MSpA. It produces a single structure that represents the superstructure and models of the gene family. Using real vertebrate and simulated gene family data, we illustrate the utility of SFAM for computing accurate gene family superstructures and MSAs, inferring splicing orthologous groups, and improving gene-model annotations.
Availability and implementation: The supporting data and implementation of SFAM are freely available at https://github.com/UdeS-CoBIUS/SpliceFamAlignMulti.
Supplementary information: Supplementary data are available at Bioinformatics Advances online.
Affiliation(s)
- Safa Jammali
  - Département d’informatique, Faculté des Sciences, Université de Sherbrooke, 2500, boul. de l'Université, Sherbrooke (Québec) J1K 2R1, Canada; Département de Biochimie et de Génomique Fonctionnelle, Faculté de Médecine et des Sciences de la santé, Université de Sherbrooke, 3001, 12e avenue Nord, Sherbrooke (Québec) J1H 5N4, Canada
- Abigaïl Djossou
  - Département d’informatique, Faculté des Sciences, Université de Sherbrooke, 2500, boul. de l'Université, Sherbrooke (Québec) J1K 2R1, Canada
- Wend-Yam D D Ouédraogo
  - Département d’informatique, Faculté des Sciences, Université de Sherbrooke, 2500, boul. de l'Université, Sherbrooke (Québec) J1K 2R1, Canada
- Yannis Nevers
  - Swiss Institute of Bioinformatics, Lausanne 1015, Switzerland; Department of Computational Biology, University of Lausanne, Lausanne 1015, Switzerland; Center for Integrative Genomics, University of Lausanne, Lausanne 1015, Switzerland
- Ibrahim Chegrane
  - Département d’informatique, Faculté des Sciences, Université de Sherbrooke, 2500, boul. de l'Université, Sherbrooke (Québec) J1K 2R1, Canada
- Aïda Ouangraoua
  - Département d’informatique, Faculté des Sciences, Université de Sherbrooke, 2500, boul. de l'Université, Sherbrooke (Québec) J1K 2R1, Canada. To whom correspondence should be addressed.
4
Ait-Hamlat A, Zea DJ, Labeeuw A, Polit L, Richard H, Laine E. Transcripts' Evolutionary History and Structural Dynamics Give Mechanistic Insights into the Functional Diversity of the JNK Family. J Mol Biol 2020; 432:2121-2140. [PMID: 32067951 DOI: 10.1016/j.jmb.2020.01.032] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 08/26/2019] [Revised: 01/03/2020] [Accepted: 01/28/2020] [Indexed: 12/14/2022]
Abstract
Alternative splicing and alternative initiation/termination transcription sites have the potential to greatly expand the proteome in eukaryotes by producing several transcript isoforms from the same gene. Although these mechanisms are well described at the genomic level, little is known about their contribution to protein evolution and their impact at the protein structure level. Here, we address both issues by reconstructing the evolutionary history of transcripts and by modeling the tertiary structures of the corresponding protein isoforms. We reconstruct phylogenetic forests relating 60 protein-coding transcripts from the c-Jun N-terminal kinase (JNK) family observed in seven species. We identify two alternative splicing events of ancient origin and show that they induce subtle changes in the protein's structural dynamics. We highlight a previously uncharacterized transcript whose predicted structure seems stable in solution. We further demonstrate that orphan transcripts, for which no phylogeny could be reconstructed, display peculiar sequence and structural properties. Our approach is implemented in PhyloSofS (Phylogenies of Splicing Isoforms Structures), a fully automated computational tool freely available at https://github.com/PhyloSofS-Team/PhyloSofS.
Affiliation(s)
- Adel Ait-Hamlat
  - Sorbonne Université, CNRS, IBPS, Laboratoire de Biologie Computationnelle et Quantitative (LCQB), Paris, 75005, France
- Diego Javier Zea
  - Sorbonne Université, CNRS, IBPS, Laboratoire de Biologie Computationnelle et Quantitative (LCQB), Paris, 75005, France
- Antoine Labeeuw
  - Sorbonne Université, CNRS, IBPS, Laboratoire de Biologie Computationnelle et Quantitative (LCQB), Paris, 75005, France
- Lélia Polit
  - Sorbonne Université, CNRS, IBPS, Laboratoire de Biologie Computationnelle et Quantitative (LCQB), Paris, 75005, France
- Hugues Richard
  - Sorbonne Université, CNRS, IBPS, Laboratoire de Biologie Computationnelle et Quantitative (LCQB), Paris, 75005, France
- Elodie Laine
  - Sorbonne Université, CNRS, IBPS, Laboratoire de Biologie Computationnelle et Quantitative (LCQB), Paris, 75005, France
5
Kuitche E, Jammali S, Ouangraoua A. SimSpliceEvol: alternative splicing-aware simulation of biological sequence evolution. BMC Bioinformatics 2019; 20:640. [PMID: 31842741 PMCID: PMC6916212 DOI: 10.1186/s12859-019-3207-5] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 01/01/2023]
Abstract
Background: It is now well established that eukaryotic coding genes can produce more than one type of transcript through the mechanisms of alternative splicing and alternative transcription. Because gold-standard real data on alternative splicing are lacking, simulated data are a good option for evaluating the accuracy and efficiency of methods developed for splice-aware sequence analysis. However, existing sequence evolution simulation methods do not model alternative splicing, so they cannot be used to test spliced sequence analysis methods.
Results: We propose a new method called SimSpliceEvol for simulating the evolution of sets of alternative transcripts along the branches of an input gene tree. In addition to traditional sequence evolution events, the simulation also includes gene exon-intron structure evolution events and alternative splicing events that modify the sets of transcripts produced from genes. SimSpliceEvol was implemented in Python. The source code is freely available at https://github.com/UdeS-CoBIUS/SimSpliceEvol.
Conclusions: Data generated using SimSpliceEvol are useful for testing spliced RNA sequence analysis methods, such as methods for spliced alignment of cDNA and genomic sequences, multiple cDNA alignment, orthologous exon identification, splicing orthology inference, and transcript phylogeny inference, all of which require knowledge of the real evolutionary relationships between the sequences.
Affiliation(s)
- Esaie Kuitche
  - Department of Computer Science, University of Sherbrooke, 2500 Boulevard de l'Université, Quebec, J1K2R1, Canada
- Safa Jammali
  - Department of Computer Science, University of Sherbrooke, 2500 Boulevard de l'Université, Quebec, J1K2R1, Canada; Department of Biochemistry, University of Sherbrooke, 3001 12e avenue Nord, Quebec, J1H5N4, Canada
- Aïda Ouangraoua
  - Department of Computer Science, University of Sherbrooke, 2500 Boulevard de l'Université, Quebec, J1K2R1, Canada
6
Jammali S, Kuitche E, Rachati A, Bélanger F, Scott M, Ouangraoua A. Aligning coding sequences with frameshift extension penalties. Algorithms Mol Biol 2017; 12:10. [PMID: 28373895 PMCID: PMC5374649 DOI: 10.1186/s13015-017-0101-4] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 10/11/2016] [Accepted: 03/18/2017] [Indexed: 11/15/2022]
Abstract
BACKGROUND: Frameshift translation is an important phenomenon that contributes to the appearance of novel coding DNA sequences (CDS) and functions in gene evolution, by allowing alternative amino acid translations of gene coding regions. Frameshift translations can be identified by aligning two CDS, from the same gene or from homologous genes, while accounting for their codon structure. Two main classes of algorithms have been proposed to solve the problem of aligning CDS: back-translation of an amino acid sequence alignment, or simultaneous accounting for the nucleotide and amino acid levels. The former cannot account for frameshift translations and, up to now, the latter accounts exclusively for frameshift translation initiation, without considering the length of the translation disruption caused by a frameshift.
RESULTS: We introduce a new scoring scheme with an algorithm for the pairwise alignment of CDS that accounts for frameshift translation initiation and length, while simultaneously considering nucleotide and amino acid sequences. The main specificity of the scoring scheme is the introduction of a penalty cost accounting for frameshift extension length to compute an adequate similarity score for a CDS alignment. The second specificity of the model is that the search space of the problem solved is the set of all feasible alignments between two CDS. Previous approaches have considered a restricted search space or additional constraints on the decomposition of an alignment into length-3 sub-alignments. The algorithm described in this paper has the same asymptotic time complexity as the classical Needleman-Wunsch algorithm.
CONCLUSIONS: We compare the method to other CDS alignment methods based on an application to the comparison of pairs of CDS from homologous human, mouse and cow genes of ten mammalian gene families from the Ensembl-Compara database. The results show that our method is particularly robust to parameter changes as compared to existing methods. It also appears to be a good compromise, performing well both in the presence and absence of frameshift translations. An implementation of the method is available at https://github.com/UdeS-CoBIUS/FsePSA.
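For reference, the classical Needleman-Wunsch recurrence that this abstract uses as its complexity baseline can be sketched as follows. This is only the textbook nucleotide-level global alignment score, which runs in time proportional to the product of the two sequence lengths. The function name and the match, mismatch and gap values are assumptions chosen for illustration, and the sketch does not implement the frameshift-aware scoring scheme of FsePSA.

```python
def needleman_wunsch_score(x, y, match=1, mismatch=-1, gap=-2):
    """Global alignment score of strings x and y with a linear gap penalty.

    Only two rows of the dynamic programming table are kept in memory,
    since each cell depends on the previous row and the current row only.
    """
    n, m = len(x), len(y)
    # Row 0: aligning the empty prefix of x against prefixes of y costs j gaps.
    prev = [j * gap for j in range(m + 1)]
    for i in range(1, n + 1):
        # Column 0: aligning a prefix of x against the empty prefix of y.
        curr = [i * gap] + [0] * m
        for j in range(1, m + 1):
            sub = match if x[i - 1] == y[j - 1] else mismatch
            curr[j] = max(
                prev[j - 1] + sub,  # align x[i-1] with y[j-1]
                prev[j] + gap,      # gap in y
                curr[j - 1] + gap,  # gap in x
            )
        prev = curr
    return prev[m]


# Toy sequences: deleting the middle nucleotide costs one gap.
example_score = needleman_wunsch_score("ATG", "AG")
```

A codon-aware method such as the one summarised above would score over both nucleotides and amino acids; the two nested loops here only show where the quadratic time bound comes from.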
Affiliation(s)
- Safa Jammali
  - Département d’informatique, Faculté des Sciences, Université de Sherbrooke, Sherbrooke, QC J1K2R1 Canada
- Esaie Kuitche
  - Département d’informatique, Faculté des Sciences, Université de Sherbrooke, Sherbrooke, QC J1K2R1 Canada
- Ayoub Rachati
  - Département d’informatique, Faculté des Sciences, Université de Sherbrooke, Sherbrooke, QC J1K2R1 Canada
- François Bélanger
  - Département d’informatique, Faculté des Sciences, Université de Sherbrooke, Sherbrooke, QC J1K2R1 Canada
- Michelle Scott
  - Département de biochimie, Faculté de médecine et des sciences de la santé, Université de Sherbrooke, Sherbrooke, QC J1E4K8 Canada
- Aïda Ouangraoua
  - Département d’informatique, Faculté des Sciences, Université de Sherbrooke, Sherbrooke, QC J1K2R1 Canada