1
Lee H, Ozbulak U, Park H, Depuydt S, De Neve W, Vankerschaver J. Assessing the reliability of point mutation as data augmentation for deep learning with genomic data. BMC Bioinformatics 2024; 25:170. [PMID: 38689247 PMCID: PMC11059627 DOI: 10.1186/s12859-024-05787-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/03/2024] [Accepted: 04/15/2024] [Indexed: 05/02/2024] Open
Abstract
BACKGROUND Deep neural networks (DNNs) have the potential to revolutionize our understanding and treatment of genetic diseases. An inherent limitation of deep neural networks, however, is their high demand for data during training. To overcome this challenge, other fields, such as computer vision, use various data augmentation techniques to artificially increase the available training data for DNNs. Unfortunately, most data augmentation techniques used in other domains do not transfer well to genomic data. RESULTS Most genomic data possesses peculiar properties and data augmentations may significantly alter the intrinsic properties of the data. In this work, we propose a novel data augmentation technique for genomic data inspired by biology: point mutations. By employing point mutations as substitutes for codons, we demonstrate that our newly proposed data augmentation technique enhances the performance of DNNs across various genomic tasks that involve coding regions, such as translation initiation and splice site detection. CONCLUSION Silent and missense mutations are found to positively influence effectiveness, while nonsense mutations and random mutations in non-coding regions generally lead to degradation. Overall, point mutation-based augmentations in genomic datasets present valuable opportunities for improving the accuracy and reliability of predictive models for DNA sequences.
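As a rough illustration of the silent (synonymous) point mutations mentioned in the abstract above, the following sketch swaps one codon of a coding sequence for a synonymous codon that differs by a single base. It is not the authors' implementation: the names `translate`, `synonymous_neighbours` and `silent_mutation`, the use of the standard genetic code, and the uniform random choice of codon are all assumptions made here for illustration.

```python
import random
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumed here;
# the abstract does not say which codon table the authors used).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(cds):
    """Translate an in-frame, upper-case coding sequence into amino acids."""
    usable = len(cds) - len(cds) % 3
    return "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, usable, 3))


def synonymous_neighbours(codon):
    """Codons encoding the same amino acid that differ in exactly one base."""
    return [
        other for other, aa in CODON_TABLE.items()
        if aa == CODON_TABLE[codon]
        and sum(a != b for a, b in zip(codon, other)) == 1
    ]


def silent_mutation(cds, rng):
    """Return a copy of cds with one silent point mutation, if one exists."""
    positions = [
        i for i in range(0, len(cds) - len(cds) % 3, 3)
        if synonymous_neighbours(cds[i:i + 3])
    ]
    if not positions:
        return cds  # e.g. a sequence made only of ATG/TGG codons
    i = rng.choice(positions)
    new_codon = rng.choice(synonymous_neighbours(cds[i:i + 3]))
    return cds[:i] + new_codon + cds[i + 3:]


original = "ATGGCTCTGAAATAA"
augmented = silent_mutation(original, random.Random(0))
```

The augmented sequence differs from the original in one nucleotide but translates to the same protein, which is the property that makes such a mutation a label-preserving augmentation for tasks on coding regions.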
Affiliation(s)
- Utku Ozbulak
  - Center for Biosystems and Biotech Data Science, Ghent University Global Campus, Incheon, South Korea
- Homin Park
  - Center for Biosystems and Biotech Data Science, Ghent University Global Campus, Incheon, South Korea
  - IDLab, Department of Electronics and Information Systems, Ghent University, Ghent, Belgium
- Stephen Depuydt
  - Erasmus Brussels University of Applied Sciences and Arts, Brussels, Belgium
- Wesley De Neve
  - Center for Biosystems and Biotech Data Science, Ghent University Global Campus, Incheon, South Korea
  - IDLab, Department of Electronics and Information Systems, Ghent University, Ghent, Belgium
- Joris Vankerschaver
  - Center for Biosystems and Biotech Data Science, Ghent University Global Campus, Incheon, South Korea
  - Department of Applied Mathematics, Computer Science and Statistics, Ghent University, Ghent, Belgium
2
Liu X, Zhang H, Zeng Y, Zhu X, Zhu L, Fu J. DRANetSplicer: A Splice Site Prediction Model Based on Deep Residual Attention Networks. Genes (Basel) 2024; 15:404. [PMID: 38674339 PMCID: PMC11048956 DOI: 10.3390/genes15040404] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/20/2024] [Revised: 03/20/2024] [Accepted: 03/23/2024] [Indexed: 04/28/2024] Open
Abstract
The precise identification of splice sites is essential for unraveling the structure and function of genes, constituting a pivotal step in the gene annotation process. In this study, we developed a novel deep learning model, DRANetSplicer, that integrates residual learning and attention mechanisms for enhanced accuracy in capturing the intricate features of splice sites. We constructed multiple datasets using the most recent versions of genomic data from three different organisms, Oryza sativa japonica, Arabidopsis thaliana and Homo sapiens. This approach allows us to train models with a richer set of high-quality data. DRANetSplicer outperformed benchmark methods on donor and acceptor splice site datasets, achieving an average accuracy of (96.57%, 95.82%) across the three organisms. Comparative analyses with benchmark methods, including SpliceFinder, Splice2Deep, Deep Splicer, EnsembleSplice, and DNABERT, revealed DRANetSplicer's superior predictive performance, resulting in at least a (4.2%, 11.6%) relative reduction in average error rate. We utilized the DRANetSplicer model trained on O. sativa japonica data to predict splice sites in A. thaliana, achieving accuracies for donor and acceptor sites of (94.89%, 94.25%). These results indicate that DRANetSplicer possesses excellent cross-organism predictive capabilities, with its performance in cross-organism predictions even surpassing that of benchmark methods in non-cross-organism predictions. Cross-organism validation showcased DRANetSplicer's excellence in predicting splice sites across similar organisms, supporting its applicability in gene annotation for understudied organisms. We employed multiple methods to visualize the decision-making process of the model. The visualization results indicate that DRANetSplicer can learn and interpret well-known biological features, further validating its overall performance. 
Our study systematically examined and confirmed the predictive ability of DRANetSplicer from various levels and perspectives, indicating that its practical application in gene annotation is justified.
Affiliation(s)
- Xueyan Liu
  - College of Information and Intelligence, Hunan Agricultural University, Changsha 410128, China
- Hongyan Zhang
  - College of Information and Intelligence, Hunan Agricultural University, Changsha 410128, China
- Ying Zeng
  - School of Computer and Communication, Hunan Institute of Engineering, Xiangtan 411104, China
- Xinghui Zhu
  - College of Information and Intelligence, Hunan Agricultural University, Changsha 410128, China
- Lei Zhu
  - College of Information and Intelligence, Hunan Agricultural University, Changsha 410128, China
- Jiahui Fu
  - College of Information and Intelligence, Hunan Agricultural University, Changsha 410128, China
3
Chao KH, Mao A, Salzberg SL, Pertea M. Splam: a deep-learning-based splice site predictor that improves spliced alignments. bioRxiv 2023:2023.07.27.550754. [PMID: 37546880 PMCID: PMC10402160 DOI: 10.1101/2023.07.27.550754] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 08/08/2023]
Abstract
The process of splicing messenger RNA to remove introns plays a central role in creating genes and gene variants. Here we describe Splam, a novel method for predicting splice junctions in DNA based on deep residual convolutional neural networks. Unlike some previous models, Splam looks at a relatively limited window of 400 base pairs flanking each splice site, motivated by the observation that the biological process of splicing relies primarily on signals within this window. Additionally, Splam introduces the idea of training the network on donor and acceptor pairs together, based on the principle that the splicing machinery recognizes both ends of each intron at once. We compare Splam's accuracy to recent state-of-the-art splice site prediction methods, particularly SpliceAI, another method that uses deep neural networks. Our results show that Splam is consistently more accurate than SpliceAI, with an overall accuracy of 96% at predicting human splice junctions. Splam generalizes even to non-human species, including distant ones like the flowering plant Arabidopsis thaliana. Finally, we demonstrate the use of Splam on a novel application: processing the spliced alignments of RNA-seq data to identify and eliminate errors. We show that when used in this manner, Splam yields substantial improvements in the accuracy of downstream transcriptome analysis of both poly(A) and ribo-depleted RNA-seq libraries. Overall, Splam offers a faster and more accurate approach to detecting splice junctions, while also providing a reliable and efficient solution for cleaning up erroneous spliced alignments.
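To make the idea of a limited window around a donor/acceptor pair concrete, here is a minimal sketch of how such an input could be assembled from a sequence. It is an assumption-laden illustration, not Splam's code: the abstract does not say how the 400 base pairs are split, so the sketch assumes 200 bases on each side of a site, pads with `N` at sequence ends, and simply concatenates the two windows. The names `window` and `junction_input` are hypothetical.

```python
def window(seq, center, flank=200):
    """Return 2*flank bases around `center`, padding with N past the ends.

    Assumption: the window is split evenly, with seq[center] at index `flank`.
    """
    start, end = center - flank, center + flank
    left_pad = "N" * max(0, -start)
    right_pad = "N" * max(0, end - len(seq))
    return left_pad + seq[max(0, start):min(len(seq), end)] + right_pad


def junction_input(seq, donor, acceptor, flank=200):
    """Concatenate the donor and acceptor windows of one intron."""
    return window(seq, donor, flank) + window(seq, acceptor, flank)


# Toy transcript: 50-base exon, GT...AG intron, 50-base exon.
seq = "A" * 50 + "GT" + "C" * 300 + "AG" + "T" * 50
donor = 50       # index of the G in the donor GT dimer
acceptor = 354   # index of the first exonic base after the acceptor AG
pair = junction_input(seq, donor, acceptor)
```

Presenting both ends of the intron in a single input is what lets a model score the junction as a pair rather than as two independent sites.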
Affiliation(s)
- Kuan-Hao Chao
  - Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218, USA
  - Center for Computational Biology, Johns Hopkins University, Baltimore, MD 21218, USA
- Alan Mao
  - Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218, USA
  - Center for Computational Biology, Johns Hopkins University, Baltimore, MD 21218, USA
  - Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD 21218, USA
- Steven L Salzberg
  - Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218, USA
  - Center for Computational Biology, Johns Hopkins University, Baltimore, MD 21218, USA
  - Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD 21218, USA
  - Department of Biostatistics, Johns Hopkins University, Baltimore, MD 21211, USA
- Mihaela Pertea
  - Center for Computational Biology, Johns Hopkins University, Baltimore, MD 21218, USA
  - Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD 21218, USA
4
Zabardast A, Tamer EG, Son YA, Yılmaz A. An automated framework for evaluation of deep learning models for splice site predictions. Sci Rep 2023; 13:10221. [PMID: 37353532 PMCID: PMC10290104 DOI: 10.1038/s41598-023-34795-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/02/2022] [Accepted: 05/08/2023] [Indexed: 06/25/2023] Open
Abstract
A novel framework for the automated evaluation of various deep learning-based splice site detectors is presented. The framework eliminates time-consuming development and experimenting activities for different codebases, architectures, and configurations to obtain the best models for a given RNA splice site dataset. RNA splicing is a cellular process in which pre-mRNAs are processed into mature mRNAs and used to produce multiple mRNA transcripts from a single gene sequence. Since the advancement of sequencing technologies, many splice site variants have been identified and associated with diseases. RNA splice site prediction is therefore essential for gene finding, genome annotation, the study of disease-causing variants, and the identification of potential biomarkers. Recently, deep learning models have classified genomic signals with high accuracy. Convolutional Neural Network (CNN), Long Short-Term Memory (LSTM) and its bidirectional version (BLSTM), and Gated Recurrent Unit (GRU) and its bidirectional version (BGRU) are promising models. During genomic data analysis, CNN's locality feature helps where each nucleotide correlates with other bases in its vicinity. In contrast, BLSTM can be trained bidirectionally, allowing sequential data to be processed from forward and reverse directions. Therefore, it can process 1-D encoded genomic data effectively. Even though both methods have been used in the literature, a performance comparison was missing. To compare selected models under similar conditions, we have created a blueprint for a series of networks with five different levels. As a case study, we compared CNN and BLSTM models' learning capabilities as building blocks for RNA splice site prediction in two different datasets.
Overall, CNN performed better with [Formula: see text] accuracy ([Formula: see text] improvement), [Formula: see text] F1 score ([Formula: see text] improvement), and [Formula: see text] AUC-PR ([Formula: see text] improvement) in human splice site prediction. Likewise, superior performance with [Formula: see text] accuracy ([Formula: see text] improvement), [Formula: see text] F1 score ([Formula: see text] improvement), and [Formula: see text] AUC-PR ([Formula: see text] improvement) is achieved in C. elegans splice site prediction. Overall, our results showed that CNN learns faster than BLSTM and BGRU. Moreover, CNN performs better at extracting sequence patterns than BLSTM and BGRU. To our knowledge, no other framework has been developed explicitly for evaluating splice detection models to decide the best possible model in an automated manner. The proposed framework and blueprint would therefore help in selecting among different deep learning models, such as CNN vs. BLSTM and BGRU, for splice site analysis or similar classification tasks in different problems.
Affiliation(s)
- Amin Zabardast
  - Department of Health Informatics, Graduate School of Informatics, Middle East Technical University, Ankara, Turkey
- Elif Güney Tamer
  - Department of Health Informatics, Graduate School of Informatics, Middle East Technical University, Ankara, Turkey
- Yeşim Aydın Son
  - Department of Health Informatics, Graduate School of Informatics, Middle East Technical University, Ankara, Turkey
- Arif Yılmaz
  - Institute of Data Science, Maastricht University, Maastricht, The Netherlands
5
Liu Q, Fang H, Wang X, Wang M, Li S, Coin LJM, Li F, Song J. DeepGenGrep: a general deep learning-based predictor for multiple genomic signals and regions. Bioinformatics 2022; 38:4053-4061. [PMID: 35799358 DOI: 10.1093/bioinformatics/btac454] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/12/2022] [Revised: 04/11/2022] [Accepted: 07/06/2022] [Indexed: 12/24/2022] Open
Abstract
MOTIVATION Accurate annotation of different genomic signals and regions (GSRs) from DNA sequences is fundamentally important for understanding gene structure, regulation and function. Numerous efforts have been made to develop machine learning-based predictors for in silico identification of GSRs. However, it remains a great challenge to identify GSRs as the performance of most existing approaches is unsatisfactory. As such, it is highly desirable to develop more accurate computational methods for GSRs prediction. RESULTS In this study, we propose a general deep learning framework termed DeepGenGrep, a general predictor for the systematic identification of multiple different GSRs from genomic DNA sequences. DeepGenGrep leverages the power of hybrid neural networks comprising a three-layer convolutional neural network and a two-layer long short-term memory to effectively learn useful feature representations from sequences. Benchmarking experiments demonstrate that DeepGenGrep outperforms several state-of-the-art approaches on identifying polyadenylation signals, translation initiation sites and splice sites across four eukaryotic species including Homo sapiens, Mus musculus, Bos taurus and Drosophila melanogaster. Overall, DeepGenGrep represents a useful tool for the high-throughput and cost-effective identification of potential GSRs in eukaryotic genomes. AVAILABILITY AND IMPLEMENTATION The webserver and source code are freely available at http://bigdata.biocie.cn/deepgengrep/home and Github (https://github.com/wx-cie/DeepGenGrep/). SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Quanzhong Liu
  - Department of Software Engineering, College of Information Engineering, Northwest A&F University, Yangling 712100, China
- Honglin Fang
  - Department of Software Engineering, College of Information Engineering, Northwest A&F University, Yangling 712100, China
- Xiao Wang
  - Department of Software Engineering, College of Information Engineering, Northwest A&F University, Yangling 712100, China
- Miao Wang
  - Department of Software Engineering, College of Information Engineering, Northwest A&F University, Yangling 712100, China
- Shuqin Li
  - Department of Software Engineering, College of Information Engineering, Northwest A&F University, Yangling 712100, China
- Lachlan J M Coin
  - Department of Microbiology and Immunology, The Peter Doherty Institute for Infection and Immunity, The University of Melbourne, Melbourne, VIC 3000, Australia
- Fuyi Li
  - Department of Software Engineering, College of Information Engineering, Northwest A&F University, Yangling 712100, China
  - Department of Microbiology and Immunology, The Peter Doherty Institute for Infection and Immunity, The University of Melbourne, Melbourne, VIC 3000, Australia
- Jiangning Song
  - Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia
  - Monash Data Futures Institute, Monash University, Melbourne, VIC 3800, Australia
6
Scalzitti N, Kress A, Orhand R, Weber T, Moulinier L, Jeannin-Girardon A, Collet P, Poch O, Thompson JD. Spliceator: multi-species splice site prediction using convolutional neural networks. BMC Bioinformatics 2021; 22:561. [PMID: 34814826 PMCID: PMC8609763 DOI: 10.1186/s12859-021-04471-3] [Citation(s) in RCA: 34] [Impact Index Per Article: 8.5] [Received: 07/04/2021] [Accepted: 11/09/2021] [Indexed: 12/14/2022] Open
Abstract
Background Ab initio prediction of splice sites is an essential step in eukaryotic genome annotation. Recent predictors have exploited Deep Learning algorithms and reliable gene structures from model organisms. However, Deep Learning methods for non-model organisms are lacking. Results We developed Spliceator to predict splice sites in a wide range of species, including model and non-model organisms. Spliceator uses a convolutional neural network and is trained on carefully validated data from over 100 organisms. We show that Spliceator achieves consistently high accuracy (89–92%) compared to existing methods on independent benchmarks from human, fish, fly, worm, plant and protist organisms. Conclusions Spliceator is a new Deep Learning method trained on high-quality data, which can be used to predict splice sites in diverse organisms, ranging from human to protists, with consistently high accuracy. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-021-04471-3.
Affiliation(s)
- Nicolas Scalzitti
  - Complex Systems and Translational Bioinformatics (CSTB), ICube Laboratory, UMR7357, University of Strasbourg, 1 rue Eugène Boeckel, 67000, Strasbourg, France
- Arnaud Kress
  - Complex Systems and Translational Bioinformatics (CSTB), ICube Laboratory, UMR7357, University of Strasbourg, 1 rue Eugène Boeckel, 67000, Strasbourg, France
  - BiGEst-ICube Platform, ICube Laboratory, UMR7357, 1 rue Eugène Boeckel, 67000, Strasbourg, France
- Romain Orhand
  - Complex Systems and Translational Bioinformatics (CSTB), ICube Laboratory, UMR7357, University of Strasbourg, 1 rue Eugène Boeckel, 67000, Strasbourg, France
- Thomas Weber
  - Complex Systems and Translational Bioinformatics (CSTB), ICube Laboratory, UMR7357, University of Strasbourg, 1 rue Eugène Boeckel, 67000, Strasbourg, France
- Luc Moulinier
  - Complex Systems and Translational Bioinformatics (CSTB), ICube Laboratory, UMR7357, University of Strasbourg, 1 rue Eugène Boeckel, 67000, Strasbourg, France
  - BiGEst-ICube Platform, ICube Laboratory, UMR7357, 1 rue Eugène Boeckel, 67000, Strasbourg, France
- Anne Jeannin-Girardon
  - Complex Systems and Translational Bioinformatics (CSTB), ICube Laboratory, UMR7357, University of Strasbourg, 1 rue Eugène Boeckel, 67000, Strasbourg, France
- Pierre Collet
  - Complex Systems and Translational Bioinformatics (CSTB), ICube Laboratory, UMR7357, University of Strasbourg, 1 rue Eugène Boeckel, 67000, Strasbourg, France
- Olivier Poch
  - Complex Systems and Translational Bioinformatics (CSTB), ICube Laboratory, UMR7357, University of Strasbourg, 1 rue Eugène Boeckel, 67000, Strasbourg, France
- Julie D Thompson
  - Complex Systems and Translational Bioinformatics (CSTB), ICube Laboratory, UMR7357, University of Strasbourg, 1 rue Eugène Boeckel, 67000, Strasbourg, France
7
De Vos S, Rombauts S, Coussement L, Dermauw W, Vuylsteke M, Sorgeloos P, Clegg JS, Nambu Z, Van Nieuwerburgh F, Norouzitallab P, Van Leeuwen T, De Meyer T, Van Stappen G, Van de Peer Y, Bossier P. The genome of the extremophile Artemia provides insight into strategies to cope with extreme environments. BMC Genomics 2021; 22:635. [PMID: 34465293 PMCID: PMC8406910 DOI: 10.1186/s12864-021-07937-z] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.8] [Received: 01/22/2021] [Accepted: 08/14/2021] [Indexed: 12/13/2022] Open
Abstract
BACKGROUND Brine shrimp Artemia have an unequalled ability to endure extreme salinity and complete anoxia. This study aims to elucidate its strategies to cope with these stressors. RESULTS AND DISCUSSION Here, we present the genome of an inbred A. franciscana Kellogg, 1906. We identified 21,828 genes, of which 674 were differentially expressed under high salinity and 900 under anoxia (42% and 30%, respectively, were annotated). Under high salinity, relevant stress genes and pathways included several Heat Shock Protein and Late Embryogenesis Abundant genes, as well as the trehalose metabolism. In addition, based on differential gene expression analysis, it can be hypothesized that a high oxidative stress response and endocytosis/exocytosis are potential salt management strategies, in addition to the expression of major facilitator superfamily genes responsible for transmembrane ion transport. Under anoxia, genes involved in mitochondrial function, mTOR signalling and autophagy were differentially expressed. Both high salt and anoxia enhanced degradation of erroneous proteins and protein chaperoning. Compared with other branchiopod genomes, Artemia had 0.03% contracted and 6% expanded orthogroups, in which 14% of the genes were differentially expressed under high salinity or anoxia. One phospholipase D gene family, shown to be important in plant stress response, was uniquely present in both extremophiles Artemia and the tardigrade Hypsibius dujardini, yet not differentially expressed under the described experimental conditions. CONCLUSIONS A relatively complete genome of Artemia was assembled, annotated and analysed, facilitating research on its extremophile features, and providing a reference sequence for crustacean research.
Affiliation(s)
- Stephanie De Vos
  - Laboratory of Aquaculture & Artemia Reference Center, Department of Animal Sciences and Aquatic Ecology, Faculty of Bioscience Engineering, Ghent University, Ghent, Belgium
  - Department of Plant Systems Biology, VIB, Department of Biotechnology and Bioinformatics, Ghent University, Ghent, Belgium
- Stephane Rombauts
  - Department of Plant Systems Biology, VIB, Department of Biotechnology and Bioinformatics, Ghent University, Ghent, Belgium
- Louis Coussement
  - Department of Data Analysis and Mathematical Modelling, Faculty of Bioscience Engineering, Ghent University, Ghent, Belgium
- Wannes Dermauw
  - Department of Plants and Crops, Faculty of Bioscience Engineering, Ghent University, Ghent, Belgium
- Patrick Sorgeloos
  - Laboratory of Aquaculture & Artemia Reference Center, Department of Animal Sciences and Aquatic Ecology, Faculty of Bioscience Engineering, Ghent University, Ghent, Belgium
- James S Clegg
  - Coastal and Marine Sciences Institute, University of California, Bodega Bay, Davis, CA, USA
- Ziro Nambu
  - Department of Medical Technology, School of Health Sciences, University of Occupational and Environmental Health, Japan, Kitakyushu, Fukuoka, Japan
- Filip Van Nieuwerburgh
  - Department of Pharmaceutics, Faculty of Pharmaceutical Sciences, Ghent University, Ghent, Belgium
- Parisa Norouzitallab
  - Laboratory of Aquaculture & Artemia Reference Center, Department of Animal Sciences and Aquatic Ecology, Faculty of Bioscience Engineering, Ghent University, Ghent, Belgium
  - Laboratory for Immunology and Animal Biotechnology, Department of Animal Sciences and Aquatic Ecology, Faculty of Bioscience Engineering, Ghent University, Ghent, Belgium
- Thomas Van Leeuwen
  - Department of Plants and Crops, Faculty of Bioscience Engineering, Ghent University, Ghent, Belgium
- Tim De Meyer
  - Department of Data Analysis and Mathematical Modelling, Faculty of Bioscience Engineering, Ghent University, Ghent, Belgium
- Gilbert Van Stappen
  - Laboratory of Aquaculture & Artemia Reference Center, Department of Animal Sciences and Aquatic Ecology, Faculty of Bioscience Engineering, Ghent University, Ghent, Belgium
- Yves Van de Peer
  - Department of Plant Systems Biology, VIB, Department of Biotechnology and Bioinformatics, Ghent University, Ghent, Belgium
  - Centre for Microbial Ecology and Genomics, Department of Biochemistry, Genetics and Microbiology, University of Pretoria, Pretoria, South Africa
- Peter Bossier
  - Laboratory of Aquaculture & Artemia Reference Center, Department of Animal Sciences and Aquatic Ecology, Faculty of Bioscience Engineering, Ghent University, Ghent, Belgium
8
Dutta A, Singh KK, Anand A. SpliceViNCI: Visualizing the splicing of non-canonical introns through recurrent neural networks. J Bioinform Comput Biol 2021; 19:2150014. [PMID: 34088258 DOI: 10.1142/s0219720021500141] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/18/2022]
Abstract
Most of the current computational models for splice junction prediction are based on the identification of canonical splice junctions. However, it is observed that junctions lacking the consensus dimers GT and AG also undergo splicing. Identification of such splice junctions, called non-canonical splice junctions, is also essential for a comprehensive understanding of the splicing phenomenon. This work focuses on the identification of non-canonical splice junctions through the application of a bidirectional long short-term memory (BLSTM) network. Furthermore, we apply a back-propagation-based (integrated gradient) and a perturbation-based (occlusion) visualization technique to extract the non-canonical splicing features learned by the model. The features obtained are validated with the existing knowledge from the literature. Integrated gradient extracts features that comprise contiguous nucleotides, whereas occlusion extracts features that are individual nucleotides distributed across the sequence.
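The perturbation-based (occlusion) idea named in the abstract above can be sketched in a few lines: mask one nucleotide at a time and record how much the model's score drops. This is a generic illustration, not the SpliceViNCI code; `occlusion_importance` is a hypothetical name, the `N` mask is an assumption, and `toy_score` is a stand-in for a trained network.

```python
def occlusion_importance(seq, score_fn, mask="N"):
    """Score drop caused by masking each position of seq in turn."""
    baseline = score_fn(seq)
    return [
        baseline - score_fn(seq[:i] + mask + seq[i + 1:])
        for i in range(len(seq))
    ]


def toy_score(seq):
    # Hypothetical stand-in for a trained model: it responds only to a
    # GT dimer at offsets 3-4 and ignores every other position.
    return 1.0 if seq[3:5] == "GT" else 0.0


importance = occlusion_importance("AAAGTAAA", toy_score)
```

Because each position is perturbed independently, the resulting importances attach to individual nucleotides, which is consistent with the abstract's observation about the kind of features occlusion extracts.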
Affiliation(s)
- Aparajita Dutta
  - Department of CSE, Indian Institute of Technology, Guwahati, India
- Ashish Anand
  - Department of CSE, Indian Institute of Technology, Guwahati, India
9
Amilpur S, Bhukya R. EDeepSSP: Explainable deep neural networks for exact splice sites prediction. J Bioinform Comput Biol 2020; 18:2050024. [PMID: 32696716 DOI: 10.1142/s0219720020500249] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Indexed: 11/18/2022]
Abstract
Splice site prediction is crucial for understanding gene regulation and gene function and for better genome annotation. Many computational methods exist for recognizing splice sites. Although most of the methods achieve competent performance, their interpretability remains challenging. Moreover, all traditional machine learning methods extract features manually, which is a tedious job. To address these challenges, we propose a deep learning-based approach (EDeepSSP) that employs a convolutional neural network (CNN) architecture for automatic feature extraction and effectively predicts splice sites. Our model, EDeepSSP, divulges the opaque nature of CNN by extracting significant motifs and explains why these motifs are vital for predicting splice sites. In this study, experiments have been conducted on six benchmark acceptor and donor datasets of humans, cress, and fly. The results show that EDeepSSP has outperformed many state-of-the-art approaches. EDeepSSP achieves the highest area under the receiver operating characteristic curve (AUC_ROC) and area under the precision-recall curve (AUC_PR) of 99.32% and 99.26% on human donor datasets, respectively. We also analyze various filter activities, feature activations, and extracted significant motifs responsible for the splice site prediction. Further, we validate the learned motifs of our model against known motifs of the JASPAR splice site database.
Affiliation(s)
- Santhosh Amilpur
  - Computer Science and Engineering, National Institute of Technology Warangal, Warangal, Telangana 506004, India
- Raju Bhukya
  - Computer Science and Engineering, National Institute of Technology Warangal, Warangal, Telangana 506004, India
10
Payrovnaziri SN, Chen Z, Rengifo-Moreno P, Miller T, Bian J, Chen JH, Liu X, He Z. Explainable artificial intelligence models using real-world electronic health record data: a systematic scoping review. J Am Med Inform Assoc 2020; 27:1173-1185. [PMID: 32417928 PMCID: PMC7647281 DOI: 10.1093/jamia/ocaa053] [Citation(s) in RCA: 111] [Impact Index Per Article: 22.2] [Received: 01/22/2020] [Revised: 04/01/2020] [Accepted: 04/07/2020] [Indexed: 01/08/2023] Open
Abstract
OBJECTIVE To conduct a systematic scoping review of explainable artificial intelligence (XAI) models that use real-world electronic health record data, categorize these techniques according to different biomedical applications, identify gaps of current studies, and suggest future research directions. MATERIALS AND METHODS We searched MEDLINE, IEEE Xplore, and the Association for Computing Machinery (ACM) Digital Library to identify relevant papers published between January 1, 2009 and May 1, 2019. We summarized these studies based on the year of publication, prediction tasks, machine learning algorithm, dataset(s) used to build the models, the scope, category, and evaluation of the XAI methods. We further assessed the reproducibility of the studies in terms of the availability of data and code and discussed open issues and challenges. RESULTS Forty-two articles were included in this review. We reported the research trend and most-studied diseases. We grouped XAI methods into 5 categories: knowledge distillation and rule extraction (N = 13), intrinsically interpretable models (N = 9), data dimensionality reduction (N = 8), attention mechanism (N = 7), and feature interaction and importance (N = 5). DISCUSSION XAI evaluation is an open issue that requires a deeper focus in the case of medical applications. We also discuss the importance of reproducibility of research work in this field, as well as the challenges and opportunities of XAI from 2 medical professionals' point of view. CONCLUSION Based on our review, we found that XAI evaluation in medicine has not been adequately and formally practiced. Reproducibility remains a critical concern. Ample opportunities exist to advance XAI research in medicine.
Affiliation(s)
- Zhaoyi Chen
  - Department of Health Outcomes and Biomedical Informatics, University of Florida, Gainesville, Florida, USA
- Pablo Rengifo-Moreno
  - College of Medicine, Florida State University, Tallahassee, Florida, USA
  - Tallahassee Memorial Hospital, Tallahassee, Florida, USA
- Tim Miller
  - School of Computing and Information Systems, The University of Melbourne, Melbourne, Victoria, Australia
- Jiang Bian
  - Department of Health Outcomes and Biomedical Informatics, University of Florida, Gainesville, Florida, USA
- Jonathan H Chen
  - Center for Biomedical Informatics Research, Department of Medicine, Stanford University, Stanford, California, USA
  - Division of Hospital Medicine, Department of Medicine, Stanford University, Stanford, California, USA
- Xiuwen Liu
  - Department of Computer Science, Florida State University, Tallahassee, Florida, USA
- Zhe He
  - School of Information, Florida State University, Tallahassee, Florida, USA
11
|
Albaradei S, Magana-Mora A, Thafar M, Uludag M, Bajic VB, Gojobori T, Essack M, Jankovic BR. Splice2Deep: An ensemble of deep convolutional neural networks for improved splice site prediction in genomic DNA. Gene 2020; 763S:100035. [PMID: 32550561 PMCID: PMC7285987 DOI: 10.1016/j.gene.2020.100035] [Citation(s) in RCA: 21] [Impact Index Per Article: 4.2] [Received: 03/30/2020] [Accepted: 05/06/2020] [Indexed: 12/21/2022]
Abstract
Background The accurate identification of the exon/intron boundaries is critical for the correct annotation of genes with multiple exons. Donor and acceptor splice sites (SS) demarcate these boundaries. Therefore, deriving accurate computational models to predict the SS is useful for functional annotation of genes and genomes, and for finding alternative SS associated with different diseases. Although various models have been proposed for the in silico prediction of SS, improving their accuracy is required for reliable annotation. Moreover, models are often derived and tested using the same genome, providing no evidence of broad application, i.e. to other poorly studied genomes. Results With this in mind, we developed the Splice2Deep models for SS detection. Each model is an ensemble of deep convolutional neural networks. We evaluated the performance of the models based on the ability to detect SS in Homo sapiens, Oryza sativa japonica, Arabidopsis thaliana, Drosophila melanogaster, and Caenorhabditis elegans. Results demonstrate that the models efficiently detect SS in other organisms not considered during the training of the models. Compared to the state-of-the-art tools, Splice2Deep models achieved significantly reduced average error rates of 41.97% and 28.51% for acceptor and donor SS, respectively. Moreover, the Splice2Deep cross-organism validation demonstrates that models correctly identify conserved genomic elements enabling annotation of SS in new genomes by choosing the taxonomically closest model. Conclusions The results of our study demonstrated that Splice2Deep achieved both a considerably reduced error rate compared to other state-of-the-art models and the ability to accurately recognize SS in other organisms for which the model was not trained, enabling annotation of poorly studied or newly sequenced genomes.
Splice2Deep models are implemented in Python using Keras API; the models and the data are available at https://github.com/SomayahAlbaradei/Splice_Deep.git.
Key Words
- AUC, area under curve
- AcSS, acceptor splice site
- Acc, accuracy
- Bioinformatics
- CNN, convolutional neural network
- CONV, convolutional layers
- DL, deep learning
- DNA, deoxyribonucleic acid
- DT, decision trees
- Deep-learning
- DoSS, donor splice site
- FC, fully connected layer
- ML, machine learning
- NB, naive Bayes
- NN, neural network
- POOL, pooling layer
- Prediction
- RF, random forest
- RNA, ribonucleic acid
- ReLU, rectified linear unit layer
- SS, splice site
- SVM, support vector machine
- Sn, sensitivity
- Sp, specificity
- Splice sites
- Splicing
Affiliation(s)
- Somayah Albaradei
  - Computer, Electrical and Mathematical Sciences and Engineering Division (CEMSE), Computational Bioscience Research Center, Computer (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Saudi Arabia
  - Faculty of Computing and Information Technology, King Abdulaziz University, Saudi Arabia
- Arturo Magana-Mora
  - Computer, Electrical and Mathematical Sciences and Engineering Division (CEMSE), Computational Bioscience Research Center, Computer (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Saudi Arabia
  - Saudi Aramco, EXPEC-ARC, Drilling Technology Team, Dhahran 31311, Saudi Arabia
- Maha Thafar
  - Computer, Electrical and Mathematical Sciences and Engineering Division (CEMSE), Computational Bioscience Research Center, Computer (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Saudi Arabia
  - Faculty of Computers and Information Systems, Taif University, Saudi Arabia
- Mahmut Uludag
  - Computer, Electrical and Mathematical Sciences and Engineering Division (CEMSE), Computational Bioscience Research Center, Computer (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Saudi Arabia
- Vladimir B Bajic
  - Computer, Electrical and Mathematical Sciences and Engineering Division (CEMSE), Computational Bioscience Research Center, Computer (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Saudi Arabia
- Takashi Gojobori
  - Computer, Electrical and Mathematical Sciences and Engineering Division (CEMSE), Computational Bioscience Research Center, Computer (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Saudi Arabia
  - Biological and Environmental Sciences and Engineering Division (BESE), King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Saudi Arabia
- Magbubah Essack
  - Computer, Electrical and Mathematical Sciences and Engineering Division (CEMSE), Computational Bioscience Research Center, Computer (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Saudi Arabia
- Boris R Jankovic
  - Computer, Electrical and Mathematical Sciences and Engineering Division (CEMSE), Computational Bioscience Research Center, Computer (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Saudi Arabia
|
12
|
Van Messem A. Support vector machines: A robust prediction method with applications in bioinformatics. Handbook of Statistics 2020. [DOI: 10.1016/bs.host.2019.08.003] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Indexed: 02/02/2023]
|
13
|
Wang R, Wang Z, Wang J, Li S. SpliceFinder: ab initio prediction of splice sites using convolutional neural network. BMC Bioinformatics 2019; 20:652. [PMID: 31881982 PMCID: PMC6933889 DOI: 10.1186/s12859-019-3306-3] [Citation(s) in RCA: 23] [Impact Index Per Article: 3.8] [Indexed: 12/11/2022] Open
Abstract
Background Identifying splice sites is a necessary step to analyze the location and structure of genes. Two dinucleotides, GT and AG, are highly frequent at splice sites, and many other patterns with important biological functions also occur there. However, these dinucleotides also occur frequently in sequences without splice sites, which makes prediction prone to false positives. Most existing tools select all the sequences with the two dimers and then focus on distinguishing the true splice sites from the pseudo ones. Such an approach decreases false positives; however, it misses non-canonical splice sites. Result We have designed SpliceFinder, based on a convolutional neural network (CNN), to predict splice sites. To achieve ab initio prediction, we used human genomic data to train our neural network. An iterative approach is adopted to reconstruct the dataset, which tackles the data imbalance problem and forces the model to learn more features of splice sites. The proposed CNN obtains a classification accuracy of 90.25%, which is 10% higher than the existing algorithms. The method outperforms other existing methods in terms of area under the receiver operating characteristic curve (AUC), recall, precision, and F1 score. Furthermore, SpliceFinder can find the exact position of splice sites on long genomic sequences with a sliding window. Compared with other state-of-the-art splice site prediction tools, SpliceFinder produces about half as many false positives while keeping recall higher than 0.8. Also, SpliceFinder captures the non-canonical splice sites. In addition, SpliceFinder performs well on the genomic sequences of Drosophila melanogaster, Mus musculus, Rattus, and Danio rerio without retraining. Conclusion Based on CNN, we have proposed a new ab initio splice site prediction tool, SpliceFinder, which generates fewer false positives and can detect non-canonical splice sites.
Additionally, SpliceFinder is transferable to other species without retraining. The source code and additional materials are available at https://gitlab.deepomics.org/wangruohan/SpliceFinder.
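The candidate-selection step that the abstract attributes to most existing tools can be sketched as follows. This is a minimal illustration and not SpliceFinder's code: the function name `find_candidate_sites` and the 0-based position convention are assumptions made for this example. It shows why scanning for GT and AG yields many pseudo sites and, by construction, never proposes a non-canonical one.

```python
from typing import Dict, List


def find_candidate_sites(seq: str) -> Dict[str, List[int]]:
    """Return 0-based start positions of every GT (donor-like)
    and AG (acceptor-like) dinucleotide in the sequence.

    Hypothetical helper for illustration only.
    """
    seq = seq.upper()
    candidates: Dict[str, List[int]] = {"donor": [], "acceptor": []}
    for i in range(len(seq) - 1):
        dinucleotide = seq[i:i + 2]
        if dinucleotide == "GT":
            candidates["donor"].append(i)
        elif dinucleotide == "AG":
            candidates["acceptor"].append(i)
    return candidates


if __name__ == "__main__":
    # A 10-nt toy sequence already contains four candidates.
    print(find_candidate_sites("AAGTCCAGGT"))
    # {'donor': [2, 8], 'acceptor': [1, 6]}
```

Every position returned is only a candidate; a classifier (a CNN in SpliceFinder's case) still has to separate true sites from pseudo ones.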
Affiliation(s)
- Ruohan Wang
  - Department of Computer Science, City University of Hong Kong, 83 Tat Chee Ave, Kowloon Tong, Hong Kong, China
- Zishuai Wang
  - Department of Computer Science, City University of Hong Kong, 83 Tat Chee Ave, Kowloon Tong, Hong Kong, China
- Jianping Wang
  - Department of Computer Science, City University of Hong Kong, 83 Tat Chee Ave, Kowloon Tong, Hong Kong, China
- Shuaicheng Li
  - Department of Computer Science, City University of Hong Kong, 83 Tat Chee Ave, Kowloon Tong, Hong Kong, China
|
14
|
Linsmith G, Rombauts S, Montanari S, Deng CH, Celton JM, Guérif P, Liu C, Lohaus R, Zurn JD, Cestaro A, Bassil NV, Bakker LV, Schijlen E, Gardiner SE, Lespinasse Y, Durel CE, Velasco R, Neale DB, Chagné D, Van de Peer Y, Troggio M, Bianco L. Pseudo-chromosome-length genome assembly of a double haploid "Bartlett" pear (Pyrus communis L.). Gigascience 2019; 8:giz138. [PMID: 31816089 PMCID: PMC6901071 DOI: 10.1093/gigascience/giz138] [Citation(s) in RCA: 68] [Impact Index Per Article: 11.3] [Received: 07/16/2019] [Revised: 10/18/2019] [Accepted: 10/30/2019] [Indexed: 11/14/2022] Open
Abstract
BACKGROUND We report an improved assembly and scaffolding of the European pear (Pyrus communis L.) genome (referred to as BartlettDHv2.0), obtained using a combination of Pacific Biosciences RSII long-read sequencing, Bionano optical mapping, chromatin interaction capture (Hi-C), and genetic mapping. The sample selected for sequencing is a double haploid derived from the same "Bartlett" reference pear that was previously sequenced. Sequencing of di-haploid plants makes assembly more tractable in highly heterozygous species such as P. communis. FINDINGS A total of 496.9 Mb corresponding to 97% of the estimated genome size were assembled into 494 scaffolds. Hi-C data and a high-density genetic map allowed us to anchor and orient 87% of the sequence on the 17 pear chromosomes. Approximately 50% (247 Mb) of the genome consists of repetitive sequences. Gene annotation confirmed the presence of 37,445 protein-coding genes, which is 13% fewer than previously predicted. CONCLUSIONS We showed that the use of a doubled-haploid plant is an effective solution to the problems presented by high levels of heterozygosity and duplication for the generation of high-quality genome assemblies. We present a high-quality chromosome-scale assembly of the European pear Pyrus communis and demonstrate its high degree of synteny with the genomes of Malus x Domestica and Pyrus x bretschneideri.
Affiliation(s)
- Gareth Linsmith
  - Center for Plant Systems Biology, VIB, Technologiepark 71, 9052, Gent, Belgium
  - Department of Plant Biotechnology and Bioinformatics, Ghent University, Technologiepark 71, 9052 Gent, Belgium
  - Fondazione Edmund Mach, via E. Mach 1, 38010, San Michele all'Adige (TN), Italy
- Stephane Rombauts
  - Center for Plant Systems Biology, VIB, Technologiepark 71, 9052, Gent, Belgium
  - Department of Plant Biotechnology and Bioinformatics, Ghent University, Technologiepark 71, 9052 Gent, Belgium
- Sara Montanari
  - University of California Davis, Department of Plant Sciences, One Shields Ave, Davis, CA 95616, USA
- Cecilia H Deng
  - The New Zealand Institute for Plant & Food Research Limited (PFR), Mt Albert Research Centre, 120 Mt Albert Road, Sandringham, Auckland, 1025, New Zealand
- Jean-Marc Celton
  - IRHS, INRA, Agrocampus-Ouest, Université d'Angers, SFR 4207 Quasav, 42 rue Georges Morel, F-49071 Beaucouzé, France
- Philippe Guérif
  - IRHS, INRA, Agrocampus-Ouest, Université d'Angers, SFR 4207 Quasav, 42 rue Georges Morel, F-49071 Beaucouzé, France
- Chang Liu
  - ZMBP, Allgemeine Genetik, Universität Tübingen, Auf der Morgenstelle 32, D-72076 Tübingen, Germany
- Rolf Lohaus
  - Center for Plant Systems Biology, VIB, Technologiepark 71, 9052, Gent, Belgium
  - Department of Plant Biotechnology and Bioinformatics, Ghent University, Technologiepark 71, 9052 Gent, Belgium
- Jason D Zurn
  - USDA-ARS National Clonal Germplasm Repository, 33447 Peoria Road, Corvallis, OR 97333, USA
- Alessandro Cestaro
  - Fondazione Edmund Mach, via E. Mach 1, 38010, San Michele all'Adige (TN), Italy
- Nahla V Bassil
  - USDA-ARS National Clonal Germplasm Repository, 33447 Peoria Road, Corvallis, OR 97333, USA
- Linda V Bakker
  - Wageningen UR – Bioscience P.O. Box 16, 6700AA, Wageningen, The Netherlands
- Elio Schijlen
  - Wageningen UR – Bioscience P.O. Box 16, 6700AA, Wageningen, The Netherlands
- Susan E Gardiner
  - The New Zealand Institute for Plant & Food Research Limited (PFR), Palmerston North Research Centre, Palmerston North, New Zealand
- Yves Lespinasse
  - IRHS, INRA, Agrocampus-Ouest, Université d'Angers, SFR 4207 Quasav, 42 rue Georges Morel, F-49071 Beaucouzé, France
- Charles-Eric Durel
  - IRHS, INRA, Agrocampus-Ouest, Université d'Angers, SFR 4207 Quasav, 42 rue Georges Morel, F-49071 Beaucouzé, France
- Riccardo Velasco
  - CREA Research Centre for Viticulture and Enology, Via XXVIII Aprile 26, 31015 Conegliano (TV), Italy
- David B Neale
  - University of California Davis, Department of Plant Sciences, One Shields Ave, Davis, CA 95616, USA
- David Chagné
  - The New Zealand Institute for Plant & Food Research Limited (PFR), Palmerston North Research Centre, Palmerston North, New Zealand
- Yves Van de Peer
  - Center for Plant Systems Biology, VIB, Technologiepark 71, 9052, Gent, Belgium
  - Department of Plant Biotechnology and Bioinformatics, Ghent University, Technologiepark 71, 9052 Gent, Belgium
  - Center for Microbial Ecology and Genomics, Department of Biochemistry, Genetics and Microbiology, University of Pretoria, Roper street, Pretoria 0028, South Africa
- Michela Troggio
  - Fondazione Edmund Mach, via E. Mach 1, 38010, San Michele all'Adige (TN), Italy
- Luca Bianco
  - Fondazione Edmund Mach, via E. Mach 1, 38010, San Michele all'Adige (TN), Italy
|
15
|
Using the Chou's 5-steps rule to predict splice junctions with interpretable bidirectional long short-term memory networks. Comput Biol Med 2019; 116:103558. [PMID: 31783254 DOI: 10.1016/j.compbiomed.2019.103558] [Citation(s) in RCA: 24] [Impact Index Per Article: 4.0] [Received: 09/11/2019] [Revised: 11/17/2019] [Accepted: 11/18/2019] [Indexed: 11/21/2022]
Abstract
Neural models have been able to obtain state-of-the-art performances on several genome sequence-based prediction tasks. Such models take only nucleotide sequences as input and learn relevant features on their own. However, extracting the interpretable motifs from the model remains a challenge. This work explores various existing visualization techniques in their ability to infer relevant sequence information learnt by a recurrent neural network (RNN) on the task of splice junction identification. The visualization techniques have been modulated to suit the genome sequences as input. The visualizations inspect genomic regions at the level of a single nucleotide as well as a span of consecutive nucleotides. This inspection is performed based on the modification of input sequences (perturbation based) or the embedding space (back-propagation based). We infer features pertaining to both canonical and non-canonical splicing from a single neural model. Results indicate that the visualization techniques produce comparable performances for branchpoint detection. However, in the case of canonical donor and acceptor junction motifs, perturbation based visualizations perform better than back-propagation based visualizations, and vice-versa for non-canonical motifs. The source code of our stand-alone SpliceVisuL tool is available at https://github.com/aaiitggrp/SpliceVisuL.
|
16
|
Zeng Y, Yuan H, Yuan Z, Chen Y. A high-performance approach for predicting donor splice sites based on short window size and imbalanced large samples. Biol Direct 2019; 14:6. [PMID: 30975175 PMCID: PMC6460831 DOI: 10.1186/s13062-019-0236-y] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Received: 12/15/2018] [Accepted: 03/18/2019] [Indexed: 11/10/2022] Open
Abstract
Background Splice site prediction has been a long-standing problem in bioinformatics. Although many computational approaches developed for splice site prediction have achieved satisfactory accuracy, further improvement in predictive accuracy remains important, because it contributes to more accurate prediction of gene structure. Determining a proper window size before prediction is necessary. An overly long window may introduce irrelevant features, which would reduce predictive accuracy, while a short window with maximum information may perform better in terms of predictive accuracy and time cost. Furthermore, because the number of false splice sites following the GT–AG rule far exceeds that of true splice sites, accurate and rapid prediction of splice sites using imbalanced large samples has always been a challenge. Therefore, based on the short window size and imbalanced large samples, we developed a new computational method named chi-square decision table (χ2-DT) for donor splice site prediction. Results Using a short window size of 11 bp, χ2-DT extracts the improved positional features and compositional features based on the chi-square test, then introduces features one by one based on information gain, and constructs a balanced decision table aimed at implementing imbalanced pattern classification. With a 2000:271,132 (true sites:false sites) training set, χ2-DT achieves the highest independent test accuracy (93.34%) when compared with three classifiers (random forest, artificial neural network, and relaxed variable kernel density estimator) and takes a short computation time (89 s). χ2-DT also exhibits good independent test accuracy (92.40%) when validated with BG-570 mutated sequences with frameshift errors (nucleotide insertions and deletions). Moreover, χ2-DT is compared with the long-window size-based methods and the short-window size-based methods, and is found to perform better than all of them in terms of predictive accuracy.
Conclusions Based on short window size and imbalanced large samples, the proposed method not only achieves higher predictive accuracy than some existing methods, but also has high computational speed and good robustness against nucleotide insertions and deletions. Reviewers This article was reviewed by Ryan McGinty, Ph.D. and Dirk Walther. Electronic supplementary material The online version of this article (10.1186/s13062-019-0236-y) contains supplementary material, which is available to authorized users.
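To make the chi-square feature idea concrete, the sketch below computes a Pearson chi-square statistic for one window position from nucleotide counts in true and false sites. It is an illustration only and not the χ2-DT implementation: the function name `chi_square_statistic`, the table layout (rows are classes, columns are A, C, G, T), and the example counts are all assumptions.

```python
from typing import Sequence


def chi_square_statistic(table: Sequence[Sequence[float]]) -> float:
    """Pearson chi-square statistic of a contingency table.

    Rows are classes (e.g. true vs. false sites) and columns are
    categories (e.g. counts of A, C, G, T at one window position).
    Hypothetical helper for illustration only.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    if grand_total == 0:
        raise ValueError("contingency table is empty")

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of class and nucleotide.
            expected = row_totals[i] * col_totals[j] / grand_total
            if expected > 0:  # skip cells in all-zero rows or columns
                statistic += (observed - expected) ** 2 / expected
    return statistic


if __name__ == "__main__":
    # Made-up counts: a position enriched for G in true sites scores
    # higher than a position with identical nucleotide usage.
    informative = [[5, 5, 80, 10], [25, 25, 25, 25]]
    uninformative = [[25, 25, 25, 25], [25, 25, 25, 25]]
    print(chi_square_statistic(informative) > chi_square_statistic(uninformative))
```

Positions with a larger statistic differ more between the two classes, which is the sense in which a chi-square test can rank positional features within an 11 bp window.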
Affiliation(s)
- Ying Zeng
  - Hunan Engineering & Technology Research Center for Agricultural Big Data Analysis & Decision-making, Hunan Agricultural University, Changsha, 410128, Hunan, China
  - Orient Science & Technology College, Hunan Agricultural University, Changsha, 410128, Hunan, China
- Hongjie Yuan
  - Hunan Engineering & Technology Research Center for Agricultural Big Data Analysis & Decision-making, Hunan Agricultural University, Changsha, 410128, Hunan, China
- Zheming Yuan
  - Hunan Engineering & Technology Research Center for Agricultural Big Data Analysis & Decision-making, Hunan Agricultural University, Changsha, 410128, Hunan, China
  - Hunan Provincial Key Laboratory for Biology and Control of Plant Diseases and Insect Pests, Hunan Agricultural University, Changsha, 410128, Hunan, China
- Yuan Chen
  - Hunan Provincial Key Laboratory of Crop Germplasm Innovation and Utilization, Hunan Agricultural University, Changsha, 410128, Hunan, China
|
17
|
Zhang Y, Liu X, MacLeod J, Liu J. Discerning novel splice junctions derived from RNA-seq alignment: a deep learning approach. BMC Genomics 2018; 19:971. [PMID: 30591034 PMCID: PMC6307148 DOI: 10.1186/s12864-018-5350-1] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.6] [Received: 11/16/2017] [Accepted: 12/03/2018] [Indexed: 11/10/2022] Open
Abstract
Background Exon splicing is a regulated cellular process in the transcription of protein-coding genes. Technological advancements and cost reductions in RNA sequencing have made quantitative and qualitative assessments of the transcriptome both possible and widely available. RNA-seq provides unprecedented resolution to identify gene structures and resolve the diversity of splicing variants. However, currently available ab initio aligners are vulnerable to spurious alignments due to random sequence matches and sample-reference genome discordance. As a consequence, a significant set of false positive exon junction predictions would be introduced, which will further confuse downstream analyses of splice variant discovery and abundance estimation. Results In this work, we present a deep learning based splice junction sequence classifier, named DeepSplice, which employs convolutional neural networks to classify candidate splice junctions. We show (I) DeepSplice outperforms state-of-the-art methods for splice site classification when applied to the popular benchmark dataset HS3D, (II) DeepSplice shows high accuracy for splice junction classification with GENCODE annotation, and (III) the application of DeepSplice to classify putative splice junctions generated by Rail-RNA alignment of 21,504 human RNA-seq data significantly reduces 43 million candidates into around 3 million highly confident novel splice junctions. Conclusions A model inferred from the sequences of annotated exon junctions that can then classify splice junctions derived from primary RNA-seq data has been implemented. The performance of the model was evaluated and compared through comprehensive benchmarking and testing, indicating a reliable performance and gross usability for classifying novel splice junctions derived from RNA-seq alignment. 
Electronic supplementary material The online version of this article (10.1186/s12864-018-5350-1) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Yi Zhang
  - Department of Computer Science, University of Kentucky, Lexington, KY, 40506, USA
- Xinan Liu
  - Department of Computer Science, University of Kentucky, Lexington, KY, 40506, USA
- James MacLeod
  - Department of Veterinary Science, University of Kentucky, Lexington, KY, 40506, USA
- Jinze Liu
  - Department of Computer Science, University of Kentucky, Lexington, KY, 40506, USA
|
18
|
Plomion C, Aury JM, Amselem J, Leroy T, Murat F, Duplessis S, Faye S, Francillonne N, Labadie K, Le Provost G, Lesur I, Bartholomé J, Faivre-Rampant P, Kohler A, Leplé JC, Chantret N, Chen J, Diévart A, Alaeitabar T, Barbe V, Belser C, Bergès H, Bodénès C, Bogeat-Triboulot MB, Bouffaud ML, Brachi B, Chancerel E, Cohen D, Couloux A, Da Silva C, Dossat C, Ehrenmann F, Gaspin C, Grima-Pettenati J, Guichoux E, Hecker A, Herrmann S, Hugueney P, Hummel I, Klopp C, Lalanne C, Lascoux M, Lasserre E, Lemainque A, Desprez-Loustau ML, Luyten I, Madoui MA, Mangenot S, Marchal C, Maumus F, Mercier J, Michotey C, Panaud O, Picault N, Rouhier N, Rué O, Rustenholz C, Salin F, Soler M, Tarkka M, Velt A, Zanne AE, Martin F, Wincker P, Quesneville H, Kremer A, Salse J. Oak genome reveals facets of long lifespan. Nature Plants 2018; 4:440-452. [PMID: 29915331 PMCID: PMC6086335 DOI: 10.1038/s41477-018-0172-3] [Citation(s) in RCA: 212] [Impact Index Per Article: 30.3] [Received: 11/12/2017] [Accepted: 05/08/2018] [Indexed: 05/18/2023]
Abstract
Oaks are an important part of our natural and cultural heritage. Not only are they ubiquitous in our most common landscapes [1] but they have also supplied human societies with invaluable services, including food and shelter, since prehistoric times [2]. With 450 species spread throughout Asia, Europe and America [3], oaks constitute a critical global renewable resource. The longevity of oaks (several hundred years) probably underlies their emblematic cultural and historical importance. Such long-lived sessile organisms must persist in the face of a wide range of abiotic and biotic threats over their lifespans. We investigated the genomic features associated with such a long lifespan by sequencing, assembling and annotating the oak genome. We then used the growing number of whole-genome sequences for plants (including tree and herbaceous species) to investigate the parallel evolution of genomic characteristics potentially underpinning tree longevity. A further consequence of the long lifespan of trees is their accumulation of somatic mutations during mitotic divisions of stem cells present in the shoot apical meristems. Empirical [4] and modelling [5] approaches have shown that intra-organismal genetic heterogeneity can be selected for [6] and provides direct fitness benefits in the arms race with short-lived pests and pathogens through a patchwork of intra-organismal phenotypes [7]. However, there is no clear proof that large-statured trees consist of a genetic mosaic of clonally distinct cell lineages within and between branches. Through this case study of oak, we demonstrate the accumulation and transmission of somatic mutations and the expansion of disease-resistance gene families in trees.
Affiliation(s)
- Jean-Marc Aury
  - Commissariat à l'Energie Atomique (CEA), Genoscope, Institut de Biologie François-Jacob, Evry, France
- Sébastien Faye
  - Commissariat à l'Energie Atomique (CEA), Genoscope, Institut de Biologie François-Jacob, Evry, France
- Karine Labadie
  - Commissariat à l'Energie Atomique (CEA), Genoscope, Institut de Biologie François-Jacob, Evry, France
- Isabelle Lesur
  - BIOGECO, INRA, Université de Bordeaux, Cestas, France
  - HelixVenture, Mérignac, France
- Nathalie Chantret
  - AGAP, Université de Montpellier, CIRAD, INRA, Montpellier SupAgro, Montpellier, France
- Jun Chen
  - Department of Ecology and Genetics, Evolutionary Biology Centre, Science for Life Laboratory, Uppsala University, Uppsala, Sweden
- Anne Diévart
  - CIRAD, UMR AGAP, Montpellier, France
  - Université de Montpellier, CIRAD, INRA, Montpellier SupAgro, Montpellier, France
- Valérie Barbe
  - Commissariat à l'Energie Atomique (CEA), Genoscope, Institut de Biologie François-Jacob, Evry, France
- Caroline Belser
  - Commissariat à l'Energie Atomique (CEA), Genoscope, Institut de Biologie François-Jacob, Evry, France
- Marie-Lara Bouffaud
  - Department of Soil Ecology, UFZ-Helmholtz Centre for Environmental Research, Halle/Saale, Germany
- David Cohen
  - UMR Silva, INRA, Université de Lorraine, AgroPariTech, Nancy, France
- Arnaud Couloux
  - Commissariat à l'Energie Atomique (CEA), Genoscope, Institut de Biologie François-Jacob, Evry, France
- Corinne Da Silva
  - Commissariat à l'Energie Atomique (CEA), Genoscope, Institut de Biologie François-Jacob, Evry, France
- Carole Dossat
  - Commissariat à l'Energie Atomique (CEA), Genoscope, Institut de Biologie François-Jacob, Evry, France
- Christine Gaspin
  - Plateforme bioinformatique Toulouse Midi-Pyrénées, INRA, Auzeville Castanet-Tolosan, France
- Arnaud Hecker
  - IAM, INRA, Université de Lorraine, Champenoux, France
- Sylvie Herrmann
  - German Centre for Integrative Research (iDiv), Halle-Jena-Leipzig, Leipzig, Germany
- Irène Hummel
  - UMR Silva, INRA, Université de Lorraine, AgroPariTech, Nancy, France
- Christophe Klopp
  - Plateforme bioinformatique Toulouse Midi-Pyrénées, INRA, Auzeville Castanet-Tolosan, France
- Martin Lascoux
  - Department of Ecology and Genetics, Evolutionary Biology Centre, Science for Life Laboratory, Uppsala University, Uppsala, Sweden
- Eric Lasserre
  - Université de Perpignan, UMR 5096, Perpignan, France
- Arnaud Lemainque
  - Commissariat à l'Energie Atomique (CEA), Genoscope, Institut de Biologie François-Jacob, Evry, France
- Mohammed-Amin Madoui
  - Commissariat à l'Energie Atomique (CEA), Genoscope, Institut de Biologie François-Jacob, Evry, France
- Sophie Mangenot
  - Commissariat à l'Energie Atomique (CEA), Genoscope, Institut de Biologie François-Jacob, Evry, France
- Jonathan Mercier
  - Commissariat à l'Energie Atomique (CEA), Genoscope, Institut de Biologie François-Jacob, Evry, France
- Olivier Rué
  - Plateforme bioinformatique Toulouse Midi-Pyrénées, INRA, Auzeville Castanet-Tolosan, France
- Franck Salin
  - BIOGECO, INRA, Université de Bordeaux, Cestas, France
- Marçal Soler
  - Université de Toulouse, CNRS, UMR 5546, LRSV, Castanet-Tolosan, France
  - Laboratori del Suro, University of Girona, Girona, Spain
- Mika Tarkka
  - Department of Soil Ecology, UFZ-Helmholtz Centre for Environmental Research, Halle/Saale, Germany
- Amandine Velt
  - SVQV, Université de Strasbourg, INRA, Colmar, France
- Amy E Zanne
  - Department of Biological Sciences, George Washington University, Washington, DC, USA
- Patrick Wincker
  - Génomique Métabolique, Genoscope, Institut de Biologie François-Jacob, Commissariat à l'Energie Atomique (CEA), CNRS, Université d'Evry, Université Paris-Saclay, Evry, France
|
19
|
Zuallaert J, Godin F, Kim M, Soete A, Saeys Y, De Neve W. SpliceRover: interpretable convolutional neural networks for improved splice site prediction. Bioinformatics 2018; 34:4180-4188. [DOI: 10.1093/bioinformatics/bty497] [Citation(s) in RCA: 58] [Impact Index Per Article: 8.3] [Received: 10/23/2017] [Accepted: 06/19/2018] [Indexed: 11/13/2022] Open
Affiliation(s)
- Jasper Zuallaert
  - Center for Biotech Data Science, Department of Environmental Technology, Food Technology and Molecular Biotechnology, Ghent University Global Campus, Songdo, Incheon, South Korea
  - IDLab, Department for Electronics and Information Systems, Ghent University, Ghent, Belgium
- Fréderic Godin
  - IDLab, Department for Electronics and Information Systems, Ghent University, Ghent, Belgium
- Mijung Kim
  - Center for Biotech Data Science, Department of Environmental Technology, Food Technology and Molecular Biotechnology, Ghent University Global Campus, Songdo, Incheon, South Korea
  - IDLab, Department for Electronics and Information Systems, Ghent University, Ghent, Belgium
- Arne Soete
  - Department of Biomedical Molecular Biology, Ghent University, Ghent, Belgium
  - Data Mining and Modeling for Biomedicine, VIB Inflammation Research Center, Ghent, Belgium
- Yvan Saeys
  - Data Mining and Modeling for Biomedicine, VIB Inflammation Research Center, Ghent, Belgium
  - Department of Applied Mathematics, Computer Science and Statistics, Ghent University, Ghent, Belgium
- Wesley De Neve
  - Center for Biotech Data Science, Department of Environmental Technology, Food Technology and Molecular Biotechnology, Ghent University Global Campus, Songdo, Incheon, South Korea
  - IDLab, Department for Electronics and Information Systems, Ghent University, Ghent, Belgium
|
20
|
SpliceVec: Distributed feature representations for splice junction prediction. Comput Biol Chem 2018; 74:434-441. [DOI: 10.1016/j.compbiolchem.2018.03.009] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.6] [Received: 03/08/2018] [Accepted: 03/12/2018] [Indexed: 12/12/2022]
|
21
|
Abstract
Accurate splice-site prediction is essential to delineate gene structures from sequence data. Several computational techniques have been applied to create a system to predict canonical splice sites. For classification tasks, deep neural networks (DNNs) have achieved record-breaking results and often outperformed other supervised learning techniques. In this study, a new method of splice-site prediction using DNNs was proposed. The proposed system receives input sequence data and returns an answer as to whether it is a splice site. The input is 140 nucleotides long, with the consensus sequence (i.e., "GT" and "AG" for the donor and acceptor sites, respectively) in the middle. Each input sequence is passed to the pretrained DNN model, which determines the probability that the input is a splice site. The model consists of convolutional layers and bidirectional long short-term memory network layers. The pretraining and validation were conducted using the data set tested in previously reported methods. The performance evaluation results showed that the proposed method can outperform the previous methods. In addition, the patterns learned by the DNNs were visualized as position frequency matrices (PFMs). Some of the PFMs were very similar to the consensus sequence. The trained DNN model and the brief source code for the prediction system are uploaded. Further improvement will be achieved following the further development of DNNs.
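The input layout described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names are hypothetical, and the exact offset of the dinucleotide inside the 140-nucleotide window (69 flanking bases on each side here) is an assumption, since the abstract only says the consensus sits in the middle.

```python
# Hypothetical sketch: cut 140-nt candidate windows around a consensus
# dinucleotide ("GT" for donors, "AG" for acceptors) and one-hot encode them.
WINDOW = 140
LEFT = 69  # assumed number of bases kept upstream of the dinucleotide


def candidate_windows(seq, motif="GT"):
    """Yield (motif_start, window) for each motif occurrence with full flanks."""
    seq = seq.upper()
    right = WINDOW - LEFT - len(motif)  # bases kept downstream of the motif
    start = seq.find(motif)
    while start != -1:
        lo = start - LEFT
        hi = start + len(motif) + right
        if lo >= 0 and hi <= len(seq):
            yield start, seq[lo:hi]
        start = seq.find(motif, start + 1)


def one_hot(window):
    """Encode A/C/G/T as 4-element rows; any other symbol becomes all zeros."""
    index = {"A": 0, "C": 1, "G": 2, "T": 3}
    rows = []
    for base in window:
        row = [0, 0, 0, 0]
        if base in index:
            row[index[base]] = 1
        rows.append(row)
    return rows
```

Each encoded window is a 140 x 4 matrix, which is the kind of input a convolutional or recurrent classifier would score; occurrences too close to either end of the sequence are skipped rather than padded.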
Affiliation(s)
- Tatsuhiko Naito
  - Department of Neurology, Graduate School of Medicine, The University of Tokyo, Tokyo, Japan
|
22
|
Hybridization and polyploidy enable genomic plasticity without sex in the most devastating plant-parasitic nematodes. PLoS Genet 2017; 13:e1006777. [PMID: 28594822 PMCID: PMC5465968 DOI: 10.1371/journal.pgen.1006777] [Citation(s) in RCA: 116] [Impact Index Per Article: 14.5] [Received: 01/04/2017] [Accepted: 04/24/2017] [Indexed: 11/19/2022] Open
Abstract
Root-knot nematodes (genus Meloidogyne) exhibit a diversity of reproductive modes ranging from obligatory sexual to fully asexual reproduction. Intriguingly, the most widespread and devastating species to global agriculture are those that reproduce asexually, without meiosis. To disentangle this surprising parasitic success despite the absence of sex and genetic exchanges, we have sequenced and assembled the genomes of three obligatory ameiotic and asexual Meloidogyne. We have compared them to those of relatives able to perform meiosis and sexual reproduction. We show that the genomes of ameiotic asexual Meloidogyne are large, polyploid and made of duplicated regions with a high within-species average nucleotide divergence of ~8%. Phylogenomic analysis of the genes present in these duplicated regions suggests that they originated from multiple hybridization events and are thus homoeologs. We found that up to 22% of homoeologous gene pairs were under positive selection and these genes covered a wide spectrum of predicted functional categories. To biologically assess functional divergence, we compared expression patterns of homoeologous gene pairs across developmental life stages using an RNAseq approach in the most economically important asexually-reproducing nematode. We showed that >60% of homoeologous gene pairs display diverged expression patterns. These results suggest a substantial functional impact of the genome structure. Contrasting with high within-species nuclear genome divergence, mitochondrial genome divergence between the three ameiotic asexuals was very low, signifying that these putative hybrids share a recent common maternal ancestor. Transposable elements (TE) cover a ~1.7 times higher proportion of the genomes of the ameiotic asexual Meloidogyne compared to the sexual relative and might also participate in their plasticity. 
The intriguing parasitic success of asexually-reproducing Meloidogyne species could be partly explained by their TE-rich composite genomes, resulting from allopolyploidization events, and promoting plasticity and functional divergence between gene copies in the absence of sex and meiosis.
|
23
|
Cormier A, Avia K, Sterck L, Derrien T, Wucher V, Andres G, Monsoor M, Godfroy O, Lipinska A, Perrineau MM, Van De Peer Y, Hitte C, Corre E, Coelho SM, Cock JM. Re-annotation, improved large-scale assembly and establishment of a catalogue of noncoding loci for the genome of the model brown alga Ectocarpus. The New Phytologist 2017; 214:219-232. [PMID: 27870061 DOI: 10.1111/nph.14321] [Citation(s) in RCA: 45] [Impact Index Per Article: 5.6] [Received: 06/23/2016] [Accepted: 10/08/2016] [Indexed: 05/28/2023]
Abstract
The genome of the filamentous brown alga Ectocarpus was the first to be completely sequenced from within the brown algal group and has served as a key reference genome both for this lineage and for the stramenopiles. We present a complete structural and functional reannotation of the Ectocarpus genome. The large-scale assembly of the Ectocarpus genome was significantly improved and genome-wide gene re-annotation using extensive RNA-seq data improved the structure of 11 108 existing protein-coding genes and added 2030 new loci. A genome-wide analysis of splicing isoforms identified an average of 1.6 transcripts per locus. A large number of previously undescribed noncoding genes were identified and annotated, including 717 loci that produce long noncoding RNAs. Conservation of lncRNAs between Ectocarpus and another brown alga, the kelp Saccharina japonica, suggests that at least a proportion of these loci serve a function. Finally, a large collection of single nucleotide polymorphism-based markers was developed for genetic analyses. These resources are available through an updated and improved genome database. This study significantly improves the utility of the Ectocarpus genome as a high-quality reference for the study of many important aspects of brown algal biology and as a reference for genomic analyses across the stramenopiles.
Affiliation(s)
- Alexandre Cormier
  - Algal Genetics Group, CNRS, UMR 8227, Integrative Biology of Marine Models, Sorbonne Université, UPMC Univ Paris 06, Station Biologique de Roscoff, CS 90074, F-29688, Roscoff, France
- Komlan Avia
  - Algal Genetics Group, CNRS, UMR 8227, Integrative Biology of Marine Models, Sorbonne Université, UPMC Univ Paris 06, Station Biologique de Roscoff, CS 90074, F-29688, Roscoff, France
- Lieven Sterck
  - Department of Plant Systems Biology, VIB, B-9052, Ghent, Belgium
  - Department of Plant Biotechnology and Bioinformatics, Ghent University, B-9000, Ghent, Belgium
  - Bioinformatics Institute Ghent, Technologiepark 927, 9052, Ghent, Belgium
- Gwendoline Andres
  - Abims Platform, CNRS-UPMC, FR2424, Station Biologique de Roscoff, CS 90074, 29688, Roscoff, France
- Misharl Monsoor
  - Abims Platform, CNRS-UPMC, FR2424, Station Biologique de Roscoff, CS 90074, 29688, Roscoff, France
- Olivier Godfroy
  - Algal Genetics Group, CNRS, UMR 8227, Integrative Biology of Marine Models, Sorbonne Université, UPMC Univ Paris 06, Station Biologique de Roscoff, CS 90074, F-29688, Roscoff, France
- Agnieszka Lipinska
  - Algal Genetics Group, CNRS, UMR 8227, Integrative Biology of Marine Models, Sorbonne Université, UPMC Univ Paris 06, Station Biologique de Roscoff, CS 90074, F-29688, Roscoff, France
- Marie-Mathilde Perrineau
  - Algal Genetics Group, CNRS, UMR 8227, Integrative Biology of Marine Models, Sorbonne Université, UPMC Univ Paris 06, Station Biologique de Roscoff, CS 90074, F-29688, Roscoff, France
- Yves Van De Peer
  - Department of Plant Systems Biology, VIB, B-9052, Ghent, Belgium
  - Department of Plant Biotechnology and Bioinformatics, Ghent University, B-9000, Ghent, Belgium
  - Bioinformatics Institute Ghent, Technologiepark 927, 9052, Ghent, Belgium
  - Department of Genetics, Genomics Research Institute, University of Pretoria, 0028, Pretoria, South Africa
- Erwan Corre
  - Abims Platform, CNRS-UPMC, FR2424, Station Biologique de Roscoff, CS 90074, 29688, Roscoff, France
- Susana M Coelho
  - Algal Genetics Group, CNRS, UMR 8227, Integrative Biology of Marine Models, Sorbonne Université, UPMC Univ Paris 06, Station Biologique de Roscoff, CS 90074, F-29688, Roscoff, France
- J Mark Cock
  - Algal Genetics Group, CNRS, UMR 8227, Integrative Biology of Marine Models, Sorbonne Université, UPMC Univ Paris 06, Station Biologique de Roscoff, CS 90074, F-29688, Roscoff, France
|
24
|
Alt-Splice Gene Predictor Using Multitrack-Clique Analysis: Verification of Statistical Support for Modelling in Genomes of Multicellular Eukaryotes. Informatics 2017. [DOI: 10.3390/informatics4010003] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/16/2022] Open
|
25
|
Meher PK, Sahu TK, Rao AR, Wahi SD. A computational approach for prediction of donor splice sites with improved accuracy. J Theor Biol 2016; 404:285-294. [PMID: 27302911 DOI: 10.1016/j.jtbi.2016.06.013] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 11/18/2015] [Revised: 04/18/2016] [Accepted: 06/09/2016] [Indexed: 11/24/2022]
Abstract
Identification of splice sites is important due to their key role in predicting the exon-intron structure of protein coding genes. Though several approaches have been developed for the prediction of splice sites, further improvement in the prediction accuracy will help predict gene structure more accurately. This paper presents a computational approach for prediction of donor splice sites with higher accuracy. In this approach, true and false splice sites were first encoded into numeric vectors and then used as input to an artificial neural network (ANN), a support vector machine (SVM) and a random forest (RF) for prediction. ANN and SVM were found to perform equally well, and better than RF, when tested on the HS3D and NN269 datasets. Further, the performance of ANN, SVM and RF was analyzed using an independent test set of 50 genes, and the prediction accuracy of ANN was found to be higher than that of SVM and RF. All the predictors achieved higher accuracy when compared with existing methods such as NNsplice, MEM, MDD, WMM, MM1, FSPLICE, GeneID and ASSP, using the independent test set. We have also developed an online prediction server (PreDOSS), available at http://cabgrid.res.in:8080/predoss, for prediction of donor splice sites using the proposed approach.
Affiliation(s)
- Prabina Kumar Meher
  - ICAR-Indian Agricultural Statistics Research Institute, New Delhi 110012, India.
- Tanmaya Kumar Sahu
  - ICAR-Indian Agricultural Statistics Research Institute, New Delhi 110012, India.
- A R Rao
  - ICAR-Indian Agricultural Statistics Research Institute, New Delhi 110012, India.
- S D Wahi
  - ICAR-Indian Agricultural Statistics Research Institute, New Delhi 110012, India.
|
26
|
Pérez-Rodríguez J, García-Pedrajas N. Stepwise approach for combining many sources of evidence for site-recognition in genomic sequences. BMC Bioinformatics 2016; 17:117. [PMID: 26945666 PMCID: PMC4779560 DOI: 10.1186/s12859-016-0968-y] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 02/19/2015] [Accepted: 02/22/2016] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Recognizing the different functional parts of genes, such as promoters, translation initiation sites, donors, acceptors and stop codons, is a fundamental task of many current studies in Bioinformatics. Currently, the most successful methods use powerful classifiers, such as support vector machines with various string kernels. However, with the rapid evolution of our ability to collect genomic information, it has been shown that combining many sources of evidence is fundamental to the success of any recognition task. With the advent of next-generation sequencing, the number of available genomes is increasing very rapidly. Thus, methods for making use of such large amounts of information are needed. RESULTS In this paper, we present a methodology for combining tens or even hundreds of different classifiers for an improved performance. Our approach can include almost a limitless number of sources of evidence. We can use the evidence for the prediction of sites in a certain species, such as human, or other species as needed. This approach can be used for any of the functional recognition tasks cited above. However, to provide the necessary focus, we have tested our approach in two functional recognition tasks: translation initiation site and stop codon recognition. We have used the entire human genome as a target and another 20 species as sources of evidence and tested our method on five different human chromosomes. The proposed method achieves better accuracy than the best state-of-the-art method both in terms of the geometric mean of the specificity and sensitivity and the area under the receiver operating characteristic and precision recall curves. Furthermore, our approach shows a more principled way for selecting the best genomes to be combined for a given recognition task. 
CONCLUSIONS Our approach has proven to be a powerful tool for improving the performance of functional site recognition, and it is a useful method for combining many sources of evidence for any recognition task in Bioinformatics. The results also show that the common approach of heuristically choosing the species to be used as source of evidence can be improved because the best combinations of genomes for recognition were those not usually selected. Although the experiments were performed for translation initiation site and stop codon recognition, any other recognition task may benefit from our methodology.
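One of the evaluation measures named in this abstract, the geometric mean of specificity and sensitivity, can be computed from confusion-matrix counts as in the sketch below. The function name and the example counts are illustrative assumptions and are not taken from the paper.

```python
import math


def g_mean(tp, fn, tn, fp):
    """Geometric mean of sensitivity (TP rate) and specificity (TN rate)."""
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative example")
    sensitivity = tp / (tp + fn)  # fraction of true sites recovered
    specificity = tn / (tn + fp)  # fraction of non-sites correctly rejected
    return math.sqrt(sensitivity * specificity)


# Made-up counts for illustration: 80 of 100 true sites found,
# 90 of 100 decoys rejected.
score = g_mean(tp=80, fn=20, tn=90, fp=10)
```

Unlike plain accuracy, this measure drops to zero when either class is entirely misclassified, which is why it is often preferred for site-recognition tasks where true sites are far rarer than decoys.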
Affiliation(s)
- Javier Pérez-Rodríguez
  - Department of Computing and Numerical Analysis, University of Córdoba, Córdoba, 14071, Campus de Rabanales, Spain.
- Nicolás García-Pedrajas
  - Department of Computing and Numerical Analysis, University of Córdoba, Córdoba, 14071, Campus de Rabanales, Spain.
|
27
|
The genome of the seagrass Zostera marina reveals angiosperm adaptation to the sea. Nature 2016; 530:331-5. [PMID: 26814964 DOI: 10.1038/nature16548] [Citation(s) in RCA: 318] [Impact Index Per Article: 35.3] [Received: 05/22/2015] [Accepted: 12/18/2015] [Indexed: 11/09/2022]
Abstract
Seagrasses colonized the sea on at least three independent occasions to form the basis of one of the most productive and widespread coastal ecosystems on the planet. Here we report the genome of Zostera marina (L.), the first, to our knowledge, marine angiosperm to be fully sequenced. This reveals unique insights into the genomic losses and gains involved in achieving the structural and physiological adaptations required for its marine lifestyle, arguably the most severe habitat shift ever accomplished by flowering plants. Key angiosperm innovations that were lost include the entire repertoire of stomatal genes, genes involved in the synthesis of terpenoids and ethylene signalling, and genes for ultraviolet protection and phytochromes for far-red sensing. Seagrasses have also regained functions enabling them to adjust to full salinity. Their cell walls contain all of the polysaccharides typical of land plants, but also contain polyanionic, low-methylated pectins and sulfated galactans, a feature shared with the cell walls of all macroalgae and that is important for ion homoeostasis, nutrient uptake and O2/CO2 exchange through leaf epidermal cells. The Z. marina genome resource will markedly advance a wide range of functional ecological studies from adaptation of marine ecosystems under climate warming, to unravelling the mechanisms of osmoregulation under high salinities that may further inform our understanding of the evolution of salt tolerance in crop plants.
|
28
|
Chiapello H, Mallet L, Guérin C, Aguileta G, Amselem J, Kroj T, Ortega-Abboud E, Lebrun MH, Henrissat B, Gendrault A, Rodolphe F, Tharreau D, Fournier E. Deciphering Genome Content and Evolutionary Relationships of Isolates from the Fungus Magnaporthe oryzae Attacking Different Host Plants. Genome Biol Evol 2015; 7:2896-912. [PMID: 26454013 PMCID: PMC4684704 DOI: 10.1093/gbe/evv187] [Citation(s) in RCA: 68] [Impact Index Per Article: 6.8] [Indexed: 01/25/2023] Open
Abstract
Deciphering the genetic bases of pathogen adaptation to its host is a key question in ecology and evolution. To understand how the fungus Magnaporthe oryzae adapts to different plants, we sequenced eight M. oryzae isolates differing in host specificity (rice, foxtail millet, wheat, and goosegrass), and one Magnaporthe grisea isolate specific of crabgrass. Analysis of Magnaporthe genomes revealed small variation in genome sizes (39–43 Mb) and gene content (12,283–14,781 genes) between isolates. The whole set of Magnaporthe genes comprised 14,966 shared families, 63% of which included genes present in all the nine M. oryzae genomes. The evolutionary relationships among Magnaporthe isolates were inferred using 6,878 single-copy orthologs. The resulting genealogy was mostly bifurcating among the different host-specific lineages, but was reticulate inside the rice lineage. We detected traces of introgression from a nonrice genome in the rice reference 70-15 genome. Among M. oryzae isolates and host-specific lineages, the genome composition in terms of frequencies of genes putatively involved in pathogenicity (effectors, secondary metabolism, cazome) was conserved. However, 529 shared families were found only in nonrice lineages, whereas the rice lineage possessed 86 specific families absent from the nonrice genomes. Our results confirmed that the host specificity of M. oryzae isolates was associated with a divergence between lineages without major gene flow and that, despite the strong conservation of gene families between lineages, adaptation to different hosts, especially to rice, was associated with the presence of a small number of specific gene families. All information was gathered in a public database (http://genome.jouy.inra.fr/gemo).
Affiliation(s)
- Hélène Chiapello
  - INRA, UR 1404, Unité Mathématiques et Informatique Appliquées du Génome à l'Environnement, Jouy-en-Josas, France
  - INRA, UR 875, Unité Mathématiques et Informatique Appliquées de Toulouse, Castanet-Tolosan, France
- Ludovic Mallet
  - INRA, UR 1404, Unité Mathématiques et Informatique Appliquées du Génome à l'Environnement, Jouy-en-Josas, France
  - INRA, UR 875, Unité Mathématiques et Informatique Appliquées de Toulouse, Castanet-Tolosan, France
  - INRA, UR 1164, Unité de Recherche Génomique Info, Versailles, France
- Cyprien Guérin
  - INRA, UR 1404, Unité Mathématiques et Informatique Appliquées du Génome à l'Environnement, Jouy-en-Josas, France
- Gabriela Aguileta
  - CNRS, UMR 8079, Ecologie, Systématique et Evolution, Université Paris-Sud, Orsay, France
  - Center for Genomic Regulation, Barcelona, Spain
- Joëlle Amselem
  - INRA, UR 1164, Unité de Recherche Génomique Info, Versailles, France
- Thomas Kroj
  - INRA, UMR 385, Biologie et Génétique des Interactions Plantes-Pathogènes BGPI, INRA-CIRAD-Montpellier SupAgro, Campus International de Baillarguet, Montpellier, France
- Enrique Ortega-Abboud
  - CIRAD, UMR 385, Biologie et Génétique des Interactions Plantes-Pathogènes BGPI, INRA-CIRAD-Montpellier SupAgro, Campus International de Baillarguet, Montpellier, France
- Marc-Henri Lebrun
  - INRA-AgroParisTech, UMR 1190, Biologie et Gestion des Risques en Agriculture BIOGER-CPP, Campus AgroParisTech, Thiverval-Grignon, France
- Bernard Henrissat
  - Architecture et Fonction des Macromolécules Biologiques, Université d'Aix Marseille, France
  - Department of Biological Sciences, King Abdulaziz University, Jeddah, Saudi Arabia
- Annie Gendrault
  - INRA, UR 1404, Unité Mathématiques et Informatique Appliquées du Génome à l'Environnement, Jouy-en-Josas, France
- François Rodolphe
  - INRA, UR 1404, Unité Mathématiques et Informatique Appliquées du Génome à l'Environnement, Jouy-en-Josas, France
- Didier Tharreau
  - CIRAD, UMR 385, Biologie et Génétique des Interactions Plantes-Pathogènes BGPI, INRA-CIRAD-Montpellier SupAgro, Campus International de Baillarguet, Montpellier, France
- Elisabeth Fournier
  - INRA, UMR 385, Biologie et Génétique des Interactions Plantes-Pathogènes BGPI, INRA-CIRAD-Montpellier SupAgro, Campus International de Baillarguet, Montpellier, France
|
29
|
Survey of Programs Used to Detect Alternative Splicing Isoforms from Deep Sequencing Data In Silico. BioMed Research International 2015; 2015:831352. [PMID: 26421304 PMCID: PMC4573434 DOI: 10.1155/2015/831352] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.6] [Received: 11/26/2014] [Revised: 02/17/2015] [Accepted: 03/02/2015] [Indexed: 11/29/2022]
Abstract
Next-generation sequencing techniques have been rapidly emerging. However, the massive sequencing reads hide a great deal of unknown important information. Advances have enabled researchers to discover alternative splicing (AS) sites and isoforms using computational approaches instead of molecular experiments. Given the importance of AS for gene expression and protein diversity in eukaryotes, detecting alternative splicing and isoforms represents a hot topic in systems biology and epigenetics research. The computational methods applied to AS prediction have improved since the emergence of next-generation sequencing. In this study, we introduce state-of-the-art research on AS and then compare the research methods and software tools available for AS based on next-generation sequencing reads. Finally, we discuss the prospects of computational methods related to AS.
|
30
|
Morel G, Sterck L, Swennen D, Marcet-Houben M, Onesime D, Levasseur A, Jacques N, Mallet S, Couloux A, Labadie K, Amselem J, Beckerich JM, Henrissat B, Van de Peer Y, Wincker P, Souciet JL, Gabaldón T, Tinsley CR, Casaregola S. Differential gene retention as an evolutionary mechanism to generate biodiversity and adaptation in yeasts. Sci Rep 2015; 5:11571. [PMID: 26108467 PMCID: PMC4479816 DOI: 10.1038/srep11571] [Citation(s) in RCA: 43] [Impact Index Per Article: 4.3] [Received: 01/23/2015] [Accepted: 05/29/2015] [Indexed: 12/13/2022] Open
Abstract
The evolutionary history of the characters underlying the adaptation of microorganisms to food and biotechnological uses is poorly understood. We undertook comparative genomics to investigate evolutionary relationships of the dairy yeast Geotrichum candidum within Saccharomycotina. Surprisingly, a remarkable proportion of genes showed discordant phylogenies, clustering with the filamentous fungus subphylum (Pezizomycotina), rather than the yeast subphylum (Saccharomycotina), of the Ascomycota. These genes appear not to be the result of Horizontal Gene Transfer (HGT), but to have been specifically retained by G. candidum after the filamentous fungi-yeasts split concomitant with the yeasts' genome contraction. We refer to these genes as SRAGs (Specifically Retained Ancestral Genes), having been lost by all or nearly all other yeasts, and thus contributing to the phenotypic specificity of lineages. SRAG functions include lipases consistent with a role in cheese making and novel endoglucanases associated with degradation of plant material. Similar gene retention was observed in three other distantly related yeasts representative of this ecologically diverse subphylum. The phenomenon thus appears to be widespread in the Saccharomycotina and argues that, alongside neo-functionalization following gene duplication and HGT, specific gene retention must be recognized as an important mechanism for generation of biodiversity and adaptation in yeasts.
Affiliation(s)
- Guillaume Morel
  - INRA UMR1319, Micalis Institute, CIRM-Levures, 78850 F-Thiverval-Grignon, France
  - AgroParisTech UMR1319, Micalis Institute, 78850 F-Thiverval-Grignon, France
- Lieven Sterck
  - Department of Plant Systems Biology VIB, Technologiepark 927, 9052 Gent, Belgium
  - Department of Plant Biotechnology and Bioinformatics, Ghent University, Technologiepark 927, 9052 Gent, Belgium
- Dominique Swennen
  - INRA UMR1319, Micalis Institute, CIRM-Levures, 78850 F-Thiverval-Grignon, France
  - AgroParisTech UMR1319, Micalis Institute, 78850 F-Thiverval-Grignon, France
- Marina Marcet-Houben
  - Bioinformatics and Genomics Programme, Centre for Genomic Regulation, Dr. Aiguader 88, Barcelona 08003, Spain
  - Universitat Pompeu Fabra (UPF), Barcelona 08003, Spain
- Djamila Onesime
  - INRA UMR1319, Micalis Institute, CIRM-Levures, 78850 F-Thiverval-Grignon, France
  - AgroParisTech UMR1319, Micalis Institute, 78850 F-Thiverval-Grignon, France
- Anthony Levasseur
  - INRA UMR1163, Biotechnologie des Champignons Filamenteux, Aix-Marseille Université, Polytech Marseille, 163 avenue de Luminy, CP 925, 13288 Marseille Cedex 09, France
- Noémie Jacques
  - INRA UMR1319, Micalis Institute, CIRM-Levures, 78850 F-Thiverval-Grignon, France
  - AgroParisTech UMR1319, Micalis Institute, 78850 F-Thiverval-Grignon, France
- Sandrine Mallet
  - INRA UMR1319, Micalis Institute, CIRM-Levures, 78850 F-Thiverval-Grignon, France
  - AgroParisTech UMR1319, Micalis Institute, 78850 F-Thiverval-Grignon, France
- Arnaux Couloux
  - CEA, Institut de Génomique, Genoscope, 2 Rue Gaston Crémieux, Évry F-91000, France
- Karine Labadie
  - CEA, Institut de Génomique, Genoscope, 2 Rue Gaston Crémieux, Évry F-91000, France
- Joëlle Amselem
  - INRA UR1164, Unité de Recherche Génomique – Info, 78000 Versailles, France
- Jean-Marie Beckerich
  - INRA UMR1319, Micalis Institute, CIRM-Levures, 78850 F-Thiverval-Grignon, France
  - AgroParisTech UMR1319, Micalis Institute, 78850 F-Thiverval-Grignon, France
- Yves Van de Peer
  - Department of Plant Systems Biology VIB, Technologiepark 927, 9052 Gent, Belgium
  - Department of Plant Biotechnology and Bioinformatics, Ghent University, Technologiepark 927, 9052 Gent, Belgium
  - Genomics Research Institute, University of Pretoria, Hatfield Campus, Pretoria 0028, South Africa
- Patrick Wincker
  - CEA, Institut de Génomique, Genoscope, 2 Rue Gaston Crémieux, Évry F-91000, France
  - CNRS UMR 8030, 2 Rue Gaston Crémieux, Évry, 91000, France
  - Université d’Evry, Bd François Mitterand, Evry, 91025, France
- Jean-Luc Souciet
  - Université de Strasbourg, CNRS UMR7156, Strasbourg, 67000, France
- Toni Gabaldón
  - Bioinformatics and Genomics Programme, Centre for Genomic Regulation, Dr. Aiguader 88, Barcelona 08003, Spain
  - Universitat Pompeu Fabra (UPF), Barcelona 08003, Spain
- Colin R. Tinsley
  - INRA UMR1319, Micalis Institute, CIRM-Levures, 78850 F-Thiverval-Grignon, France
  - AgroParisTech UMR1319, Micalis Institute, 78850 F-Thiverval-Grignon, France
- Serge Casaregola
  - INRA UMR1319, Micalis Institute, CIRM-Levures, 78850 F-Thiverval-Grignon, France
  - AgroParisTech UMR1319, Micalis Institute, 78850 F-Thiverval-Grignon, France
|
31
|
Mandal I. A novel approach for accurate identification of splice junctions based on hybrid algorithms. J Biomol Struct Dyn 2015; 33:1281-90. [DOI: 10.1080/07391102.2014.944218] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/24/2022]
|
32
|
A novel approach for predicting DNA splice junctions using hybrid machine learning algorithms. Soft Comput 2014. [DOI: 10.1007/s00500-014-1550-z] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Indexed: 11/25/2022]
|
33
|
Blanc-Mathieu R, Verhelst B, Derelle E, Rombauts S, Bouget FY, Carré I, Château A, Eyre-Walker A, Grimsley N, Moreau H, Piégu B, Rivals E, Schackwitz W, Van de Peer Y, Piganeau G. An improved genome of the model marine alga Ostreococcus tauri unfolds by assessing Illumina de novo assemblies. BMC Genomics 2014; 15:1103. [PMID: 25494611 PMCID: PMC4378021 DOI: 10.1186/1471-2164-15-1103] [Citation(s) in RCA: 57] [Impact Index Per Article: 5.2] [Received: 06/18/2014] [Accepted: 11/19/2014] [Indexed: 12/17/2022] Open
Abstract
BACKGROUND Cost effective next generation sequencing technologies now enable the production of genomic datasets for many novel planktonic eukaryotes, representing an understudied reservoir of genetic diversity. O. tauri is the smallest free-living photosynthetic eukaryote known to date, a coccoid green alga that was first isolated in 1995 in a lagoon by the Mediterranean sea. Its simple features, ease of culture and the sequencing of its 13 Mb haploid nuclear genome have promoted this microalga as a new model organism for cell biology. Here, we investigated the quality of genome assemblies of Illumina GAIIx 75 bp paired-end reads from Ostreococcus tauri, thereby also improving the existing assembly and showing the genome to be stably maintained in culture. RESULTS The 3 assemblers used, ABySS, CLCBio and Velvet, produced 95% complete genomes in 1402 to 2080 scaffolds with a very low rate of misassembly. Reciprocally, these assemblies improved the original genome assembly by filling in 930 gaps. Combined with additional analysis of raw reads and PCR sequencing effort, 1194 gaps have been solved in total adding up to 460 kb of sequence. Mapping of RNAseq Illumina data on this updated genome led to a twofold reduction in the proportion of multi-exon protein coding genes, representing 19% of the total 7699 protein coding genes. The comparison of the DNA extracted in 2001 and 2009 revealed the fixation of 8 single nucleotide substitutions and 2 deletions during the approximately 6000 generations in the lab. The deletions either knocked out or truncated two predicted transmembrane proteins, including a glutamate-receptor like gene. CONCLUSION High coverage (>80 fold) paired-end Illumina sequencing enables a high quality 95% complete genome assembly of a compact ~13 Mb haploid eukaryote. This genome sequence has remained stable for 6000 generations of lab culture.
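The figures quoted in this abstract (a roughly 13 Mb genome, 75 bp reads, more than 80-fold coverage) can be related through the standard mean-coverage formula, coverage = reads x read length / genome size. The sketch below is only a back-of-the-envelope illustration: the function names are hypothetical, 80 is used as a lower bound because the abstract says ">80 fold", and the resulting read count is not a number reported by the study.

```python
import math


def mean_coverage(n_reads, read_length_bp, genome_size_bp):
    """Average sequencing depth: total sequenced bases over genome size."""
    return n_reads * read_length_bp / genome_size_bp


def reads_for_coverage(genome_size_bp, read_length_bp, coverage):
    """Smallest number of reads whose mean coverage reaches the target."""
    if genome_size_bp <= 0 or read_length_bp <= 0 or coverage < 0:
        raise ValueError("sizes must be positive and coverage non-negative")
    return math.ceil(coverage * genome_size_bp / read_length_bp)


# Lower bound implied by the abstract's figures (each mate of a pair
# counts as one read here).
min_reads = reads_for_coverage(13_000_000, 75, 80)
```

Under these assumptions, on the order of 14 million 75 bp reads are needed to reach 80-fold depth on a 13 Mb genome, which conveys the scale of data behind the assemblies described.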
Electronic supplementary material The online version of this article (doi:10.1186/1471-2164-15-1103) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Gwenaël Piganeau
- CNRS, UMR 7232, Observatoire Océanologique, Avenue du Fontaulé, BP44, 66650 Banyuls-sur-Mer, France.
|
34
|
Ahmed S, Cock JM, Pessia E, Luthringer R, Cormier A, Robuchon M, Sterck L, Peters AF, Dittami SM, Corre E, Valero M, Aury JM, Roze D, Van de Peer Y, Bothwell J, Marais GAB, Coelho SM. A haploid system of sex determination in the brown alga Ectocarpus sp. Curr Biol 2014; 24:1945-57. [PMID: 25176635 DOI: 10.1016/j.cub.2014.07.042] [Citation(s) in RCA: 99] [Impact Index Per Article: 9.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/11/2013] [Revised: 02/11/2014] [Accepted: 07/15/2014] [Indexed: 11/15/2022]
Abstract
BACKGROUND A common feature of most genetic sex-determination systems studied so far is that sex is determined by nonrecombining genomic regions, which can be of various sizes depending on the species. These regions have evolved independently and repeatedly across diverse groups. A number of such sex-determining regions (SDRs) have been studied in animals, plants, and fungi, but very little is known about the evolution of sexes in other eukaryotic lineages. RESULTS We report here the sequencing and genomic analysis of the SDR of Ectocarpus, a brown alga that has been evolving independently from plants, animals, and fungi for over one giga-annum. In Ectocarpus, sex is expressed during the haploid phase of the life cycle, and both the female (U) and the male (V) sex chromosomes contain nonrecombining regions. The U and V of this species have been diverging for more than 70 mega-annum, yet gene degeneration has been modest, and the SDR is relatively small, with no evidence for evolutionary strata. These features may be explained by the occurrence of strong purifying selection during the haploid phase of the life cycle and the low level of sexual dimorphism. V is dominant over U, suggesting that femaleness may be the default state, adopted when the male haplotype is absent. CONCLUSIONS The Ectocarpus UV system has clearly had a distinct evolutionary trajectory not only to the well-studied XY and ZW systems but also to the UV systems described so far. Nonetheless, some striking similarities exist, indicating remarkable universality of the underlying processes shaping sex chromosome evolution across distant lineages.
Affiliation(s)
- Sophia Ahmed
- Integrative Biology of Marine Models, CNRS UMR 8227, Sorbonne Universités, UPMC Université Paris 6, Station Biologique de Roscoff, CS 90074, 29688 Roscoff, France; Medical Biology Centre, Queens University Belfast, Belfast BT9 7BL, Northern Ireland, UK
| | - J Mark Cock
- Integrative Biology of Marine Models, CNRS UMR 8227, Sorbonne Universités, UPMC Université Paris 6, Station Biologique de Roscoff, CS 90074, 29688 Roscoff, France
| | - Eugenie Pessia
- Laboratoire de Biométrie et Biologie Évolutive, UMR 5558, Centre National de la Recherche Scientifique, Université Lyon 1, 69622 Villeurbanne, France
| | - Remy Luthringer
- Integrative Biology of Marine Models, CNRS UMR 8227, Sorbonne Universités, UPMC Université Paris 6, Station Biologique de Roscoff, CS 90074, 29688 Roscoff, France
| | - Alexandre Cormier
- Integrative Biology of Marine Models, CNRS UMR 8227, Sorbonne Universités, UPMC Université Paris 6, Station Biologique de Roscoff, CS 90074, 29688 Roscoff, France
| | - Marine Robuchon
- Integrative Biology of Marine Models, CNRS UMR 8227, Sorbonne Universités, UPMC Université Paris 6, Station Biologique de Roscoff, CS 90074, 29688 Roscoff, France; Evolutionary Biology and Ecology of Algae, CNRS UMI 3604, Sorbonne Université, UPMC, PUCCh, UACH, Station Biologique de Roscoff, CS 90074, 29688 Roscoff, France
| | - Lieven Sterck
- Department of Plant Systems Biology (VIB) and Department of Plant Biotechnology and Bioinformatics (Ghent University), Technologiepark 927, 9052 Gent, Belgium
| | | | - Simon M Dittami
- Integrative Biology of Marine Models, CNRS UMR 8227, Sorbonne Universités, UPMC Université Paris 6, Station Biologique de Roscoff, CS 90074, 29688 Roscoff, France
| | - Erwan Corre
- ABiMS Platform, FR2424, Station Biologique de Roscoff, CS 90074, 29688 Roscoff, France
| | - Myriam Valero
- Evolutionary Biology and Ecology of Algae, CNRS UMI 3604, Sorbonne Université, UPMC, PUCCh, UACH, Station Biologique de Roscoff, CS 90074, 29688 Roscoff, France
| | - Jean-Marc Aury
- Commissariat à l'Energie Atomique (CEA), Institut de Génomique (IG), Genoscope, 91000 Evry, France
| | - Denis Roze
- Evolutionary Biology and Ecology of Algae, CNRS UMI 3604, Sorbonne Université, UPMC, PUCCh, UACH, Station Biologique de Roscoff, CS 90074, 29688 Roscoff, France
| | - Yves Van de Peer
- Department of Plant Systems Biology (VIB) and Department of Plant Biotechnology and Bioinformatics (Ghent University), Technologiepark 927, 9052 Gent, Belgium; Genomics Research Institute, University of Pretoria, Hatfield Campus, Pretoria 0028, South Africa
| | - John Bothwell
- Medical Biology Centre, Queens University Belfast, Belfast BT9 7BL, Northern Ireland, UK
| | - Gabriel A B Marais
- Laboratoire de Biométrie et Biologie Évolutive, UMR 5558, Centre National de la Recherche Scientifique, Université Lyon 1, 69622 Villeurbanne, France
| | - Susana M Coelho
- Integrative Biology of Marine Models, CNRS UMR 8227, Sorbonne Universités, UPMC Université Paris 6, Station Biologique de Roscoff, CS 90074, 29688 Roscoff, France.
|
35
|
Lo C, Kakaradov B, Lokshtanov D, Boucher C. SeeSite: Characterizing Relationships between Splice Junctions and Splicing Enhancers. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2014; 11:648-656. [PMID: 26356335 DOI: 10.1109/tcbb.2014.2304294] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/05/2023]
Abstract
RNA splicing is a cellular process driven by the interaction between numerous regulatory sequences and binding sites. However, such interactions have been primarily explored by laboratory methods, since computational tools largely ignore the relationship between different splicing elements. Current computational methods identify either splice sites or other regulatory sequences, such as enhancers and silencers. We present a novel approach for characterizing co-occurring relationships between splice site motifs and splicing enhancers. Our approach relies on an efficient algorithm for approximately solving Consensus Sequence with Outliers, an NP-complete string clustering problem. In particular, we give an algorithm for this problem that outputs near-optimal solutions in polynomial time. To our knowledge, this is the first formulation and computational attempt for detecting co-occurring sequence elements in RNA sequence data. Further, we demonstrate that SeeSite is capable of showing that certain exonic splicing enhancers (ESEs) are preferentially associated with weaker splice sites, and that there exists a co-occurrence relationship with splice site motifs.
|
36
|
Pérez-Rodríguez J, Arroyo-Peña AG, García-Pedrajas N. Improving translation initiation site and stop codon recognition by using more than two classes. Bioinformatics 2014; 30:2702-8. [PMID: 24903421 DOI: 10.1093/bioinformatics/btu369] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION The recognition of translation initiation sites and stop codons is a fundamental part of any gene recognition program. Currently, the most successful methods use powerful classifiers, such as support vector machines with various string kernels. These methods all use two classes, one of positive instances and another one of negative instances that are constructed using sequences from the whole genome. However, the features of the negative sequences differ depending on the position of the negative samples in the gene. There are differences depending on whether they are from exons, introns, intergenic regions or any other functional part of the genome. Thus, the positive class is fairly homogeneous, as all its sequences come from the same part of the gene, but the negative class is composed of different instances. The classifier suffers from this problem. In this article, we propose the training of different classifiers with different negative, more homogeneous, classes and the combination of these classifiers for improved accuracy. RESULTS The proposed method achieves better accuracy than the best state-of-the-art method, both in terms of the geometric mean of the specificity and sensitivity and the area under the receiver operating characteristic and precision-recall curves. The method is tested on the whole human genome. The results for recognizing both translation initiation sites and stop codons indicated improvements in the rates of both false-negative results (FN) and false-positive results (FP). On average, for translation initiation site recognition, the FN ratio was reduced by 30.2% and the FP ratio decreased by 10.9%. For stop codon prediction, FP were reduced by 41.4% and FN by 31.7%. AVAILABILITY AND IMPLEMENTATION The source code is licensed under the General Public License and is thus freely available. The datasets and source code can be obtained from http://cib.uco.es/site-recognition. CONTACT npedrajas@uco.es.
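As a reading aid for the evaluation measures named in this abstract, the following minimal sketch computes the geometric mean of sensitivity and specificity from confusion-matrix counts, and a relative reduction of an error ratio. The function names and all numbers in the usage note are hypothetical illustrations, not values or code from the cited article.

```python
import math


def sensitivity(tp: int, fn: int) -> float:
    """True-positive rate: fraction of real sites that were recognized."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def specificity(tn: int, fp: int) -> float:
    """True-negative rate: fraction of non-sites that were rejected."""
    return tn / (tn + fp) if (tn + fp) else 0.0


def g_mean(tp: int, fn: int, tn: int, fp: int) -> float:
    """Geometric mean of sensitivity and specificity."""
    return math.sqrt(sensitivity(tp, fn) * specificity(tn, fp))


def relative_reduction(before: float, after: float) -> float:
    """Percentage by which an error ratio dropped relative to its old value."""
    return 100.0 * (before - after) / before
```

With made-up counts, `g_mean(tp=90, fn=10, tn=80, fp=20)` is the square root of 0.9 × 0.8. A hypothetical false-negative ratio falling from 0.10 to 0.0698 corresponds to a 30.2% relative reduction, which is how a figure of that kind would typically be read.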
Affiliation(s)
- Javier Pérez-Rodríguez
- Department of Computing and Numerical Analysis, University of Córdoba, Campus Universitario de Rabanales, Edificio Einstein, Planta 3, 14071 Córdoba, Spain
| | - Alexis G Arroyo-Peña
- Department of Computing and Numerical Analysis, University of Córdoba, Campus Universitario de Rabanales, Edificio Einstein, Planta 3, 14071 Córdoba, Spain
| | - Nicolás García-Pedrajas
- Department of Computing and Numerical Analysis, University of Córdoba, Campus Universitario de Rabanales, Edificio Einstein, Planta 3, 14071 Córdoba, Spain
|
37
|
Zimmer AD, Lang D, Buchta K, Rombauts S, Nishiyama T, Hasebe M, Van de Peer Y, Rensing SA, Reski R. Reannotation and extended community resources for the genome of the non-seed plant Physcomitrella patens provide insights into the evolution of plant gene structures and functions. BMC Genomics 2013; 14:498. [PMID: 23879659 PMCID: PMC3729371 DOI: 10.1186/1471-2164-14-498] [Citation(s) in RCA: 136] [Impact Index Per Article: 11.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/06/2013] [Accepted: 07/19/2013] [Indexed: 11/24/2022] Open
Abstract
Background The moss Physcomitrella patens as a model species provides an important reference for early-diverging lineages of plants and the release of the genome in 2008 opened the doors to genome-wide studies. The usability of a reference genome greatly depends on the quality of the annotation and the availability of centralized community resources. Therefore, in the light of accumulating evidence for missing genes, fragmentary gene structures, false annotations and a low rate of functional annotations on the original release, we decided to improve the moss genome annotation. Results Here, we report the complete moss genome re-annotation (designated V1.6) incorporating the increased transcript availability from a multitude of developmental stages and tissue types. We demonstrate the utility of the improved P. patens genome annotation for comparative genomics and new extensions to the cosmoss.org resource as a central repository for this plant “flagship” genome. The structural annotation of 32,275 protein-coding genes results in 8387 additional loci including 1456 loci with known protein domains or homologs in Plantae. This is the first release to include information on transcript isoforms, suggesting alternative splicing events for at least 10.8% of the loci. Furthermore, this release now also provides information on non-protein-coding loci. Functional annotations were improved regarding quality and coverage, resulting in 58% annotated loci (previously: 41%) that comprise also 7200 additional loci with GO annotations. Access and manual curation of the functional and structural genome annotation is provided via the http://www.cosmoss.org model organism database. Conclusions Comparative analysis of gene structure evolution along the green plant lineage provides novel insights, such as a comparatively high number of loci with 5’-UTR introns in the moss. 
Comparative analysis of functional annotations reveals expansions of moss house-keeping and metabolic genes and further possibly adaptive, lineage-specific expansions and gains including at least 13% orphan genes.
Affiliation(s)
- Andreas D Zimmer
- Plant Biotechnology, Faculty of Biology, University of Freiburg, Schaenzlestrasse 1, 79104, Freiburg, Germany
|
38
|
Abstract
Plant pararetroviruses integrate serendipitously into their host genomes. The banana genome harbors integrated copies of banana streak virus (BSV) named endogenous BSV (eBSV) that are able to release infectious pararetrovirus. In this investigation, we characterized integrants of three BSV species-Goldfinger (eBSGFV), Imove (eBSImV), and Obino l'Ewai (eBSOLV)-in the seedy Musa balbisiana Pisang klutuk wulung (PKW) by studying their molecular structure, genomic organization, genomic landscape, and infectious capacity. All eBSVs exhibit extensive viral genome duplications and rearrangements. eBSV segregation analysis on an F1 population of PKW combined with fluorescent in situ hybridization analysis showed that eBSImV, eBSOLV, and eBSGFV are each present at a single locus. eBSOLV and eBSGFV contain two distinct alleles, whereas eBSImV has two structurally identical alleles. Genotyping of both eBSV and viral particles expressed in the progeny demonstrated that only one allele for each species is infectious. The infectious allele of eBSImV could not be identified since the two alleles are identical. Finally, we demonstrate that eBSGFV and eBSOLV are located on chromosome 1 and eBSImV is located on chromosome 2 of the reference Musa genome published recently. The structure and evolution of eBSVs suggest sequential integration into the plant genome, and haplotype divergence analysis confirms that the three loci display differential evolution. Based on our data, we propose a model for BSV integration and eBSV evolution in the Musa balbisiana genome. The mutual benefits of this unique host-pathogen association are also discussed.
|
39
|
Moreau H, Verhelst B, Couloux A, Derelle E, Rombauts S, Grimsley N, Van Bel M, Poulain J, Katinka M, Hohmann-Marriott MF, Piganeau G, Rouzé P, Da Silva C, Wincker P, Van de Peer Y, Vandepoele K. Gene functionalities and genome structure in Bathycoccus prasinos reflect cellular specializations at the base of the green lineage. Genome Biol 2012; 13:R74. [PMID: 22925495 PMCID: PMC3491373 DOI: 10.1186/gb-2012-13-8-r74] [Citation(s) in RCA: 112] [Impact Index Per Article: 8.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/11/2012] [Accepted: 08/24/2012] [Indexed: 12/18/2022] Open
Abstract
BACKGROUND Bathycoccus prasinos is an extremely small cosmopolitan marine green alga whose cells are covered with intricate spider's web patterned scales that develop within the Golgi cisternae before their transport to the cell surface. The objective of this work is to sequence and analyze its genome, and to present a comparative analysis with other known genomes of the green lineage. RESULTS Its small genome of 15 Mb consists of 19 chromosomes and lacks transposons. Although 70% of all B. prasinos genes share similarities with other Viridiplantae genes, up to 428 genes were probably acquired by horizontal gene transfer, mainly from other eukaryotes. Two chromosomes, one big and one small, are atypical, an unusual synapomorphic feature within the Mamiellales. Genes on these atypical outlier chromosomes show lower GC content and a significant fraction of putative horizontal gene transfer genes. Whereas the small outlier chromosome lacks colinearity with other Mamiellales and contains many unknown genes without homologs in other species, the big outlier shows a higher intron content, increased expression levels and a unique clustering pattern of housekeeping functionalities. Four gene families are highly expanded in B. prasinos, including sialyltransferases, sialidases, ankyrin repeats and zinc ion-binding genes, and we hypothesize that these genes are associated with the process of scale biogenesis. CONCLUSION The minimal genomes of the Mamiellophyceae provide a baseline for evolutionary and functional analyses of metabolic processes in green plants.
|
40
|
Abstract
Evolutionary genomics is a field that relies heavily upon comparing genomes, that is, the full complement of genes of one species with another. However, given a genome sequence and little else, as is now often the case, genes must first be found and annotated before downstream analyses can be done. Computational gene prediction techniques are brought to bear on the problem of constructing a genome annotation as manual annotation is extremely time-consuming and costly. This chapter reviews the methods by which the individual components of a typical gene structure are detected in genomic sequence and then discusses several popular statistical frameworks for integrated gene prediction on eukaryotic genome sequences.
Affiliation(s)
- Tyler Alioto
- Centro Nacional de Análisis Genómico, Barcelona, Spain.
|
41
|
Amselem J, Cuomo CA, van Kan JAL, Viaud M, Benito EP, Couloux A, Coutinho PM, de Vries RP, Dyer PS, Fillinger S, Fournier E, Gout L, Hahn M, Kohn L, Lapalu N, Plummer KM, Pradier JM, Quévillon E, Sharon A, Simon A, ten Have A, Tudzynski B, Tudzynski P, Wincker P, Andrew M, Anthouard V, Beever RE, Beffa R, Benoit I, Bouzid O, Brault B, Chen Z, Choquer M, Collémare J, Cotton P, Danchin EG, Da Silva C, Gautier A, Giraud C, Giraud T, Gonzalez C, Grossetete S, Güldener U, Henrissat B, Howlett BJ, Kodira C, Kretschmer M, Lappartient A, Leroch M, Levis C, Mauceli E, Neuvéglise C, Oeser B, Pearson M, Poulain J, Poussereau N, Quesneville H, Rascle C, Schumacher J, Ségurens B, Sexton A, Silva E, Sirven C, Soanes DM, Talbot NJ, Templeton M, Yandava C, Yarden O, Zeng Q, Rollins JA, Lebrun MH, Dickman M. Genomic analysis of the necrotrophic fungal pathogens Sclerotinia sclerotiorum and Botrytis cinerea. PLoS Genet 2011; 7:e1002230. [PMID: 21876677 PMCID: PMC3158057 DOI: 10.1371/journal.pgen.1002230] [Citation(s) in RCA: 683] [Impact Index Per Article: 48.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/14/2011] [Accepted: 06/22/2011] [Indexed: 12/03/2022] Open
Abstract
Sclerotinia sclerotiorum and Botrytis cinerea are closely related necrotrophic plant pathogenic fungi notable for their wide host ranges and environmental persistence. These attributes have made these species models for understanding the complexity of necrotrophic, broad host-range pathogenicity. Despite their similarities, the two species differ in mating behaviour and the ability to produce asexual spores. We have sequenced the genomes of one strain of S. sclerotiorum and two strains of B. cinerea. The comparative analysis of these genomes relative to one another and to other sequenced fungal genomes is provided here. Their 38-39 Mb genomes include 11,860-14,270 predicted genes, which share 83% amino acid identity on average between the two species. We have mapped the S. sclerotiorum assembly to 16 chromosomes and found large-scale co-linearity with the B. cinerea genomes. Seven percent of the S. sclerotiorum genome comprises transposable elements compared to <1% of B. cinerea. The arsenal of genes associated with necrotrophic processes is similar between the species, including genes involved in plant cell wall degradation and oxalic acid production. Analysis of secondary metabolism gene clusters revealed an expansion in number and diversity of B. cinerea-specific secondary metabolites relative to S. sclerotiorum. The potential diversity in secondary metabolism might be involved in adaptation to specific ecological niches. Comparative genome analysis revealed the basis of differing sexual mating compatibility systems between S. sclerotiorum and B. cinerea. The organization of the mating-type loci differs, and their structures provide evidence for the evolution of heterothallism from homothallism. These data shed light on the evolutionary and mechanistic bases of the genetically complex traits of necrotrophic pathogenicity and sexual mating. 
This resource should facilitate the functional studies designed to better understand what makes these fungi such successful and persistent pathogens of agronomic crops.
Affiliation(s)
- Joelle Amselem
- Unité de Recherche Génomique – Info, UR1164, INRA, Versailles, France
- Biologie et Gestion des Risques en Agriculture – Champignons Pathogènes des Plantes, UR1290, INRA, Grignon, France
| | - Christina A. Cuomo
- Broad Institute of MIT and Harvard, Cambridge, Massachusetts, United States of America
| | - Jan A. L. van Kan
- Laboratory of Phytopathology, Wageningen University, Wageningen, The Netherlands
| | - Muriel Viaud
- Biologie et Gestion des Risques en Agriculture – Champignons Pathogènes des Plantes, UR1290, INRA, Grignon, France
| | - Ernesto P. Benito
- Departamento de Microbiología y Genética, Centro Hispano-Luso de Investigaciones Agrarias, Universidad de Salamanca, Salamanca, Spain
| | | | - Pedro M. Coutinho
- Architecture et Fonction des Macromolécules Biologiques, UMR6098, CNRS – Université de la Méditerranée et Université de Provence, Marseille, France
| | - Ronald P. de Vries
- Microbiology and Kluyver Centre for Genomics of Industrial Fermentations, Utrecht, The Netherlands
- CBS-KNAW Fungal Biodiversity Centre, Utrecht, The Netherlands
| | - Paul S. Dyer
- School of Biology, University of Nottingham, Nottingham, United Kingdom
| | - Sabine Fillinger
- Biologie et Gestion des Risques en Agriculture – Champignons Pathogènes des Plantes, UR1290, INRA, Grignon, France
| | - Elisabeth Fournier
- Biologie et Gestion des Risques en Agriculture – Champignons Pathogènes des Plantes, UR1290, INRA, Grignon, France
- Biologie et Génétique des Interactions Plante-Parasite, CIRAD – INRA – SupAgro, Montpellier, France
| | - Lilian Gout
- Biologie et Gestion des Risques en Agriculture – Champignons Pathogènes des Plantes, UR1290, INRA, Grignon, France
| | - Matthias Hahn
- Faculty of Biology, Kaiserslautern University, Kaiserslautern, Germany
| | - Linda Kohn
- Biology Department, University of Toronto, Mississauga, Canada
| | - Nicolas Lapalu
- Unité de Recherche Génomique – Info, UR1164, INRA, Versailles, France
| | - Kim M. Plummer
- Botany Department, La Trobe University, Melbourne, Australia
| | - Jean-Marc Pradier
- Biologie et Gestion des Risques en Agriculture – Champignons Pathogènes des Plantes, UR1290, INRA, Grignon, France
| | - Emmanuel Quévillon
- Unité de Recherche Génomique – Info, UR1164, INRA, Versailles, France
- Laboratoire de Génomique Fonctionnelle des Champignons Pathogènes de Plantes, UMR5240, Université de Lyon 1 – CNRS – BAYER S.A.S., Lyon, France
| | - Amir Sharon
- Department of Molecular Biology and Ecology of Plants, Tel Aviv University, Tel Aviv, Israel
| | - Adeline Simon
- Biologie et Gestion des Risques en Agriculture – Champignons Pathogènes des Plantes, UR1290, INRA, Grignon, France
| | - Arjen ten Have
- Instituto de Investigaciones Biologicas – CONICET, Universidad Nacional de Mar del Plata, Mar del Plata, Argentina
| | - Bettina Tudzynski
- Molekularbiologie und Biotechnologie der Pilze, Institut für Biologie und Biotechnologie der Pflanzen, Münster, Germany
| | - Paul Tudzynski
- Molekularbiologie und Biotechnologie der Pilze, Institut für Biologie und Biotechnologie der Pflanzen, Münster, Germany
| | | | - Marion Andrew
- Biology Department, University of Toronto, Mississauga, Canada
| | | | | | - Rolland Beffa
- Laboratoire de Génomique Fonctionnelle des Champignons Pathogènes de Plantes, UMR5240, Université de Lyon 1 – CNRS – BAYER S.A.S., Lyon, France
| | - Isabelle Benoit
- Microbiology and Kluyver Centre for Genomics of Industrial Fermentations, Utrecht, The Netherlands
| | - Ourdia Bouzid
- Microbiology and Kluyver Centre for Genomics of Industrial Fermentations, Utrecht, The Netherlands
| | - Baptiste Brault
- Unité de Recherche Génomique – Info, UR1164, INRA, Versailles, France
- Biologie et Gestion des Risques en Agriculture – Champignons Pathogènes des Plantes, UR1290, INRA, Grignon, France
| | - Zehua Chen
- Broad Institute of MIT and Harvard, Cambridge, Massachusetts, United States of America
| | - Mathias Choquer
- Biologie et Gestion des Risques en Agriculture – Champignons Pathogènes des Plantes, UR1290, INRA, Grignon, France
- Laboratoire de Génomique Fonctionnelle des Champignons Pathogènes de Plantes, UMR5240, Université de Lyon 1 – CNRS – BAYER S.A.S., Lyon, France
| | - Jérome Collémare
- Laboratory of Phytopathology, Wageningen University, Wageningen, The Netherlands
- Laboratoire de Génomique Fonctionnelle des Champignons Pathogènes de Plantes, UMR5240, Université de Lyon 1 – CNRS – BAYER S.A.S., Lyon, France
| | - Pascale Cotton
- Laboratoire de Génomique Fonctionnelle des Champignons Pathogènes de Plantes, UMR5240, Université de Lyon 1 – CNRS – BAYER S.A.S., Lyon, France
| | - Etienne G. Danchin
- Interactions Biotiques et Santé Plantes, UMR5240, INRA – Université de Nice Sophia-Antipolis – CNRS, Sophia-Antipolis, France
| | | | - Angélique Gautier
- Biologie et Gestion des Risques en Agriculture – Champignons Pathogènes des Plantes, UR1290, INRA, Grignon, France
| | - Corinne Giraud
- Biologie et Gestion des Risques en Agriculture – Champignons Pathogènes des Plantes, UR1290, INRA, Grignon, France
| | - Tatiana Giraud
- Laboratoire d'Ecologie, Systématique et Evolution, Université Paris-Sud – CNRS – AgroParisTech, Orsay, France
| | - Celedonio Gonzalez
- Departamento de Bioquímica y Biología Molecular, Universidad de La Laguna, Tenerife, Spain
| | - Sandrine Grossetete
- Laboratoire de Génomique Fonctionnelle des Champignons Pathogènes de Plantes, UMR5240, Université de Lyon 1 – CNRS – BAYER S.A.S., Lyon, France
| | - Ulrich Güldener
- Helmholtz Zentrum München, German Research Center for Environmental Health, Institute of Bioinformatics and Systems Biology, Neuherberg, Germany
| | - Bernard Henrissat
- Architecture et Fonction des Macromolécules Biologiques, UMR6098, CNRS – Université de la Méditerranée et Université de Provence, Marseille, France
| | | | - Chinnappa Kodira
- Broad Institute of MIT and Harvard, Cambridge, Massachusetts, United States of America
| | | | - Anne Lappartient
- Laboratoire de Génomique Fonctionnelle des Champignons Pathogènes de Plantes, UMR5240, Université de Lyon 1 – CNRS – BAYER S.A.S., Lyon, France
| | - Michaela Leroch
- Faculty of Biology, Kaiserslautern University, Kaiserslautern, Germany
| | - Caroline Levis
- Biologie et Gestion des Risques en Agriculture – Champignons Pathogènes des Plantes, UR1290, INRA, Grignon, France
| | - Evan Mauceli
- Broad Institute of MIT and Harvard, Cambridge, Massachusetts, United States of America
| | - Cécile Neuvéglise
- Biologie Intégrative du Métabolisme Lipidique Microbien, UMR1319, INRA – Micalis – AgroParisTech, Thiverval-Grignon, France
| | - Birgitt Oeser
- Molekularbiologie und Biotechnologie der Pilze, Institut für Biologie und Biotechnologie der Pflanzen, Münster, Germany
| | - Matthew Pearson
- Broad Institute of MIT and Harvard, Cambridge, Massachusetts, United States of America
| | - Julie Poulain
- GENOSCOPE, Centre National de Séquençage, Evry, France
| | - Nathalie Poussereau
- Laboratoire de Génomique Fonctionnelle des Champignons Pathogènes de Plantes, UMR5240, Université de Lyon 1 – CNRS – BAYER S.A.S., Lyon, France
| | - Hadi Quesneville
- Unité de Recherche Génomique – Info, UR1164, INRA, Versailles, France
| | - Christine Rascle
- Laboratoire de Génomique Fonctionnelle des Champignons Pathogènes de Plantes, UMR5240, Université de Lyon 1 – CNRS – BAYER S.A.S., Lyon, France
| | - Julia Schumacher
- Molekularbiologie und Biotechnologie der Pilze, Institut für Biologie und Biotechnologie der Pflanzen, Münster, Germany
| | | | - Adrienne Sexton
- School of Botany, University of Melbourne, Melbourne, Australia
| | - Evelyn Silva
- Fundacion Ciencia para la Vida and Facultad de Ciencias Biologicas, Universidad Andres Bello, Santiago, Chile
| | - Catherine Sirven
- Laboratoire de Génomique Fonctionnelle des Champignons Pathogènes de Plantes, UMR5240, Université de Lyon 1 – CNRS – BAYER S.A.S., Lyon, France
| | - Darren M. Soanes
- School of Biosciences, University of Exeter, Exeter, United Kingdom
| | | | - Matt Templeton
- Plant and Food Research, Mt. Albert Research Centre, Auckland, New Zealand
| | - Chandri Yandava
- Broad Institute of MIT and Harvard, Cambridge, Massachusetts, United States of America
| | - Oded Yarden
- Department of Plant Pathology and Microbiology, Hebrew University Jerusalem, Rehovot, Israel
| | - Qiandong Zeng
- Broad Institute of MIT and Harvard, Cambridge, Massachusetts, United States of America
| | - Jeffrey A. Rollins
- Department of Plant Pathology, University of Florida, Gainesville, Florida, United States of America
| | - Marc-Henri Lebrun
- Unité de Recherche Génomique – Info, UR1164, INRA, Versailles, France
- Biologie et Gestion des Risques en Agriculture – Champignons Pathogènes des Plantes, UR1290, INRA, Grignon, France
- Laboratoire de Génomique Fonctionnelle des Champignons Pathogènes de Plantes, UMR5240, Université de Lyon 1 – CNRS – BAYER S.A.S., Lyon, France
| | - Marty Dickman
- Institute for Plant Genomics and Biotechnology, Borlaug Genomics and Bioinformatics Center, Department of Plant Pathology and Microbiology, Texas A&M University, College Station, Texas, United States of America
|
42
|
Ahmed F, Benedito VA, Zhao PX. Mining Functional Elements in Messenger RNAs: Overview, Challenges, and Perspectives. FRONTIERS IN PLANT SCIENCE 2011; 2:84. [PMID: 22639614 PMCID: PMC3355573 DOI: 10.3389/fpls.2011.00084] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Received: 08/31/2011] [Accepted: 11/03/2011] [Indexed: 05/03/2023]
Abstract
Eukaryotic messenger RNA (mRNA) contains not only protein-coding regions but also a plethora of functional cis-elements that influence or coordinate a number of regulatory aspects of gene expression, such as mRNA stability, splicing forms, and translation rates. Understanding the rules that apply to each of these element types (e.g., whether the element is defined by primary or higher-order structure) allows for the discovery of novel mechanisms of gene expression as well as the design of transcripts with controlled expression. Bioinformatics plays a major role in creating databases and finding non-evident patterns governing each type of eukaryotic functional element. Much of what we currently know about mRNA regulatory elements in eukaryotes is derived from microorganism and animal systems, with the particularities of plant systems lagging behind. In this review, we provide a general introduction to the most well-known eukaryotic mRNA regulatory motifs (splicing regulatory elements, internal ribosome entry sites, iron-responsive elements, AU-rich elements, zipcodes, and polyadenylation signals) and describe available bioinformatics resources (databases and analysis tools) to analyze eukaryotic transcripts in search of functional elements, focusing on recent trends in bioinformatics methods and tool development. We also discuss future directions in the development of better computational tools based upon current knowledge of these functional elements. Improved computational tools would advance our understanding of the processes underlying gene regulations. We encourage plant bioinformaticians to turn their attention to this subject to help identify novel mechanisms of gene expression regulation using RNA motifs that have potentially evolved or diverged in plant species.
Affiliation(s)
- Firoz Ahmed
- Bioinformatics Laboratory, Plant Biology Division, Samuel Roberts Noble Foundation, Ardmore, OK, USA
- Vagner A. Benedito
- Genetics and Developmental Biology, Plant and Soil Sciences Division, West Virginia University, Morgantown, WV, USA
- Patrick Xuechun Zhao
- Bioinformatics Laboratory, Plant Biology Division, Samuel Roberts Noble Foundation, Ardmore, OK, USA
- *Correspondence: Patrick Xuechun Zhao, Bioinformatics Laboratory, Plant Biology Division, Samuel Roberts Noble Foundation, 2510 Sam Noble Parkway, Ardmore, OK 73401, USA e-mail:
|
43
|
Abstract
We sequenced and assembled the draft genome of Theobroma cacao, an economically important tropical-fruit tree crop that is the source of chocolate. This assembly corresponds to 76% of the estimated genome size and contains almost all previously described genes, with 82% of these genes anchored on the 10 T. cacao chromosomes. Analysis of this sequence information highlighted specific expansion of some gene families during evolution, for example, flavonoid-related genes. It also provides a major source of candidate genes for T. cacao improvement. Based on the inferred paleohistory of the T. cacao genome, we propose an evolutionary scenario whereby the ten T. cacao chromosomes were shaped from an ancestor through eleven chromosome fusions.
|
44
|
Nasibov E, Tunaboylu S. Classification of splice-junction sequences via weighted position specific scoring approach. Comput Biol Chem 2010; 34:293-9. [PMID: 21056007 DOI: 10.1016/j.compbiolchem.2010.10.003] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Received: 10/04/2010] [Accepted: 10/06/2010] [Indexed: 11/30/2022]
Abstract
The prediction of the complete structure of genes is one of the very important tasks of bioinformatics, especially in eukaryotes. A crucial part in the gene structure prediction is to determine the splice sites in the coding region. Identification of splice sites depends on the precise recognition of the boundaries between exons and introns of a given DNA sequence. This problem can be formulated as a classification of sequence elements into 'exon-intron' (EI), 'intron-exon' (IE) or 'None' (N) boundary classes. In this study we propose a new Weighted Position Specific Scoring Method (WPSSM) to recognize splice sites which uses a position-specific scoring matrix constructed by nucleotide base frequencies. A genetic algorithm is used in order to tune the weight and threshold parameters of the positions on. This method consists of two phases: learning phase and identification phase. The proposed WPSS method poses efficient results compared with the performance of many methods proposed in the literature. Computational experiments are performed on the DNA sequence datasets from 'UCI Repository of machine learning databases'.
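As a rough illustration of the position-specific scoring idea named in this abstract, the sketch below builds a log-odds matrix from nucleotide frequencies and scores a candidate window with optional per-position weights. It is not the authors' WPSSM: the function names `build_pssm` and `score`, the pseudocount, the uniform background, and the toy sequences are all assumptions, and the genetic-algorithm tuning of weights and threshold is not implemented (weights are simply passed in by hand).

```python
import math
from collections import Counter


def build_pssm(seqs, alphabet="ACGT", pseudocount=1.0):
    """Build a log-odds position-specific scoring matrix from aligned sequences.

    Assumes all sequences have equal length and a uniform background
    distribution over the alphabet (an illustrative simplification).
    """
    background = 1.0 / len(alphabet)
    total = len(seqs) + pseudocount * len(alphabet)
    pssm = []
    for pos in range(len(seqs[0])):
        counts = Counter(s[pos] for s in seqs)
        column = {
            base: math.log2(((counts[base] + pseudocount) / total) / background)
            for base in alphabet
        }
        pssm.append(column)
    return pssm


def score(pssm, seq, weights=None):
    """Sum the per-position log-odds scores, each scaled by its weight."""
    if weights is None:
        weights = [1.0] * len(pssm)
    return sum(w * column[base] for w, column, base in zip(weights, pssm, seq))


# Toy, hand-made donor-like windows (hypothetical data, not from the paper).
donors = ["AGGTAAGT", "CGGTGAGT", "AGGTAAGA", "TGGTAAGT"]
pssm = build_pssm(donors)
print(score(pssm, "AGGTAAGT"), score(pssm, "ACCTTTCA"))
```

A window would be called a splice site when its score exceeds a threshold; in the paper that threshold and the position weights are tuned by a genetic algorithm, which this sketch leaves out.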
Affiliation(s)
- Efendi Nasibov
- Department of Computer Science, Dokuz Eylul University, Izmir, Turkey. efendi
|
45
|
Baurens FC, Bocs S, Rouard M, Matsumoto T, Miller RNG, Rodier-Goud M, MBéguié-A-MBéguié D, Yahiaoui N. Mechanisms of haplotype divergence at the RGA08 nucleotide-binding leucine-rich repeat gene locus in wild banana (Musa balbisiana). BMC Plant Biology 2010; 10:149. [PMID: 20637079 PMCID: PMC3017797 DOI: 10.1186/1471-2229-10-149] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.7] [Received: 09/16/2009] [Accepted: 07/16/2010] [Indexed: 05/09/2023]
Abstract
BACKGROUND Comparative sequence analysis of complex loci such as resistance gene analog clusters allows estimating the degree of sequence conservation and mechanisms of divergence at the intraspecies level. In banana (Musa sp.), two diploid wild species Musa acuminata (A genome) and Musa balbisiana (B genome) contribute to the polyploid genome of many cultivars. The M. balbisiana species is associated with vigour and tolerance to pests and disease and little is known on the genome structure and haplotype diversity within this species. Here, we compare two genomic sequences of 253 and 223 kb corresponding to two haplotypes of the RGA08 resistance gene analog locus in M. balbisiana "Pisang Klutuk Wulung" (PKW). RESULTS Sequence comparison revealed two regions of contrasting features. The first is a highly colinear gene-rich region where the two haplotypes diverge only by single nucleotide polymorphisms and two repetitive element insertions. The second corresponds to a large cluster of RGA08 genes, with 13 and 18 predicted RGA genes and pseudogenes spread over 131 and 152 kb respectively on each haplotype. The RGA08 cluster is enriched in repetitive element insertions, in duplicated non-coding intergenic sequences including low complexity regions and shows structural variations between haplotypes. Although some allelic relationships are retained, a large diversity of RGA08 genes occurs in this single M. balbisiana genotype, with several RGA08 paralogs specific to each haplotype. The RGA08 gene family has evolved by mechanisms of unequal recombination, intragenic sequence exchange and diversifying selection. An unequal recombination event taking place between duplicated non-coding intergenic sequences resulted in a different RGA08 gene content between haplotypes pointing out the role of such duplicated regions in the evolution of RGA clusters. Based on the synonymous substitution rate in coding sequences, we estimated a 1 million year divergence time for these M. balbisiana haplotypes. CONCLUSIONS A large RGA08 gene cluster identified in wild banana corresponds to a highly variable genomic region between haplotypes surrounded by conserved flanking regions. High level of sequence identity (70 to 99%) of the genic and intergenic regions suggests a recent and rapid evolution of this cluster in M. balbisiana.
Affiliation(s)
- Stéphanie Bocs
- CIRAD, UMR DAP, TA A-96/03, Avenue Agropolis, F-34398 Montpellier Cedex 5, France
- Mathieu Rouard
- Bioversity International, Parc Scientifique Agropolis II, F-34397 Montpellier Cedex 5, France
- Takashi Matsumoto
- Rice Genome Research Program (RGP), National Institute of Agrobiological Sciences (NIAS)/Institute of the Society for Techno-innovation of Agriculture, Forestry and Fisheries, Tsukuba, Ibaraki 305-8602, Japan
- Robert NG Miller
- Postgraduate program in Genomic Science and Biotechnology, Universidade Católica de Brasília, SGAN 916, Módulo B, CEP 70.790-160, Brasília, DF, Brazil
- Universidade de Brasília, Campus Universitário Darcy Ribeiro, Instituto de Ciências Biológicas, Departamento de Biologia Celular, Asa Norte, Brasília, Brazil
- Nabila Yahiaoui
- CIRAD, UMR DAP, TA A-96/03, Avenue Agropolis, F-34398 Montpellier Cedex 5, France
|
46
|
Abstract
MOTIVATION A large part of the maize B73 genome sequence is now available and emerging sequencing technologies will offer cheap and easy ways to sequence areas of interest from many other maize genotypes. One of the steps required to turn these sequences into valuable information is gene content prediction. To date, there is no publicly available gene predictor specifically trained for maize sequences. To this end, we have chosen to train the EuGène software that can combine several sources of evidence into a consolidated gene model prediction. AVAILABILITY http://genome.jouy.inra.fr/eugene/cgi-bin/eugene_form.pl.
Affiliation(s)
- Pierre Montalent
- INRA, UMR 0320 / UMR 8120 Génétique Végétale, Gif-sur-Yvette, France
|
47
|
Sinha R, Zimmer AD, Bolte K, Lang D, Reski R, Platzer M, Rensing SA, Backofen R. Identification and characterization of NAGNAG alternative splicing in the moss Physcomitrella patens. BMC Plant Biology 2010; 10:76. [PMID: 20426810 PMCID: PMC3095350 DOI: 10.1186/1471-2229-10-76] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Received: 12/15/2009] [Accepted: 04/28/2010] [Indexed: 05/05/2023]
Abstract
BACKGROUND Alternative splicing (AS) involving tandem acceptors that are separated by three nucleotides (NAGNAG) is an evolutionarily widespread class of AS, which is well studied in Homo sapiens (human) and Mus musculus (mouse). It has also been shown to be common in the model seed plants Arabidopsis thaliana and Oryza sativa (rice). In one of the first studies involving sequence-based prediction of AS in plants, we performed a genome-wide identification and characterization of NAGNAG AS in the model plant Physcomitrella patens, a moss. RESULTS Using Sanger data, we found 295 alternatively used NAGNAG acceptors in P. patens. Using 31 features and training and test datasets of constitutive and alternative NAGNAGs, we trained a classifier to predict the splicing outcome at NAGNAG tandem splice sites (alternative splicing, constitutive at the first acceptor, or constitutive at the second acceptor). Our classifier achieved a balanced specificity and sensitivity of ≥ 89%. Subsequently, a classifier trained exclusively on data well supported by transcript evidence was used to make genome-wide predictions of NAGNAG splicing outcomes. By generation of more transcript evidence from a next-generation sequencing platform (Roche 454), we found additional evidence for NAGNAG AS, with altogether 664 alternative NAGNAGs being detected in P. patens using all currently available transcript evidence. The 454 data also enabled us to validate the predictions of the classifier, with 64% (80/125) of the well-supported cases of AS being predicted correctly. CONCLUSION NAGNAG AS is just as common in the moss P. patens as it is in the seed plants A. thaliana and O. sativa (but not conserved on the level of orthologous introns), and can be predicted with high accuracy. The most informative features are the nucleotides in the NAGNAG and in its immediate vicinity, along with the splice sites scores, as found earlier for NAGNAG AS in animals. Our results suggest that the mechanism behind NAGNAG AS in plants is similar to that in animals and is largely dependent on the splice site and its immediate neighborhood.
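To make the NAGNAG motif concrete, the following minimal sketch scans a DNA string for tandem acceptor candidates, that is, two AG dinucleotides whose starts are three nucleotides apart. It only locates the sequence pattern; it does not check intron context, transcript evidence, or any of the 31 classifier features mentioned in the abstract. The function name `find_nagnag` and the example sequence are hypothetical.

```python
import re

# A lookahead is used so that overlapping motifs are all reported.
NAGNAG = re.compile(r"(?=([ACGT]AG[ACGT]AG))")


def find_nagnag(seq):
    """Return (0-based start, motif) for every NAGNAG occurrence in seq."""
    return [(m.start(), m.group(1)) for m in NAGNAG.finditer(seq.upper())]


hits = find_nagnag("TTTCAGCAGGT")
print(hits)
```

In a real analysis these positions would be intersected with annotated intron 3' ends before any splicing outcome is predicted.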
Affiliation(s)
- Rileen Sinha
- Bioinformatics group, University of Freiburg, Georges-Koehler-Allee 106, 79110 Freiburg, Germany
- Centre for Biological Signalling Studies (bioss), University of Freiburg, Albertstr. 19, 79104 Freiburg, Germany
- Andreas D Zimmer
- Faculty of Biology, University of Freiburg, Hauptstrasse 1, 79104 Freiburg, Germany
- Plant Biotechnology, Faculty of Biology, University of Freiburg, Schaenzlestrasse 1, 79104 Freiburg, Germany
- Kathrin Bolte
- Faculty of Biology, University of Freiburg, Hauptstrasse 1, 79104 Freiburg, Germany
- Freiburg Initiative for Systems Biology (FRISYS), University of Freiburg, Schaenzlestrasse 1, 79104 Freiburg, Germany
- Philipps-Universität Marburg, Laboratorium für Zellbiologie, Karl-von-Frisch Str., 35032 Marburg, Germany
- Daniel Lang
- Faculty of Biology, University of Freiburg, Hauptstrasse 1, 79104 Freiburg, Germany
- Plant Biotechnology, Faculty of Biology, University of Freiburg, Schaenzlestrasse 1, 79104 Freiburg, Germany
- Ralf Reski
- Plant Biotechnology, Faculty of Biology, University of Freiburg, Schaenzlestrasse 1, 79104 Freiburg, Germany
- Freiburg Initiative for Systems Biology (FRISYS), University of Freiburg, Schaenzlestrasse 1, 79104 Freiburg, Germany
- Centre for Biological Signalling Studies (bioss), University of Freiburg, Albertstr. 19, 79104 Freiburg, Germany
- Matthias Platzer
- Genome Analysis, Leibniz Institute for Age Research - Fritz Lipmann Institute, Beutenbergstr. 11, 07745 Jena, Germany
- Stefan A Rensing
- Faculty of Biology, University of Freiburg, Hauptstrasse 1, 79104 Freiburg, Germany
- Freiburg Initiative for Systems Biology (FRISYS), University of Freiburg, Schaenzlestrasse 1, 79104 Freiburg, Germany
- Centre for Biological Signalling Studies (bioss), University of Freiburg, Albertstr. 19, 79104 Freiburg, Germany
- Rolf Backofen
- Bioinformatics group, University of Freiburg, Georges-Koehler-Allee 106, 79110 Freiburg, Germany
- Freiburg Initiative for Systems Biology (FRISYS), University of Freiburg, Schaenzlestrasse 1, 79104 Freiburg, Germany
- Centre for Biological Signalling Studies (bioss), University of Freiburg, Albertstr. 19, 79104 Freiburg, Germany
|
48
|
SpliceIT: a hybrid method for splice signal identification based on probabilistic and biological inference. J Biomed Inform 2009; 43:208-17. [PMID: 19800027 DOI: 10.1016/j.jbi.2009.09.004] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Received: 10/26/2008] [Revised: 08/25/2009] [Accepted: 09/21/2009] [Indexed: 11/23/2022]
Abstract
Splice sites define the boundaries of exonic regions and dictate protein synthesis and function. The splicing mechanism involves complex interactions among positional and compositional features of different lengths. Computational modeling of the underlying constructive information is especially challenging, in order to decipher splicing-inducing elements and alternative splicing factors. SpliceIT (Splice Identification Technique) introduces a hybrid method for splice site prediction that couples probabilistic modeling with discriminative computational or experimental features inferred from published studies in two subsequent classification steps. The first step is undertaken by a Gaussian support vector machine (SVM) trained on the probabilistic profile that is extracted using two alternative position-dependent feature selection methods. In the second step, the extracted predictions are combined with known species-specific regulatory elements, in order to induce a tree-based modeling. The performance evaluation on human and Arabidopsis thaliana splice site datasets shows that SpliceIT is highly accurate compared to current state-of-the-art predictors in terms of the maximum sensitivity, specificity tradeoff without compromising space complexity and in a time-effective way. The source code and supplementary material are available at: http://www.med.auth.gr/research/spliceit/.
|
49
|
Varadwaj P, Purohit N, Arora B. Detection of Splice Sites Using Support Vector Machine. Communications in Computer and Information Science 2009. [DOI: 10.1007/978-3-642-03547-0_47] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 01/21/2023]
|
50
|
Baten AKMA, Halgamuge SK, Chang BCH. Fast splice site detection using information content and feature reduction. BMC Bioinformatics 2008; 9 Suppl 12:S8. [PMID: 19091031 PMCID: PMC2638148 DOI: 10.1186/1471-2105-9-s12-s8] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.3] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Accurate identification of splice sites in DNA sequences plays a key role in the prediction of gene structure in eukaryotes. Already many computational methods have been proposed for the detection of splice sites and some of them showed high prediction accuracy. However, most of these methods are limited in terms of their long computation time when applied to whole genome sequence data. RESULTS In this paper we propose a hybrid algorithm which combines several effective and informative input features with the state of the art support vector machine (SVM). To obtain the input features we employ information content method based on Shannon's information theory, Shapiro's score scheme, and Markovian probabilities. We also use a feature elimination scheme to reduce the less informative features from the input data. CONCLUSION In this study we propose a new feature based splice site detection method that shows improved acceptor and donor splice site detection in DNA sequences when the performance is compared with various state of the art and well known methods.
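The information-content feature mentioned in this abstract can be sketched for aligned DNA windows as the maximum possible entropy (2 bits for four nucleotides) minus the observed Shannon entropy at each position. This is a generic textbook formulation offered as an assumption; the paper's exact feature definition, any small-sample correction, and its combination with Shapiro's score and Markovian probabilities are not reproduced. The name `information_content` and the toy input are hypothetical.

```python
import math
from collections import Counter


def information_content(seqs, alphabet="ACGT"):
    """Per-position information content in bits for equal-length sequences.

    IC(pos) = log2(|alphabet|) - H(pos), where H is the Shannon entropy of
    the observed nucleotide frequencies at that position.
    """
    n = len(seqs)
    max_entropy = math.log2(len(alphabet))
    ic = []
    for pos in range(len(seqs[0])):
        counts = Counter(s[pos] for s in seqs)
        entropy = -sum(
            (counts[b] / n) * math.log2(counts[b] / n)
            for b in alphabet
            if counts[b] > 0
        )
        ic.append(max_entropy - entropy)
    return ic


# Toy alignment: position 0 is uniform, position 1 is fully conserved.
ic = information_content(["AG", "CG", "GG", "TG"])
print(ic)
```

Positions with high information content, such as the conserved dinucleotides at splice junctions, are the ones a feature-elimination step would be expected to retain.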
Affiliation(s)
- AKMA Baten
- Biomechanical Engineering Research Group, Department of Mechanical Engineering, Melbourne School of Engineering, The University of Melbourne, Victoria 3010, Australia
- SK Halgamuge
- Biomechanical Engineering Research Group, Department of Mechanical Engineering, Melbourne School of Engineering, The University of Melbourne, Victoria 3010, Australia
- BCH Chang
- Institute of Plant and Microbial Biology, Academia Sinica, Taiwan
|