1. Grešová K, Martinek V, Čechák D, Šimeček P, Alexiou P. Genomic benchmarks: a collection of datasets for genomic sequence classification. BMC Genom Data 2023; 24:25. [PMID: 37127596 PMCID: PMC10150520 DOI: 10.1186/s12863-023-01123-8] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Received: 08/18/2022] [Accepted: 03/31/2023] [Indexed: 05/03/2023]
Abstract
BACKGROUND: Recently, deep neural networks have been successfully applied in many biological fields. In 2020, the deep learning model AlphaFold won the protein folding competition with predicted structures within the error tolerance of experimental methods. However, this solution to the most prominent bioinformatic challenge of the past 50 years was possible only thanks to a carefully curated benchmark of experimentally determined protein structures. In genomics, we face similar challenges (annotation of genomes and identification of functional elements), but we currently lack benchmarks comparable to the protein folding competition.
RESULTS: Here we present a collection of curated and easily accessible sequence classification datasets in the field of genomics. The proposed collection combines novel datasets constructed by mining publicly available databases with existing datasets obtained from published articles. The collection currently contains nine datasets that focus on regulatory elements (promoters, enhancers, open chromatin regions) from three model organisms: human, mouse, and roundworm. A simple convolutional neural network is also included in the repository and can be used as a baseline model. Benchmarks and the baseline model are distributed as the Python package 'genomic-benchmarks', and the code is available at https://github.com/ML-Bioinfo-CEITEC/genomic_benchmarks .
CONCLUSIONS: Deep learning techniques have revolutionized many biological fields, but mainly thanks to carefully curated benchmarks. For the field of genomics, we propose a collection of benchmark datasets for the classification of genomic sequences, with an interface for the most commonly used deep learning libraries, an implementation of a simple neural network, and a training framework that can be used as a starting point for future research. The main aim of this effort is to create a repository of shared datasets that will make machine learning for genomics more comparable and reproducible while reducing the overhead for researchers who want to enter the field, leading to healthy competition and new discoveries.
Affiliation(s)
- Katarína Grešová
  - Centre for Molecular Medicine, Central European Institute of Technology (CEITEC), Masaryk University, Brno, Czechia
  - National Centre for Biomolecular Research, Faculty of Science, Masaryk University, Brno, Czechia
- Vlastimil Martinek
  - Centre for Molecular Medicine, Central European Institute of Technology (CEITEC), Masaryk University, Brno, Czechia
  - National Centre for Biomolecular Research, Faculty of Science, Masaryk University, Brno, Czechia
- David Čechák
  - Centre for Molecular Medicine, Central European Institute of Technology (CEITEC), Masaryk University, Brno, Czechia
  - National Centre for Biomolecular Research, Faculty of Science, Masaryk University, Brno, Czechia
- Petr Šimeček
  - Centre for Molecular Medicine, Central European Institute of Technology (CEITEC), Masaryk University, Brno, Czechia
- Panagiotis Alexiou
  - Centre for Molecular Medicine, Central European Institute of Technology (CEITEC), Masaryk University, Brno, Czechia
2. Zhang M, Jia C, Li F, Li C, Zhu Y, Akutsu T, Webb GI, Zou Q, Coin LJM, Song J. Critical assessment of computational tools for prokaryotic and eukaryotic promoter prediction. Brief Bioinform 2022; 23:6502561. [PMID: 35021193 PMCID: PMC8921625 DOI: 10.1093/bib/bbab551] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 08/09/2021] [Revised: 11/12/2021] [Accepted: 11/30/2021] [Indexed: 01/13/2023]
Abstract
Promoters are crucial regulatory DNA regions for gene transcriptional activation. Rapid advances in next-generation sequencing technologies have accelerated the accumulation of genome sequences, providing increased training data to inform computational approaches for both prokaryotic and eukaryotic promoter prediction. However, it remains a significant challenge to accurately identify species-specific promoter sequences using computational approaches. To advance computational support for promoter prediction, in this study, we curated 58 comprehensive, up-to-date, benchmark datasets for 7 different species (i.e. Escherichia coli, Bacillus subtilis, Homo sapiens, Mus musculus, Arabidopsis thaliana, Zea mays and Drosophila melanogaster) to assist the research community to assess the relative functionality of alternative approaches and support future research on both prokaryotic and eukaryotic promoters. We revisited 106 predictors published since 2000 for promoter identification (40 for prokaryotic promoter, 61 for eukaryotic promoter, and 5 for both). We systematically evaluated their training datasets, computational methodologies, calculated features, performance and software usability. On the basis of these benchmark datasets, we benchmarked 19 predictors with functioning webservers/local tools and assessed their prediction performance. We found that deep learning and traditional machine learning-based approaches generally outperformed scoring function-based approaches. Taken together, the curated benchmark dataset repository and the benchmarking analysis in this study serve to inform the design and implementation of computational approaches for promoter prediction and facilitate more rigorous comparison of new techniques in the future.
Affiliation(s)
- Cangzhi Jia (corresponding author)
  - School of Science, Dalian Maritime University, Dalian 116026, China
- Geoffrey I Webb
  - Department of Data Science and Artificial Intelligence, Monash University, Melbourne, VIC 3800, Australia
  - Monash Data Futures Institute, Monash University, Melbourne, VIC 3800, Australia
- Quan Zou (corresponding author)
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
- Lachlan J M Coin (corresponding author)
  - Department of Microbiology and Immunology, The Peter Doherty Institute for Infection and Immunity, The University of Melbourne, 792 Elizabeth Street, Melbourne, Victoria 3000, Australia
- Jiangning Song (corresponding author)
  - Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia
3. Bhandari N, Khare S, Walambe R, Kotecha K. Comparison of machine learning and deep learning techniques in promoter prediction across diverse species. PeerJ Comput Sci 2021; 7:e365. [PMID: 33817015 PMCID: PMC7959599 DOI: 10.7717/peerj-cs.365] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Received: 09/22/2020] [Accepted: 12/30/2020] [Indexed: 06/12/2023]
Abstract
Gene promoters are the key DNA regulatory elements positioned around the transcription start sites and are responsible for regulating the gene transcription process. Various alignment-based, signal-based and content-based approaches have been reported for the prediction of promoters. However, since not all promoter sequences show explicit features, the prediction performance of these techniques is poor. Therefore, many machine learning and deep learning models have been proposed for promoter prediction. In this work, we studied methods for vector encoding and promoter classification using genome sequences of three distinct higher eukaryotes viz. yeast (Saccharomyces cerevisiae), A. thaliana (plant) and human (Homo sapiens). We compared the one-hot vector encoding method with frequency-based tokenization (FBT) for data pre-processing on a 1-D Convolutional Neural Network (CNN) model. We found that FBT gives a shorter input dimension, reducing the training time without affecting the sensitivity and specificity of classification. We employed deep learning techniques, mainly CNN and a recurrent neural network with Long Short Term Memory (LSTM), and a random forest (RF) classifier for promoter classification at k-mer sizes of 2, 4 and 8. We found CNN to be superior in classification of promoters from non-promoter sequences (binary classification) as well as species-specific classification of promoter sequences (multiclass classification). In summary, the contribution of this work lies in the use of a synthetic shuffled negative dataset and frequency-based tokenization for pre-processing. This study provides a comprehensive and generic framework for classification tasks in genomic applications and can be extended to various classification problems.
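The contrast between one-hot encoding and frequency-based tokenization mentioned in this abstract can be illustrated with a minimal sketch. The abstract does not give the exact FBT procedure, so everything here is an assumption for illustration: the function names are hypothetical, k-mers are taken with stride 1, and ids are assigned by descending k-mer frequency with 0 reserved for unseen k-mers.

```python
from collections import Counter

BASES = "ACGT"

def one_hot(seq):
    """Encode each base as a 4-element indicator vector (A, C, G, T)."""
    return [[1 if base == b else 0 for b in BASES] for base in seq]

def kmers(seq, k):
    """Split a sequence into overlapping k-mers (stride 1, an assumption)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def build_vocab(seqs, k):
    """Map each k-mer to an integer id, most frequent first (ids start at 1)."""
    counts = Counter(km for s in seqs for km in kmers(s, k))
    # Ties are broken alphabetically so the vocabulary is deterministic.
    ranked = sorted(counts, key=lambda km: (-counts[km], km))
    return {km: i + 1 for i, km in enumerate(ranked)}

def tokenize(seq, vocab, k):
    """Replace each k-mer with its id; unseen k-mers map to 0."""
    return [vocab.get(km, 0) for km in kmers(seq, k)]

seqs = ["ACGTACGT", "ACGTTTTT"]   # toy sequences, not real promoters
vocab = build_vocab(seqs, k=2)
tokens = tokenize(seqs[0], vocab, k=2)
matrix = one_hot(seqs[0])
```

For the 8-base toy sequence, the one-hot form is an 8 x 4 matrix (32 values) while the tokenized form is a vector of 7 integers, which is the sense in which a token-based input is shorter.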
Affiliation(s)
- Nikita Bhandari
  - Computer Science, Symbiosis Institute of Technology, Symbiosis International (Deemed University), Pune, MH, India
- Satyajeet Khare
  - Symbiosis School of Biological Sciences, Symbiosis International (Deemed University), Pune, MH, India
- Rahee Walambe
  - Symbiosis Centre for Applied Artificial Intelligence, Symbiosis International (Deemed University), Pune, Maharashtra, India
  - Electronics and Telecommunication Dept, Symbiosis Institute of Technology, Pune, Maharashtra, India
- Ketan Kotecha
  - Computer Science, Symbiosis Institute of Technology, Symbiosis International (Deemed University), Pune, MH, India
  - Symbiosis Centre for Applied Artificial Intelligence, Symbiosis International (Deemed University), Pune, Maharashtra, India
4. Abbas MM, Mohie-Eldin MM, EL-Manzalawy Y. Assessing the effects of data selection and representation on the development of reliable E. coli sigma 70 promoter region predictors. PLoS One 2015; 10:e0119721. [PMID: 25803493 PMCID: PMC4372424 DOI: 10.1371/journal.pone.0119721] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.8] [Received: 06/12/2014] [Accepted: 01/26/2015] [Indexed: 11/27/2022]
Abstract
As the number of sequenced bacterial genomes increases, the need for rapid and reliable tools for the annotation of functional elements (e.g., transcriptional regulatory elements) becomes more pressing. Promoters are the key regulatory elements, which recruit the transcriptional machinery through binding to a variety of regulatory proteins (known as sigma factors). The identification of promoter regions is very challenging because these regions do not adhere to specific sequence patterns or motifs and are difficult to determine experimentally. Machine learning represents a promising and cost-effective approach for computational identification of prokaryotic promoter regions. However, the quality of the predictors depends on several factors including: i) training data; ii) data representation; iii) classification algorithms; iv) evaluation procedures. In this work, we create several variants of E. coli promoter data sets and utilize them to experimentally examine the effect of these factors on the predictive performance of E. coli σ70 promoter models. Our results suggest that under some combinations of the first three criteria, a prediction model might perform very well in cross-validation experiments while its performance on independent test data is very poor. This emphasizes the importance of evaluating promoter region predictors using independent test data, which corrects for the over-optimistic performance that might be estimated using the cross-validation procedure. Our analysis of the tested models shows that good prediction models often perform well regardless of how the non-promoter data was obtained. On the other hand, poor prediction models seem to be more sensitive to the choice of non-promoter sequences. Interestingly, the best performing sequence-based classifiers outperform the best performing structure-based classifiers in both cross-validation and independent test performance evaluation experiments. Finally, we propose a meta-predictor method combining two top performing sequence-based and structure-based classifiers and compare its performance with some of the state-of-the-art E. coli σ70 promoter prediction methods.
Affiliation(s)
- Mostafa M. Abbas
  - KINDI Center for Computing Research, College of Engineering, Qatar University, Doha, Qatar
- Yasser EL-Manzalawy
  - Systems and Computer Engineering, Al-Azhar University, Cairo, Egypt
  - College of Information Sciences, Penn State University, University Park, United States of America
5. Yang P, Yoo PD, Fernando J, Zhou BB, Zhang Z, Zomaya AY. Sample Subset Optimization Techniques for Imbalanced and Ensemble Learning Problems in Bioinformatics Applications. IEEE Transactions on Cybernetics 2014; 44:445-55. [PMID: 24108722 DOI: 10.1109/tcyb.2013.2257480] [Citation(s) in RCA: 35] [Impact Index Per Article: 3.2] [Indexed: 05/20/2023]
Abstract
Data sampling is a widely used technique in a broad range of machine learning problems. Traditional sampling approaches generally rely on random resampling from a given dataset. However, these approaches do not take into consideration additional information, such as sample quality and usefulness. We recently proposed a data sampling technique, called sample subset optimization (SSO). The SSO technique relies on a cross-validation procedure for identifying and selecting the most useful samples as subsets. In this paper, we describe the application of SSO techniques to imbalanced and ensemble learning problems, respectively. For imbalanced learning, the SSO technique is employed as an under-sampling technique for identifying a subset of highly discriminative samples in the majority class. In ensemble learning, the SSO technique is utilized as a generic ensemble technique where multiple optimized subsets of samples from each class are selected for building an ensemble classifier. We demonstrate the utilities and advantages of the proposed techniques on a variety of bioinformatics applications where class imbalance, small sample size, and noisy data are prevalent.
6. de Avila e Silva S, Forte F, T S Sartor I, Andrighetti T, J L Gerhardt G, Longaray Delamare AP, Echeverrigaray S. DNA duplex stability as discriminative characteristic for Escherichia coli σ(54)- and σ(28)- dependent promoter sequences. Biologicals 2013; 42:22-8. [PMID: 24172230 DOI: 10.1016/j.biologicals.2013.10.001] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.9] [Received: 08/28/2013] [Accepted: 10/01/2013] [Indexed: 11/17/2022]
Abstract
The advent of modern high-throughput sequencing has made it possible to generate vast quantities of genomic sequence data. However, the processing of this volume of information, including prediction of gene-coding and regulatory sequences, remains an important bottleneck in bioinformatics research. In this work, we integrated DNA duplex stability into the repertoire of a Neural Network (NN) capable of predicting promoter regions with augmented accuracy, specificity and sensitivity. We took our method beyond a simplistic analysis based on a single sigma subunit of RNA polymerase, incorporating the six main sigma-subunits of Escherichia coli. The methodology successfully rediscovered known promoter sequences recognized by E. coli RNA polymerase subunits σ(24), σ(28), σ(32), σ(38), σ(54) and σ(70), with notable accuracies for σ(28)- and σ(54)- dependent promoter sequences (values obtained were 80% and 78.8%, respectively). Furthermore, the discrimination of promoters according to the σ factor made it possible to extract functional commonalities for the genes expressed by each type of promoter. DNA duplex stability emerges as a distinctive feature that improves the recognition and classification of σ(28)- and σ(54)- dependent promoter sequences. The findings presented in this report underscore the usefulness of including DNA biophysical parameters in NN learning algorithms to increase accuracy, specificity and sensitivity in promoter prediction beyond what is accomplished based on sequence alone.
Affiliation(s)
- Scheila de Avila e Silva
  - Universidade de Caxias do Sul, Instituto de Biotecnologia, Rua Francisco Getúlio Vargas, 1130, CEP 95070-560 Caxias do Sul, RS, Brazil
- Franciele Forte
  - Universidade de Caxias do Sul, Instituto de Biotecnologia, Rua Francisco Getúlio Vargas, 1130, CEP 95070-560 Caxias do Sul, RS, Brazil
- Ivaine T S Sartor
  - Universidade de Caxias do Sul, Instituto de Biotecnologia, Rua Francisco Getúlio Vargas, 1130, CEP 95070-560 Caxias do Sul, RS, Brazil
- Tahila Andrighetti
  - Universidade de Caxias do Sul, Instituto de Biotecnologia, Rua Francisco Getúlio Vargas, 1130, CEP 95070-560 Caxias do Sul, RS, Brazil
- Günther J L Gerhardt
  - Universidade de Caxias do Sul, Instituto de Biotecnologia, Rua Francisco Getúlio Vargas, 1130, CEP 95070-560 Caxias do Sul, RS, Brazil
- Ana Paula Longaray Delamare
  - Universidade de Caxias do Sul, Instituto de Biotecnologia, Rua Francisco Getúlio Vargas, 1130, CEP 95070-560 Caxias do Sul, RS, Brazil
- Sergio Echeverrigaray
  - Universidade de Caxias do Sul, Instituto de Biotecnologia, Rua Francisco Getúlio Vargas, 1130, CEP 95070-560 Caxias do Sul, RS, Brazil
7. Batuwita R, Palade V. Adjusted geometric-mean: a novel performance measure for imbalanced bioinformatics datasets learning. J Bioinform Comput Biol 2012; 10:1250003. [DOI: 10.1142/s0219720012500035] [Citation(s) in RCA: 36] [Impact Index Per Article: 2.8] [Indexed: 12/28/2022]
Abstract
One common and challenging problem faced by many bioinformatics applications, such as promoter recognition, splice site prediction, RNA gene prediction, drug discovery and protein classification, is the imbalance of the available datasets. In most of these applications, the positive data examples are largely outnumbered by the negative data examples, which often leads to the development of sub-optimal prediction models having a high negative recognition rate (Specificity = SP) and a low positive recognition rate (Sensitivity = SE). When class imbalance learning methods are applied, the SE is usually increased at the expense of some reduction in the SP. In this paper, we point out that in these data-imbalanced bioinformatics applications, the goal of applying class imbalance learning methods should be to increase the SE as much as possible while keeping the reduction of SP as low as possible. We explain that the existing performance measures used in class imbalance learning can still produce sub-optimal models with respect to this classification goal. In order to overcome these problems, we introduce a new performance measure called Adjusted Geometric-mean (AGm). The experimental results obtained on ten real-world imbalanced bioinformatics datasets demonstrate that the AGm metric can achieve a lower rate of reduction of SP than the existing performance metrics when increasing the SE through class imbalance learning methods. This characteristic of the AGm metric makes it more suitable for achieving the proposed classification goal in imbalanced bioinformatics datasets learning.
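The SE and SP quantities discussed in this abstract, and the ordinary geometric mean that AGm adjusts, can be computed as in the sketch below. The abstract does not state the AGm formula, so it is deliberately not implemented here; the function names and the toy labels are illustrative assumptions.

```python
import math

def sensitivity_specificity(y_true, y_pred):
    """Return (SE, SP) for binary labels, where 1 marks the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    se = tp / (tp + fn) if tp + fn else 0.0   # positive recognition rate
    sp = tn / (tn + fp) if tn + fp else 0.0   # negative recognition rate
    return se, sp

def g_mean(se, sp):
    """Plain geometric mean of SE and SP (not the adjusted AGm)."""
    return math.sqrt(se * sp)

# Toy imbalanced example: 2 positives, 8 negatives.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]

se, sp = sensitivity_specificity(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
gm = g_mean(se, sp)
```

In this toy case accuracy is 0.8 even though only half of the positives are recognized (SE = 0.5, SP = 0.875), which illustrates why measures built from SE and SP are preferred over accuracy on imbalanced data.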
Affiliation(s)
- Rukshan Batuwita
  - University of Oxford, Department of Computer Science, Oxford, OX1 3QD, United Kingdom
- Vasile Palade
  - University of Oxford, Department of Computer Science, Oxford, OX1 3QD, United Kingdom
8. de Avila e Silva S, Echeverrigaray S, Gerhardt GJ. BacPP: Bacterial promoter prediction—A tool for accurate sigma-factor specific assignment in enterobacteria. J Theor Biol 2011; 287:92-9. [DOI: 10.1016/j.jtbi.2011.07.017] [Citation(s) in RCA: 31] [Impact Index Per Article: 2.2] [Received: 10/29/2010] [Revised: 05/20/2011] [Accepted: 07/21/2011] [Indexed: 10/17/2022]
9. Maucher M, Kracher B, Kühl M, Kestler HA. Inferring Boolean network structure via correlation. Bioinformatics 2011; 27:1529-36. [PMID: 21471013 DOI: 10.1093/bioinformatics/btr166] [Citation(s) in RCA: 42] [Impact Index Per Article: 3.0] [Indexed: 11/14/2022]
Abstract
MOTIVATION: Accurate, context-specific regulation of gene expression is essential for all organisms. Accordingly, it is very important to understand the complex relations within cellular gene regulatory networks. Boolean models are one tool for describing and analyzing the behavior of such networks. The reconstruction of a Boolean network from biological data requires identification of dependencies within the network. This task becomes increasingly computationally demanding with the large amounts of data created by recent high-throughput technologies. Thus, we developed a method that is especially suited for network structure reconstruction from large-scale data. In our approach, we took advantage of the fact that a specific transcription factor often will consistently either activate or inhibit a specific target gene, and this kind of regulatory behavior can be modeled using monotone functions.
RESULTS: To detect regulatory dependencies in a network, we examined how the expression of different genes correlates to successive network states. For this purpose, we used Pearson correlation as an elementary correlation measure. Given a Boolean network containing only monotone Boolean functions, we prove that the correlation of successive states can identify the dependencies in the network. This method not only finds a very high percentage of the dependencies in randomly created artificial networks, but also reconstructed large fractions of both a published Escherichia coli regulatory network from simulated data and a yeast cell cycle network from real microarray data.
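The idea of reading dependencies off the correlation between a gene's current state and a target's next state can be sketched on a toy network. This is only an illustration under stated assumptions, not the authors' implementation: the three-gene update rules, the exhaustive enumeration of states (in place of observed time series), the 0.3 threshold, and all function names are hypothetical.

```python
from itertools import product
import math

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)

def step(state):
    """Toy monotone Boolean network (hypothetical rules)."""
    x0, x1, x2 = state
    return (x1 & x2,   # gene 0 is activated by genes 1 AND 2
            1 - x0,    # gene 1 is inhibited by gene 0
            x2)        # gene 2 maintains itself

states = list(product((0, 1), repeat=3))   # all 8 network states
next_states = [step(s) for s in states]

# corr[j][i]: correlation of regulator i at time t with target j at time t+1
corr = [[pearson([s[i] for s in states], [n[j] for n in next_states])
         for i in range(3)] for j in range(3)]

# Call a dependency wherever the absolute correlation exceeds a threshold.
edges = {j: [i for i in range(3) if abs(corr[j][i]) > 0.3] for j in range(3)}
```

Here `edges` recovers the regulators wired into `step`, and the sign of each correlation separates activation (positive) from inhibition (negative), which is what the monotonicity assumption makes possible.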
Affiliation(s)
- Markus Maucher
  - Research group Bioinformatics and Systems Biology, Clinic for Internal Medicine I, University Medical Center Ulm, Ulm, Germany
10. Eukaryotic and prokaryotic promoter prediction using hybrid approach. Theory Biosci 2010; 130:91-100. [DOI: 10.1007/s12064-010-0114-8] [Citation(s) in RCA: 34] [Impact Index Per Article: 2.3] [Received: 08/04/2010] [Accepted: 10/23/2010] [Indexed: 12/27/2022]