1
Tsimenidis S, Vrochidou E, Papakostas GA. Omics Data and Data Representations for Deep Learning-Based Predictive Modeling. Int J Mol Sci 2022; 23:12272. [PMID: 36293133 PMCID: PMC9603455 DOI: 10.3390/ijms232012272] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 08/23/2022] [Revised: 10/03/2022] [Accepted: 10/12/2022] [Indexed: 11/25/2022]
Abstract
Medical discoveries depend largely on the ability to process and analyze biological datasets, which are inundating the scientific community and continue to grow as the cost of next-generation sequencing falls. Deep learning (DL) is a viable way to exploit this massive data stream, since it has advanced quickly through successive innovations. However, applying DL to biology is difficult: both fields are evolving at a breakneck pace, which makes it hard for any one person to stay at the forefront of both. This paper aims to bridge that gap and help computer scientists bring their expertise into the life sciences. It provides an overview of the most common types of biological data and data representations used to train DL models, with additional information on the models themselves and the tasks being tackled. This is the essential information a DL expert with no background in biology needs in order to participate in DL-based research projects in biomedicine, biotechnology, and drug discovery. The study may also help researchers in biology understand and use DL to gain better insights into, and extract important information from, omics data.
Affiliation(s)
- George A. Papakostas
- MLV Research Group, Department of Computer Science, International Hellenic University, 65404 Kavala, Greece
2
Alkady W, ElBahnasy K, Leiva V, Gad W. Classifying COVID-19 based on amino acids encoding with machine learning algorithms. Chemometrics and Intelligent Laboratory Systems 2022; 224:104535. [PMID: 35308181 PMCID: PMC8923015 DOI: 10.1016/j.chemolab.2022.104535] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 10/24/2021] [Revised: 02/27/2022] [Accepted: 03/05/2022] [Indexed: 05/12/2023]
Abstract
COVID-19 causes serious respiratory illness. Accurate identification of the viral infection cycle therefore plays a key role in designing appropriate vaccines. The risk of this disease depends on proteins that interact with human receptors. In this paper, we formulate a novel model for COVID-19 named "amino acid encoding based prediction" (AAPred). This model is accurate, classifies the various coronavirus types, and distinguishes SARS-CoV-2 from other coronaviruses. With the AAPred model, we reduce the number of features to enhance performance, selecting the most important ones using statistical criteria. The protein sequence of SARS-CoV-2 is analyzed to understand the viral infection cycle. Six machine learning classifiers (decision trees, k-nearest neighbors, random forest, support vector machine, bagging ensemble, and gradient boosting) are used to evaluate the model in terms of accuracy, precision, sensitivity, and specificity. We implement the obtained results computationally and apply them to real data from the National Genomics Data Center. The experimental results show that the AAPred model reduces the features to seven. The average accuracy of the 10-fold cross-validation is 98.69%, precision is 98.72%, sensitivity is 96.81%, and specificity is 97.72%. The features are selected using information gain and classified with random forest. The proposed model predicts the type of coronavirus and reduces the number of extracted features. We find that SARS-CoV-2 has physicochemical characteristics similar to those of SARS-CoV in some regions. We also report that SARS-CoV-2 has infection cycles and sequences similar to those of SARS-CoV in some regions, indicating the effectiveness of vaccines against SARS-CoV-2. A comparison with deep learning shows results similar to those of our method.
Affiliation(s)
- Walaa Alkady
- Department of Bioinformatics, Faculty of Computer and Information Sciences, Ain Shams University, Cairo, Egypt
- Khaled ElBahnasy
- Department of Information Systems, Faculty of Computer and Information Sciences, Ain Shams University, Cairo, Egypt
- Víctor Leiva
- School of Industrial Engineering, Pontificia Universidad Católica de Valparaíso, Valparaíso, Chile
- Walaa Gad
- Department of Information Systems, Faculty of Computer and Information Sciences, Ain Shams University, Cairo, Egypt
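The four evaluation measures named in the AAPred abstract (accuracy, precision, sensitivity, specificity) all derive from the same confusion-matrix counts. The following is a minimal sketch, assuming a binary labelling with a single positive class; the function name `binary_metrics` and the toy labels are illustrative and are not taken from the paper, which classifies several coronavirus types.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, sensitivity and specificity for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    def ratio(num, den):
        # Guard against empty denominators (e.g. no predicted positives).
        return num / den if den else 0.0

    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "precision": ratio(tp, tp + fp),
        "sensitivity": ratio(tp, tp + fn),  # also called recall
        "specificity": ratio(tn, tn + fp),
    }


# Toy labels, purely illustrative (hypothetical data).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

In a 10-fold cross-validation such as the one reported, these values would be computed per fold and then averaged.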
3
Wang K, Zhao X, Wang X. A large-scale prediction of protein-protein interactions based on random forest and matrix of sequence. BIO Web of Conferences 2022. [DOI: 10.1051/bioconf/20225501017] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/23/2022]
Abstract
Protein-protein interactions (PPIs) are an important part of many life activities in organisms, and their prediction is closely related to protein function, disease occurrence, and disease treatment. To improve the prediction of protein interactions, an RT-MOS model was constructed based on Random Forest (RF) and Matrix of Sequence (MOS). First, MOS is used to encode the protein sequences into a 29-dimensional feature vector. Then, a prediction model, RT-MOS, is built on random forest and is optimized and evaluated using the test set. Finally, the optimized RT-MOS model is used for prediction. The experimental results show that the accuracy of the RT-MOS model on the benchmark dataset and the non-redundant dataset is 97.18% and 91.34%, respectively, and that its accuracy on four external datasets of C. elegans, Drosophila, E. coli and H. sapiens is 96.21%, 97.86%, 97.54% and 97.75%, respectively. RT-MOS is found to outperform existing methods. The experimental results show that RT-MOS saves time, prevents overfitting, achieves high accuracy, and is suitable for large-scale PPI prediction.
4
Halder AK, Bandyopadhyay SS, Chatterjee P, Nasipuri M, Plewczynski D, Basu S. JUPPI: A Multi-Level Feature Based Method for PPI Prediction and a Refined Strategy for Performance Assessment. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2022; 19:531-542. [PMID: 32750875 DOI: 10.1109/tcbb.2020.3004970] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 06/11/2023]
Abstract
Over the years, several methods have been proposed for computational PPI prediction, each with a different performance evaluation strategy. When benchmarking performance scores, most of these methods suffer from poorly designed cross-validation strategies, ad hoc selection of positive/negative samples, and similar issues. To address these issues, our proposed multi-level feature based PPI prediction approach (JUPPI), which uses sequence, domain and GO information as features, introduces a refined evaluation strategy. During the evaluation process, we first extract high-quality negative data using three-stage filtering, and then introduce a pair-input based cross-validation strategy with three difficulty levels for test-set predictions. Our proposed evaluation strategy reduces the component-level overlap in test sets. The performance of JUPPI is compared with that of state-of-the-art approaches in this domain and tested on six independent PPI datasets. On almost all the datasets, JUPPI outperforms the state of the art, not only for PPI prediction at the human proteome level but also for predicting interactors of intrinsically disordered human proteins. The JUPPI tool and the developed datasets (JUPPId) are publicly available for academic use at https://figshare.com/projects/JUPPI_A_Multi-level_Feature_Based_Method_for_PPI_Prediction_and_a_Refined_Strategy_for_Performance_Assessment/81656, along with supplementary materials, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TCBB.2020.3004970.
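The abstract above does not spell out how the three difficulty levels are defined. One common way to grade pair-input test sets, assumed here purely for illustration, is by how many proteins of a test pair also occur in some training pair: both, one, or none. The function name `group_by_difficulty` and the protein identifiers are hypothetical, and this sketch should not be read as JUPPI's actual procedure.

```python
from collections import defaultdict


def group_by_difficulty(train_pairs, test_pairs):
    """Group test pairs by how many of their proteins appear in the training pairs.

    Key 2: both proteins seen in training (easiest),
    key 1: exactly one seen,
    key 0: neither seen (hardest, no component-level overlap).
    """
    train_proteins = {protein for pair in train_pairs for protein in pair}
    levels = defaultdict(list)
    for pair in test_pairs:
        shared = sum(1 for protein in pair if protein in train_proteins)
        levels[shared].append(pair)
    return dict(levels)


# Hypothetical identifiers, for illustration only.
train_pairs = [("P1", "P2"), ("P2", "P3"), ("P4", "P5")]
test_pairs = [("P1", "P3"), ("P1", "P6"), ("P7", "P8")]
levels = group_by_difficulty(train_pairs, test_pairs)
print(levels)
```

Reporting scores separately for each key makes it visible how much of a model's accuracy depends on proteins it has already seen during training.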
5
Mahapatra S, Gupta VR, Sahu SS, Panda G. Deep Neural Network and Extreme Gradient Boosting Based Hybrid Classifier for Improved Prediction of Protein-Protein Interaction. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2022; 19:155-165. [PMID: 33621179 DOI: 10.1109/tcbb.2021.3061300] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Indexed: 06/12/2023]
Abstract
Knowledge of protein-protein interactions (PPI) is essential for understanding the behavioral processes of life and disease-causing mechanisms. In this paper, a novel hybrid approach combining a deep neural network (DNN) and an extreme gradient boosting classifier (XGB) is employed for predicting PPI. The hybrid classifier (DNN-XGB) uses a fusion of three sequence-based features as inputs: amino acid composition (AAC), conjoint triad composition (CT), and local descriptor (LD). The DNN extracts hidden information from the raw features through layer-wise abstraction, and this information is passed to the XGB classifier. The 5-fold cross-validation accuracies for the intraspecies interaction datasets of Saccharomyces cerevisiae (core subset), Helicobacter pylori, Saccharomyces cerevisiae, and Human are 98.35, 96.19, 97.37, and 99.74 percent, respectively. Similarly, accuracies of 98.50 and 97.25 percent are achieved for the interspecies interaction datasets of Human-Bacillus anthracis and Human-Yersinia pestis, respectively. The improved prediction accuracies obtained on the independent test sets and network datasets indicate that DNN-XGB can be used to predict cross-species interactions. It can also provide new insights into signaling pathway analysis, drug target prediction, and disease pathogenesis. The improved performance of the proposed method suggests that the hybrid classifier is a useful tool for PPI prediction. The datasets and source code are available at: https://github.com/SatyajitECE/DNN-XGB-for-PPI-Prediction.
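Two of the sequence-based features named in the abstract above, amino acid composition (AAC) and conjoint triad composition (CT), can be sketched in a few lines. The seven-class residue grouping below is the one commonly used for conjoint triads and is an assumption here, as are the simple frequency normalization, the function names, and the toy sequences; the local descriptor (LD) feature is omitted, and the paper's exact preprocessing may differ.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Seven residue classes commonly used for conjoint triad features (assumed).
CT_GROUPS = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]
AA_TO_CLASS = {aa: idx for idx, group in enumerate(CT_GROUPS) for aa in group}


def clean(seq):
    """Upper-case the sequence and drop non-standard residues such as X."""
    return [aa for aa in seq.upper() if aa in AA_TO_CLASS]


def aac(seq):
    """Amino acid composition: 20 residue frequencies."""
    residues = clean(seq)
    counts = Counter(residues)
    n = max(len(residues), 1)
    return [counts[aa] / n for aa in AMINO_ACIDS]


def conjoint_triad(seq):
    """Conjoint triad composition: frequencies of the 7**3 = 343 class triads."""
    classes = [AA_TO_CLASS[aa] for aa in clean(seq)]
    triads = Counter(zip(classes, classes[1:], classes[2:]))
    total = max(len(classes) - 2, 1)
    return [triads[t] / total for t in product(range(7), repeat=3)]


def pair_features(seq_a, seq_b):
    """Concatenate AAC and CT vectors of both proteins in a candidate pair."""
    return aac(seq_a) + conjoint_triad(seq_a) + aac(seq_b) + conjoint_triad(seq_b)


# Toy sequences, purely illustrative (hypothetical).
features = pair_features("MKVLAAGIVRK", "MDECHNQWYST")
print(len(features))
```

Each protein contributes 20 + 343 = 363 values, so a pair is represented by a 726-dimensional vector that a classifier such as a DNN could take as input.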
6
Wekesa JS, Meng J, Luan Y. A deep learning model for plant lncRNA-protein interaction prediction with graph attention. Mol Genet Genomics 2020; 295:1091-1102. [DOI: 10.1007/s00438-020-01682-w] [Citation(s) in RCA: 16] [Impact Index Per Article: 3.2] [Received: 04/04/2020] [Accepted: 05/01/2020] [Indexed: 02/06/2023]
7
Gui YM, Wang RJ, Wang X, Wei YY. Using Deep Neural Networks to Improve the Performance of Protein–Protein Interactions Prediction. Int J Pattern Recogn 2020. [DOI: 10.1142/s0218001420520126] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Indexed: 11/18/2022]
Abstract
Protein–protein interactions (PPIs) help to elucidate the molecular mechanisms of life activities and play a role in promoting disease treatment and new drug development. With the advent of the proteomics era, a number of PPI prediction methods have emerged. However, the performance of these methods still needs to be improved. To this end, we used dropout to reduce over-fitting in deep neural networks (DNNs) and combined it with three feature extraction methods, conjoint triad (CT), auto covariance (AC) and local descriptor (LD), to build DNN models based on amino acid sequences. The results showed that the accuracy of the CT, AC and LD models increased from 97.11% to 98.12%, 96.84% to 98.17%, and 95.30% to 95.60%, respectively. The loss values of the CT, AC and LD models decreased from 27.47% to 14.96%, 65.91% to 17.82% and 36.23% to 15.34%, respectively. These experimental results show that dropout can improve the performance of the DNN models. The results can serve as a resource for future studies on the prediction of PPIs. The experimental code is available at https://github.com/smalltalkman/hppi-tensorflow.
Affiliation(s)
- Yuan-Miao Gui
- Institute of Intelligent Machines, Hefei Institute of Physics, Chinese Academy of Sciences, Hefei City, Anhui Province, P. R. China
- University of Science and Technology of China, Hefei City, Anhui Province, P. R. China
- Ru-Jing Wang
- Institute of Intelligent Machines, Hefei Institute of Physics, Chinese Academy of Sciences, Hefei City, Anhui Province, P. R. China
- Xue Wang
- Institute of Intelligent Machines, Hefei Institute of Physics, Chinese Academy of Sciences, Hefei City, Anhui Province, P. R. China
- Yuan-Yuan Wei
- Institute of Intelligent Machines, Hefei Institute of Physics, Chinese Academy of Sciences, Hefei City, Anhui Province, P. R. China
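The dropout technique referred to in the abstract by Gui et al. can be illustrated without any framework. The following is a minimal sketch of "inverted" dropout on a list of activations, assuming a drop rate in [0, 1); the function name `dropout` and its parameters are illustrative and are not taken from the authors' code.

```python
import random


def dropout(activations, rate, training=True, rng=None):
    """Inverted dropout: zero each activation with probability `rate` during
    training and scale the survivors by 1 / (1 - rate), so that the expected
    value of each unit is unchanged. At inference time the input is returned
    as-is."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    if not training or rate == 0.0:
        return list(activations)
    rng = rng or random.Random()
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


# Illustrative use with a fixed seed for reproducibility.
hidden = [1.0] * 1000
dropped = dropout(hidden, rate=0.5, rng=random.Random(0))
unchanged = dropout(hidden, rate=0.5, training=False)
print(sum(1 for v in dropped if v != 0.0), "of", len(hidden), "units kept")
```

Because a different random subset of units is silenced at every training step, the network cannot rely on any single unit, which is the mechanism by which dropout reduces over-fitting.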