1
Khanduja A, Mohanty D. SProtFP: a machine learning-based method for functional classification of small ORFs in prokaryotes. NAR Genom Bioinform 2025; 7:lqae186. [PMID: 39781515 PMCID: PMC11704790 DOI: 10.1093/nargab/lqae186] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/16/2024] [Revised: 11/07/2024] [Accepted: 12/17/2024] [Indexed: 01/12/2025] [Open Access]
Abstract
Small proteins (≤100 amino acids) play important roles across all life forms, ranging from unicellular bacteria to higher organisms. In this study, we have developed SProtFP, a machine learning-based method for functional annotation of prokaryotic small proteins into selected functional categories. SProtFP uses independent artificial neural networks (ANNs) trained using a combination of physicochemical descriptors for classifying small proteins into antitoxin type 2, bacteriocin, DNA-binding, metal-binding, ribosomal protein, RNA-binding, type 1 toxin and type 2 toxin proteins. We have also trained a model for identification of small open reading frame (smORF)-encoded antimicrobial peptides (AMPs). Comprehensive benchmarking of SProtFP revealed an average area under the receiver operating characteristic curve (ROC-AUC) of 0.92 during 10-fold cross-validation and ROC-AUCs of 0.94 and 0.93 on held-out balanced and imbalanced test sets, respectively. Utilizing our method to annotate bacterial isolates from the human gut microbiome, we could identify thousands of remote homologs of known small protein families and assign putative functions to uncharacterized proteins. This highlights the utility of SProtFP for large-scale functional annotation of microbiome datasets, especially in cases where sequence homology is low. SProtFP is freely available at http://www.nii.ac.in/sprotfp.html and can be combined with genome annotation tools such as ProsmORF-pred to uncover the functional repertoire of novel small proteins in bacteria.
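For readers unfamiliar with the ROC-AUC figures quoted in this abstract, the following is a minimal sketch of how the metric can be computed and how a dataset can be split into folds for cross-validation. It is not the SProtFP implementation; the function names `roc_auc` and `k_fold_indices` are hypothetical, and the pairwise formulation is only one of several equivalent ways to obtain the value.

```python
def roc_auc(labels, scores):
    """ROC-AUC as the probability that a random positive outscores a random negative.

    Ties between a positive and a negative score count as half a win.
    Labels are assumed to be 1 (positive) or 0 (negative).
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def k_fold_indices(n_items, k):
    """Split range(n_items) into k contiguous folds of near-equal size."""
    folds = []
    start = 0
    for i in range(k):
        # The first (n_items % k) folds receive one extra item.
        size = n_items // k + (1 if i < n_items % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds
```

In a 10-fold setting, each fold would serve once as the validation set while the model is trained on the remaining nine, and the per-fold `roc_auc` values would then be averaged.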
Affiliation(s)
- Akshay Khanduja
  - National Institute of Immunology, Aruna Asaf Ali Marg, New Delhi 110067, India
- Debasisa Mohanty
  - National Institute of Immunology, Aruna Asaf Ali Marg, New Delhi 110067, India
2
Meng C, Hou Y, Zou Q, Shi L, Su X, Ju Y. Rore: robust and efficient antioxidant protein classification via a novel dimensionality reduction strategy based on learning of fewer features. Genomics Inform 2024; 22:29. [PMID: 39633440 PMCID: PMC11616364 DOI: 10.1186/s44342-024-00026-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/27/2024] [Accepted: 10/03/2024] [Indexed: 12/07/2024] [Open Access]
Abstract
In protein identification, researchers increasingly aim to achieve efficient classification using fewer features. While many feature selection methods effectively reduce the number of model features, they often lose information by merely selecting or discarding features, which limits classifier performance. To address this issue, we present Rore, an algorithm based on a feature-dimensionality reduction strategy. By mapping the original features to a latent space, Rore retains all relevant feature information while using fewer representations of the latent features. This approach largely preserves the original information and overcomes the information loss problem associated with previous feature selection methods. Through extensive experimental validation and analysis, Rore demonstrated excellent performance on an antioxidant protein dataset, achieving an accuracy of 95.88% and an MCC of 91.78% using vectors of only 15 features. The Rore algorithm is available online at http://112.124.26.17:8021/Rore.
Affiliation(s)
- Chaolu Meng
  - College of Computer and Information Engineering, Inner Mongolia Agricultural University, Hohhot, China
  - Inner Mongolia Autonomous Region Key Laboratory of Big Data Research and Application of Agriculture and Animal Husbandry, Hohhot, China
- Yongqi Hou
  - School of Computer Science, Inner Mongolia University, Hohhot, China
- Quan Zou
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
- Lei Shi
  - Department of Spine Surgery, Changzheng Hospital, Naval Medical University, Huangpu District, No. 415, Fengyang Road, Shanghai, China
- Xi Su
  - Foshan Women and Children Hospital, Foshan, China
- Ying Ju
  - School of Informatics, Xiamen University, Xiamen, China
3
Bizzotto E, Zampieri G, Treu L, Filannino P, Di Cagno R, Campanaro S. Classification of bioactive peptides: A systematic benchmark of models and encodings. Comput Struct Biotechnol J 2024; 23:2442-2452. [PMID: 38867723 PMCID: PMC11168199 DOI: 10.1016/j.csbj.2024.05.040] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/19/2024] [Revised: 05/10/2024] [Accepted: 05/22/2024] [Indexed: 06/14/2024] [Open Access]
Abstract
Bioactive peptides are short amino acid chains possessing biological activity and exerting physiological effects relevant to human health. Despite their therapeutic value, their identification remains a major problem, as it mainly relies on time-consuming in vitro tests. While bioinformatic tools for the identification of bioactive peptides are available, they are focused on specific functional classes and have not been systematically tested on realistic settings. To tackle this problem, bioactive peptide sequences and functions were here gathered from a variety of databases to generate a unified collection of bioactive peptides from microbial fermentation. This collection was organized into nine functional classes including some previously studied and some unexplored such as immunomodulatory, opioid and cardiovascular peptides. Upon assessing their sequence properties, four alternative encoding methods were tested in combination with a multitude of machine learning algorithms, from basic classifiers like logistic regression to advanced algorithms like BERT. Tests on a total of 171 models showed that, while some functions are intrinsically easier to detect, no single combination of classifiers and encoders worked universally well for all classes. For this reason, we unified all the best individual models for each class and generated CICERON (Classification of bIoaCtive pEptides fRom micrObial fermeNtation), a classification tool for the functional classification of peptides. State-of-the-art classifiers were found to underperform on our realistic benchmark dataset compared to the models included in CICERON. Altogether, our work provides a tool for real-world peptide classification and can serve as a benchmark for future model development.
Affiliation(s)
- Edoardo Bizzotto
  - Department of Biology, University of Padua, Via U. Bassi 58/b, Padova 35131, Italy
- Guido Zampieri
  - Department of Biology, University of Padua, Via U. Bassi 58/b, Padova 35131, Italy
- Laura Treu
  - Department of Biology, University of Padua, Via U. Bassi 58/b, Padova 35131, Italy
- Pasquale Filannino
  - Department of Soil, Plant and Food Science, University of Bari Aldo Moro, Via G. Amendola 165/a, Bari 70126, Italy
- Raffaella Di Cagno
  - Faculty of Agricultural, Environmental and Food Sciences, Free University of Bolzano, Piazza Universita, 5, Bolzano 39100, Italy
- Stefano Campanaro
  - Department of Biology, University of Padua, Via U. Bassi 58/b, Padova 35131, Italy
4
Rukh G, Akbar S, Rehman G, Alarfaj FK, Zou Q. StackedEnC-AOP: prediction of antioxidant proteins using transform evolutionary and sequential features based multi-scale vector with stacked ensemble learning. BMC Bioinformatics 2024; 25:256. [PMID: 39098908 PMCID: PMC11298090 DOI: 10.1186/s12859-024-05884-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/19/2024] [Accepted: 07/29/2024] [Indexed: 08/06/2024] [Open Access]
Abstract
BACKGROUND Antioxidant proteins are involved in several biological processes and can protect DNA and cells from free-radical damage. These proteins regulate the body's oxidative stress and play a significant role in many antioxidant-based drugs. Current in vitro-based approaches are costly, time-consuming, and unable to efficiently screen and identify the targeted motif of antioxidant proteins. METHODS In this work, we propose an accurate prediction method for discriminating antioxidant proteins, named StackedEnC-AOP. The training sequences are encoded by applying two levels of discrete wavelet transform (DWT) to the PSSM-based evolutionary matrix, treated as an image, to form a pseudo position-specific scoring matrix (PsePSSM-DWT) based embedded vector. Additionally, the evolutionary difference formula and composite physicochemical properties methods are employed to collect structural and sequential descriptors. The combined vector of sequential features, evolutionary descriptors, and physicochemical properties is then produced to cover the flaws of individual encoding schemes. To reduce the computational cost of the combined feature vector, the optimal features are chosen using minimum redundancy maximum relevance (mRMR). The optimal feature vector is used to train a stacking-based ensemble meta-model. RESULTS StackedEnC-AOP reported a prediction accuracy of 98.40% and an AUC of 0.99 on the training sequences. On an independent set, the trained StackedEnC-AOP model achieved an accuracy of 96.92% and an AUC of 0.98. CONCLUSION The proposed StackedEnC-AOP strategy performed significantly better than current computational models, with ~5% and ~3% improved accuracy on the training and independent sets, respectively. The efficacy and consistency of StackedEnC-AOP make it a valuable tool for data scientists, and it may play a key role in academic research and drug design.
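To illustrate what a two-level DWT decomposition does to a numeric profile, here is a minimal sketch that applies the Haar wavelet to a one-dimensional signal, such as a single column of a PSSM. The abstract does not state which wavelet family the authors used or how the matrix is traversed, so the Haar choice, the one-dimensional treatment, and the names `haar_step` and `haar_dwt` are assumptions made purely for illustration.

```python
import math


def haar_step(signal):
    """One level of the orthonormal Haar DWT: returns (approximation, detail)."""
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    s = math.sqrt(2.0)
    # Pairwise scaled sums give the low-pass (approximation) coefficients,
    # pairwise scaled differences give the high-pass (detail) coefficients.
    approx = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return approx, detail


def haar_dwt(signal, levels=2):
    """Multi-level decomposition, returned as [cA_n, cD_n, ..., cD_1]."""
    coeffs = []
    approx = list(signal)
    for _ in range(levels):
        # Each level re-decomposes only the previous approximation band.
        approx, detail = haar_step(approx)
        coeffs.insert(0, detail)
    coeffs.insert(0, approx)
    return coeffs
```

Summary statistics of each coefficient band (for example mean and standard deviation) are a common way to turn such a decomposition into a fixed-length feature vector, although whether StackedEnC-AOP does so is not stated in the abstract.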
Affiliation(s)
- Gul Rukh
  - Department of Zoology, Abdul Wali Khan University Mardan, Mardan, 23200, KP, Pakistan
- Shahid Akbar
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, 610054, People's Republic of China
  - Department of Computer Science, Abdul Wali Khan University Mardan, Mardan, 23200, KP, Pakistan
- Gauhar Rehman
  - Department of Zoology, Abdul Wali Khan University Mardan, Mardan, 23200, KP, Pakistan
- Fawaz Khaled Alarfaj
  - Department of Management Information Systems (MIS), School of Business, King Faisal University (KFU), 31982, Al-Ahsa, Saudi Arabia
- Quan Zou
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, 610054, People's Republic of China
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, 324000, People's Republic of China
5
Akbar S, Raza A, Zou Q. Deepstacked-AVPs: predicting antiviral peptides using tri-segment evolutionary profile and word embedding based multi-perspective features with deep stacking model. BMC Bioinformatics 2024; 25:102. [PMID: 38454333 PMCID: PMC10921744 DOI: 10.1186/s12859-024-05726-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/01/2023] [Accepted: 03/01/2024] [Indexed: 03/09/2024] [Open Access]
Abstract
BACKGROUND Viral infections have been the main health issue in the last decade. Antiviral peptides (AVPs) are a subclass of antimicrobial peptides (AMPs) with substantial potential to protect the human body against various viral diseases. However, there has been significant production of antiviral vaccines and medications. Recently, the development of AVPs as antiviral agents has suggested an effective way to treat virus-affected cells, and the use of machine learning techniques for developing peptide-based therapeutic agents has attracted increasing interest due to its significant outcomes. The existing wet-laboratory-based approaches are expensive and time-consuming, and cannot effectively screen and predict the targeted motif of antiviral peptides. METHODS In this paper, we propose a novel computational model called Deepstacked-AVPs to discriminate AVPs accurately. The training sequences are numerically encoded using a novel tri-segmentation-based position-specific scoring matrix (PSSM-TS) and word2vec-based semantic features. Composition/Transition/Distribution-Transition (CTDT) is also employed to represent the physicochemical properties based on structural features. A fused vector is then formed from the PSSM-TS features, semantic information, and CTDT descriptors to compensate for the limitations of single encoding methods. Information gain (IG) is applied to choose the optimal feature set. The selected features are used to train a stacked-ensemble classifier. RESULTS The proposed Deepstacked-AVPs model achieved a predictive accuracy of 96.60%, an area under the curve (AUC) of 0.98, and a precision-recall (PR) value of 0.97 using training samples. On the independent samples, our model obtained an accuracy of 95.15%, an AUC of 0.97, and a PR value of 0.97. CONCLUSION Our Deepstacked-AVPs model outperformed existing models with ~4% and ~2% higher accuracy using training and independent samples, respectively. The reliability and efficacy of the proposed Deepstacked-AVPs model make it a valuable tool for scientists, and it may play a beneficial role in pharmaceutical design and academic research.
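The information gain (IG) criterion mentioned in the methods can be sketched as follows. This is an illustrative version for discrete (or already binned) feature values; how Deepstacked-AVPs discretizes its continuous descriptors is not described in the abstract, and the names `entropy`, `information_gain`, and `top_k_features` are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """IG = H(labels) - sum over v of p(v) * H(labels | feature == v)."""
    n = len(labels)
    groups = {}
    for v, y in zip(feature_values, labels):
        groups.setdefault(v, []).append(y)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def top_k_features(columns, labels, k):
    """Rank feature columns (name -> list of discrete values) by IG and keep the top k."""
    ranked = sorted(
        columns,
        key=lambda name: information_gain(columns[name], labels),
        reverse=True,
    )
    return ranked[:k]
```

A feature that perfectly separates the two classes has an IG equal to the label entropy, while a feature independent of the labels has an IG of zero, which is why ranking by IG tends to discard uninformative descriptors first.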
Affiliation(s)
- Shahid Akbar
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, 610054, People's Republic of China
  - Department of Computer Science, Abdul Wali Khan University Mardan, Mardan, 23200, KP, Pakistan
- Ali Raza
  - Department of Physical and Numerical Sciences, Qurtuba University of Science and Information Technology, Peshawar, 25124, KP, Pakistan
- Quan Zou
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, 610054, People's Republic of China
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, 324000, People's Republic of China
6
Meng C, Pei Y, Bu Y, Zou Q, Ju Y. Machine learning-based antioxidant protein identification model: Progress and evaluation. J Cell Biochem 2023; 124:1825-1834. [PMID: 37877550 DOI: 10.1002/jcb.30491] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/08/2023] [Revised: 09/30/2023] [Accepted: 10/06/2023] [Indexed: 10/26/2023]
Abstract
Efficient and accurate identification of antioxidant proteins is of great significance. In recent years, many models for identifying antioxidant proteins have been proposed, but the low sensitivity and high dimensionality of the models are common problems. The generalization ability of the model needs to be improved. Researchers have tried different feature extraction algorithms and feature selection algorithms to obtain the most effective feature combination and have chosen more appropriate classification algorithms and tools to improve model performance. In this article, we systematically reviewed the data set of the most frequently used antioxidant proteins and the method selection for each step of model establishment and discussed the characteristics of each method. We have conducted a detailed analysis of recent research and believe that the practical ability and efficiency of model application can be improved by reducing model dimensions. The key to improving the performance of antioxidant protein recognition models in the future may lie in feature selection, so this paper also focuses on the combination of feature extraction and selection steps in the analysis of the model building process.
Affiliation(s)
- Chaolu Meng
  - College of Computer and Information Engineering, Inner Mongolia Agricultural University, Hohhot, China
  - Inner Mongolia Autonomous Region Key Laboratory of Big Data Research and Application of Agriculture and Animal Husbandry, Hohhot, China
- Yue Pei
  - Computer Network Information Center, Chinese Academy of Sciences, Beijing, China
- Yongbo Bu
  - College of Computer and Information Engineering, Inner Mongolia Agricultural University, Hohhot, China
- Quan Zou
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
- Ying Ju
  - School of Informatics, Xiamen University, Xiamen, China
7
Meng C, Pei Y, Zou Q, Yuan L. DP-AOP: A novel SVM-based antioxidant proteins identifier. Int J Biol Macromol 2023; 247:125499. [PMID: 37414318 DOI: 10.1016/j.ijbiomac.2023.125499] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/04/2023] [Revised: 06/01/2023] [Accepted: 06/19/2023] [Indexed: 07/08/2023]
Abstract
The identification of antioxidant proteins is a challenging yet meaningful task, as they can protect against the damage caused by some free radicals. As an alternative to time-consuming, laborious, and expensive experimental identification methods, efficient identification of antioxidant proteins through machine learning algorithms has become increasingly common. In recent years, researchers have proposed models for identifying antioxidant proteins; unfortunately, although the accuracy of these models is already high, their sensitivity is too low, indicating possible overfitting. Therefore, we developed a new model called DP-AOP for the recognition of antioxidant proteins. We used the SMOTE algorithm to balance the dataset, selected Wei's proposed feature extraction algorithm to obtain 473-dimensional feature vectors, and, based on the sorting function in MRMD, scored and ranked each feature to obtain a feature set ordered by contribution value from high to low. To effectively reduce the feature dimension, we combined the dynamic programming idea to make the local eight features the optimal subset. After obtaining the 36-dimensional feature vectors, we finally selected 17 features through experimental analysis. The SVM classification algorithm was used to implement the model through the libsvm tool. The model achieved satisfactory performance, with an accuracy of 91.076%, SN of 96.4%, SP of 85.8%, MCC of 82.6%, and F1 score of 91.5%. Furthermore, we built a free web server to facilitate researchers' subsequent studies of antioxidant protein recognition. The website is http://112.124.26.17:8003/#/.
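The evaluation metrics quoted above (accuracy, SN, SP, MCC, F1) all derive from the four cells of a binary confusion matrix. The sketch below uses the standard definitions; the function name `binary_metrics` and the counts in the usage note are hypothetical and are not taken from the DP-AOP paper.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Standard binary classification metrics from confusion-matrix counts.

    Assumes both classes are present and at least one positive prediction
    was made, so the SN, SP and precision denominators are non-zero.
    """
    sn = tp / (tp + fn)                      # sensitivity (recall)
    sp = tn / (tn + fp)                      # specificity
    acc = (tp + tn) / (tp + fp + tn + fn)    # accuracy
    precision = tp / (tp + fp)
    f1 = 2 * precision * sn / (precision + sn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally reported as 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"SN": sn, "SP": sp, "ACC": acc, "F1": f1, "MCC": mcc}
```

For example, `binary_metrics(tp=90, fp=20, tn=80, fn=10)` gives SN 0.90, SP 0.80, and accuracy 0.85, showing how accuracy can sit between a high sensitivity and a lower specificity.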
Affiliation(s)
- Chaolu Meng
  - College of Computer and Information Engineering, Inner Mongolia Agricultural University, Hohhot, China
  - Inner Mongolia Autonomous Region Key Laboratory of Big Data Research and Application of Agriculture and Animal Husbandry, China
- Yue Pei
  - College of Computer and Information Engineering, Inner Mongolia Agricultural University, Hohhot, China
- Quan Zou
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, China
- Lei Yuan
  - Department of Hepatobiliary Surgery, Quzhou People's Hospital, China
8
Gu X, Ding Y, Xiao P, He T. A GHKNN model based on the physicochemical property extraction method to identify SNARE proteins. Front Genet 2022; 13:935717. [PMID: 36506312 PMCID: PMC9727185 DOI: 10.3389/fgene.2022.935717] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/16/2022] [Accepted: 11/02/2022] [Indexed: 11/24/2022] [Open Access]
Abstract
SNARE proteins are of great importance, and their loss of function can lead to a variety of diseases. The SNARE protein is known as a membrane fusion protein, and it is crucial for mediating vesicle fusion. An accurate method is therefore needed to identify SNARE proteins. Through extensive experiments, we have developed a binary classification model based on the graph-regularized k-local hyperplane distance nearest neighbor model (GHKNN). The model uses the physicochemical property extraction method to extract protein sequence features and the SMOTE method to upsample them. This combination achieves the most accurate performance for identifying all protein sequences. Finally, we compare the GHKNN-based binary classification model with other classifiers using four different metrics: SN, SP, ACC, and MCC. In experiments, the model performs significantly better than other classifiers.
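The SMOTE upsampling step mentioned in this abstract can be sketched in a few lines. This is a simplified illustration of the general idea (interpolating between a minority-class sample and one of its nearest minority-class neighbours), not the configuration used by the authors; the function name `smote`, the default `k`, and the fixed seed are assumptions.

```python
import random


def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points from a list of minority-class feature vectors.

    Each synthetic point lies on the segment between a randomly chosen
    minority sample and one of its k nearest minority neighbours.
    Requires at least two minority samples.
    """
    rng = random.Random(seed)

    def dist2(a, b):
        # Squared Euclidean distance between two feature vectors.
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: dist2(base, p))[:k]
        mate = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + gap * (m - b) for b, m in zip(base, mate)])
    return synthetic
```

Because every synthetic point is a convex combination of two real minority samples, the new points stay within the region spanned by the minority class instead of simply duplicating existing rows.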
Affiliation(s)
- Xingyue Gu
  - State Key Laboratory of Bioelectronics, School of Biological Science and Medical Engineering, Southeast University, Nanjing, China
- Yijie Ding
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, Zhejiang, China
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
- Pengfeng Xiao
  - State Key Laboratory of Bioelectronics, School of Biological Science and Medical Engineering, Southeast University, Nanjing, China
- Tao He
  - Beidahuang Industry Group General Hospital, Harbin, China
9
Suleman MT, Alkhalifah T, Alturise F, Khan YD. DHU-Pred: accurate prediction of dihydrouridine sites using position and composition variant features on diverse classifiers. PeerJ 2022; 10:e14104. [PMID: 36320563 PMCID: PMC9618264 DOI: 10.7717/peerj.14104] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 05/26/2022] [Accepted: 09/01/2022] [Indexed: 01/21/2023] [Open Access]
Abstract
Background Dihydrouridine (D) is a post-transcriptional modification (PTM) of transfer RNA that occurs abundantly in bacteria, eukaryotes, and archaea. The D modification assists in the stability and conformational flexibility of tRNA. The D modification is also responsible for pulmonary carcinogenesis in humans. Objective For the detection of D sites, mass spectrometry and site-directed mutagenesis have been developed. However, both are labor-intensive and time-consuming methods. The availability of sequence data has provided the opportunity to build computational models for enhancing the identification of D sites. Based on the sequence data, the DHU-Pred model was proposed in this study to find possible D sites. Methodology The model was built by employing comprehensive machine learning and feature extraction approaches. It was then validated using in-demand evaluation metrics and rigorous experimentation and testing approaches. Results DHU-Pred achieved an accuracy of 96.9%, which was considerably higher than that of existing D site predictors. Availability and Implementation A user-friendly web server for the proposed model was also developed and is freely available to researchers.
Affiliation(s)
- Muhammad Taseer Suleman
  - Department of Computer Science, School of Systems and Technology, University of Management & Technology, Lahore, Pakistan
- Tamim Alkhalifah
  - Department of Computer, College of Science and Arts in Ar Rass Qassim University, Ar Rass, Qassim, Saudi Arabia
- Fahad Alturise
  - Department of Computer, College of Science and Arts in Ar Rass Qassim University, Ar Rass, Qassim, Saudi Arabia
- Yaser Daanial Khan
  - Department of Computer Science, School of Systems and Technology, University of Management & Technology, Lahore, Pakistan
10
Zhao J, Zhuang M, Liu J, Zhang M, Zeng C, Jiang B, Wu J, Song X. pHisPred: a tool for the identification of histidine phosphorylation sites by integrating amino acid patterns and properties. BMC Bioinformatics 2022; 23:399. [PMID: 36171552 PMCID: PMC9520798 DOI: 10.1186/s12859-022-04938-x] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 09/15/2022] [Accepted: 09/16/2022] [Indexed: 11/17/2022] [Open Access]
Abstract
Background Protein histidine phosphorylation (pHis) plays critical roles in prokaryotic signal transduction pathways and various eukaryotic cellular processes. It is estimated to account for 6–10% of the phosphoproteome; however, only hundreds of pHis sites have been discovered to date. Due to the inherent disadvantages of experimental methods, developing efficient computational approaches to identify pHis sites is an urgent task. Results Here, we present a novel tool, pHisPred, for accurately identifying pHis sites from protein sequences. We manually collected the largest number of experimentally validated pHis sites to build benchmark datasets. Using randomized tenfold CV, the weighted SVM-RBF model showed better performance than the other four commonly used classification models (LR, KNN, RF, and MLP). From tens of thousands of features, the 140 and 150 most informative features were selected for the eukaryotic and prokaryotic models, respectively. The average AUC and F1-score values of pHisPred were (0.81, 0.40) and (0.78, 0.46) for tenfold CV on the eukaryotic and prokaryotic training datasets, respectively. In addition, pHisPred significantly outperforms other tools on testing datasets, in particular on the eukaryotic one. Conclusion We implemented a Python program of pHisPred, which is freely available for non-commercial use at https://github.com/xiaofengsong/pHisPred. Moreover, users can use it to train new models with their own data. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-022-04938-x.
Affiliation(s)
- Jian Zhao
  - Department of Biomedical Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing, 210016, China
- Minhui Zhuang
  - Department of Biomedical Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing, 210016, China
- Jingjing Liu
  - Department of Biomedical Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing, 210016, China
- Meng Zhang
  - Department of Biomedical Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing, 210016, China
- Cong Zeng
  - Department of Biomedical Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing, 210016, China
- Bin Jiang
  - College of Automation Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing, 211106, China
- Jing Wu
  - School of Biomedical Engineering and Informatics, Nanjing Medical University, Nanjing, 211166, China
- Xiaofeng Song
  - Department of Biomedical Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing, 210016, China
11
The Interplay of Oxidative Stress and ROS Scavenging: Antioxidants as a Therapeutic Potential in Sepsis. Vaccines (Basel) 2022; 10:vaccines10101575. [PMID: 36298439 PMCID: PMC9609850 DOI: 10.3390/vaccines10101575] [Citation(s) in RCA: 38] [Impact Index Per Article: 12.7] [Received: 08/14/2022] [Revised: 09/09/2022] [Accepted: 09/13/2022] [Indexed: 12/05/2022] [Open Access]
Abstract
Oxidative stress resulting from the disproportion of oxidants and antioxidants contributes to both physiological and pathological conditions in sepsis. To combat this, the antioxidant defense system comes into the picture, which contributes to limiting the amount of reactive oxygen species (ROS) leading to the reduction of oxidative stress. However, a strong relationship has been found between scavengers of ROS and antioxidants in preclinical in vitro and in vivo models. ROS is widely believed to cause human pathology most specifically in sepsis, where a small increase in ROS levels activates signaling pathways to initiate biological processes. An inclusive understanding of the effects of ROS scavenging in cellular antioxidant signaling is essentially lacking in sepsis. This review compiles the mechanisms of ROS scavenging as well as oxidative damage in sepsis, as well as antioxidants as a potent therapeutic. Direct interaction between ROS and cellular pathways greatly affects sepsis, but such interaction does not provide the explanation behind diverse biological outcomes. Animal models of sepsis and a number of clinical trials with septic patients exploring the efficiency of antioxidants in sepsis are reviewed. In line with this, both enzymatic and non-enzymatic antioxidants were effective, and results from recent studies are promising. The usage of these potent antioxidants in sepsis patients would greatly impact the field of medicine.
12
Li F, Yin J, Lu M, Yang Q, Zeng Z, Zhang B, Li Z, Qiu Y, Dai H, Chen Y, Zhu F. ConSIG: consistent discovery of molecular signature from OMIC data. Brief Bioinform 2022; 23:6618243. [PMID: 35758241 DOI: 10.1093/bib/bbac253] [Citation(s) in RCA: 55] [Impact Index Per Article: 18.3] [Received: 04/05/2022] [Revised: 05/09/2022] [Accepted: 05/31/2022] [Indexed: 12/12/2022] [Open Access]
Abstract
The discovery of proper molecular signature from OMIC data is indispensable for determining biological state, physiological condition, disease etiology, and therapeutic response. However, the identified signature is reported to be highly inconsistent, and there is little overlap among the signatures identified from different biological datasets. Such inconsistency raises doubts about the reliability of reported signatures and significantly hampers its biological and clinical applications. Herein, an online tool, ConSIG, was constructed to realize consistent discovery of gene/protein signature from any uploaded transcriptomic/proteomic data. This tool is unique in a) integrating a novel strategy capable of significantly enhancing the consistency of signature discovery, b) determining the optimal signature by collective assessment, and c) confirming the biological relevance by enriching the disease/gene ontology. With the increasingly accumulated concerns about signature consistency and biological relevance, this online tool is expected to be used as an essential complement to other existing tools for OMIC-based signature discovery. ConSIG is freely accessible to all users without login requirement at https://idrblab.org/consig/.
Collapse
Affiliation(s)
- Fengcheng Li
- College of Pharmaceutical Sciences, The Second Affiliated Hospital, Zhejiang University School of Medicine, Zhejiang University, Hangzhou 310058, China
| | - Jiayi Yin
- College of Pharmaceutical Sciences, The Second Affiliated Hospital, Zhejiang University School of Medicine, Zhejiang University, Hangzhou 310058, China
| | - Mingkun Lu
- College of Pharmaceutical Sciences, The Second Affiliated Hospital, Zhejiang University School of Medicine, Zhejiang University, Hangzhou 310058, China
| | - Qingxia Yang
- College of Pharmaceutical Sciences, The Second Affiliated Hospital, Zhejiang University School of Medicine, Zhejiang University, Hangzhou 310058, China
| | - Zhenyu Zeng
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, Alibaba-Zhejiang University Joint Research Center of Future Digital Healthcare, Hangzhou 330110, China
| | - Bing Zhang
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, Alibaba-Zhejiang University Joint Research Center of Future Digital Healthcare, Hangzhou 330110, China
| | - Zhaorong Li
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, Alibaba-Zhejiang University Joint Research Center of Future Digital Healthcare, Hangzhou 330110, China
| | - Yunqing Qiu
- State Key Laboratory for Diagnosis and Treatment of Infectious Disease, Collaborative Innovation Center for Diagnosis and Treatment of Infectious Diseases, Zhejiang Provincial Key Laboratory for Drug Clinical Research and Evaluation, The First Affiliated Hospital, Zhejiang University, 79 QingChun Road, Hangzhou, Zhejiang 310000, China
| | - Haibin Dai
- College of Pharmaceutical Sciences, The Second Affiliated Hospital, Zhejiang University School of Medicine, Zhejiang University, Hangzhou 310058, China
| | - Yuzong Chen
- State Key Laboratory of Chemical Oncogenomics, Key Laboratory of Chemical Biology, The Graduate School at Shenzhen, Tsinghua University, Shenzhen 518055, China; Qian Xuesen Collaborative Research Center of Astrochemistry and Space Life Sciences, Institute of Drug Discovery Technology, Ningbo University, Ningbo 315211, China
| | - Feng Zhu
- College of Pharmaceutical Sciences, The Second Affiliated Hospital, Zhejiang University School of Medicine, Zhejiang University, Hangzhou 310058, China; Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, Alibaba-Zhejiang University Joint Research Center of Future Digital Healthcare, Hangzhou 330110, China
| |
|
13
|
Xu C, Zhang R, Duan M, Zhou Y, Bao J, Lu H, Wang J, Hu M, Hu Z, Zhou F, Zhu W. A polygenic stacking classifier revealed the complicated platelet transcriptomic landscape of adult immune thrombocytopenia. MOLECULAR THERAPY - NUCLEIC ACIDS 2022; 28:477-487. [PMID: 35505964 PMCID: PMC9046129 DOI: 10.1016/j.omtn.2022.04.004] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 12/23/2021] [Accepted: 04/01/2022] [Indexed: 01/19/2023]
Abstract
Immune thrombocytopenia (ITP) is an autoimmune disease with the typical symptom of a low platelet count in blood. ITP demonstrated age and sex biases in both occurrence and prognosis, and adult ITP was mainly induced by the living environment. The current diagnosis guideline lacks the integration of molecular heterogeneity. This study recruited the largest cohort of platelet transcriptome samples. A comprehensive procedure of feature selection, feature engineering, and stacking classification was carried out to detect the ITP biomarkers using RNA sequencing (RNA-seq) transcriptomes. The 40 detected biomarkers were loaded to train the final ITP detection model, with an overall accuracy of 0.974. The biomarkers suggested that ITP onset may be associated with various transcribed components, including protein-coding genes, long intergenic non-coding RNA (lincRNA) genes, and pseudogenes with apparent transcription. The delivered ITP detection model may also be utilized as a complementary ITP diagnosis tool. The code and the example dataset are freely available at http://www.healthinformaticslab.org/supp/resources.php
Affiliation(s)
- Chengfeng Xu
- Department of Hematology, Yueyang Hospital of Integrated Traditional Chinese and Western Medicine, Shanghai University of Traditional Chinese Medicine, 110 Ganhe Road, Hongkou District, Shanghai 200437, China
| | - Ruochi Zhang
- College of Computer Science and Technology, Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, Jilin 130012, China
| | - Meiyu Duan
- College of Computer Science and Technology, Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, Jilin 130012, China
| | - Yongming Zhou
- Department of Hematology, Yueyang Hospital of Integrated Traditional Chinese and Western Medicine, Shanghai University of Traditional Chinese Medicine, 110 Ganhe Road, Hongkou District, Shanghai 200437, China
| | - Jizhang Bao
- Department of Hematology, Yueyang Hospital of Integrated Traditional Chinese and Western Medicine, Shanghai University of Traditional Chinese Medicine, 110 Ganhe Road, Hongkou District, Shanghai 200437, China
| | - Hao Lu
- Department of Hematology, Yueyang Hospital of Integrated Traditional Chinese and Western Medicine, Shanghai University of Traditional Chinese Medicine, 110 Ganhe Road, Hongkou District, Shanghai 200437, China
| | - Jie Wang
- Department of Hematology, Yueyang Hospital of Integrated Traditional Chinese and Western Medicine, Shanghai University of Traditional Chinese Medicine, 110 Ganhe Road, Hongkou District, Shanghai 200437, China
| | - Minghui Hu
- Department of Hematology, Yueyang Hospital of Integrated Traditional Chinese and Western Medicine, Shanghai University of Traditional Chinese Medicine, 110 Ganhe Road, Hongkou District, Shanghai 200437, China
| | - Zhaoyang Hu
- Fun-Med Pharmaceutical Technology (Shanghai) Co., Ltd., RM. A310, 115 Xinjunhuan Road, Minhang District, Shanghai 201100, China
- Corresponding author Zhaoyang Hu, PhD, Fengneng Pharmaceutical Technology (Shanghai) Co., Ltd., RM. A310, 115 Xinjunhuan Road, Minhang District, Shanghai 201100, China.
| | - Fengfeng Zhou
- College of Computer Science and Technology, Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, Jilin 130012, China
- Corresponding author Fengfeng Zhou, PhD, College of Computer Science and Technology, and Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, Jilin 130012, China.
| | - Wenwei Zhu
- Department of Hematology, Yueyang Hospital of Integrated Traditional Chinese and Western Medicine, Shanghai University of Traditional Chinese Medicine, 110 Ganhe Road, Hongkou District, Shanghai 200437, China
- Corresponding author Wenwei Zhu, PhD, Department of Hematology, Yueyang Hospital of Integrated Traditional Chinese and Western Medicine, Shanghai University of Traditional Chinese Medicine, 110 Ganhe Road, Hongkou District, Shanghai 200437, China.
| |
|
14
|
Tran HV, Nguyen QH. iAnt: Combination of Convolutional Neural Network and Random Forest Models Using PSSM and BERT Features to Identify Antioxidant Proteins. Curr Bioinform 2022. [DOI: 10.2174/1574893616666210820095144] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/22/2022]
Abstract
Background:
Reactive Oxygen Species (ROS) play many roles in the body, such as cell signaling, homeostasis, or protection from harmful bacteria. However, an excess of ROS in the body will damage lipids, proteins, and DNA. Many studies have shown that various environmental factors increase the amount of ROS produced in the body. Antioxidant proteins are responsible for neutralizing these ROS or free radicals. Although the amount of data on protein sequences has increased over the last two decades, we still lack bioinformatics tools that can accurately identify antioxidant protein sequences. Furthermore, biochemical methods to determine antioxidant proteins are very expensive and time-consuming. Therefore, a machine learning approach is needed to speed up identification.
Methods:
In this study, we propose a new method that combines a convolutional neural network and Random Forest using two features, the normalized PSSM and the best-selected feature of the ProtBert output.
Results:
Our model gave very good results on the independent test dataset with 97.3% sensitivity and 95.9% specificity. Comparison with current state-of-the-art models shows that our model is superior. iAnt is also available as a web server with a user-friendly interface at http://antixiodant.nguyenhongquang.edu.vn.
Conclusion:
iAnt has been developed to accurately identify antioxidant proteins. It outperforms the existing state-of-the-art methods and is also available online.
Affiliation(s)
- Hoang V. Tran
- Department of Computer Engineering, School of Information and Communication Technology, Hanoi University of Science and Technology, Hanoi, Vietnam
| | - Quang H. Nguyen
- Department of Computer Engineering, School of Information and Communication Technology, Hanoi University of Science and Technology, Hanoi, Vietnam
| |
|
15
|
Zhai Y, Zhang J, Zhang T, Gong Y, Zhang Z, Zhang D, Zhao Y. AOPM: Application of Antioxidant Protein Classification Model in Predicting the Composition of Antioxidant Drugs. Front Pharmacol 2022; 12:818115. [PMID: 35115948 PMCID: PMC8803896 DOI: 10.3389/fphar.2021.818115] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/19/2021] [Accepted: 12/20/2021] [Indexed: 11/18/2022]
Abstract
Antioxidant proteins not only balance oxidative stress in the body but are also an important component of antioxidant drugs. Accurate identification of antioxidant proteins is essential to help humans fight diseases and develop new drugs. In this paper, we developed a user-friendly method, AOPM, to identify antioxidant proteins. 188D and the Composition of k-spaced Amino Acid Pairs were adopted as the feature extraction methods. In addition, the Max-Relevance-Max-Distance algorithm (MRMD) and random forest were used as the feature selection method and classifier, respectively. We used 5-fold cross-validation and an independent test dataset to evaluate our model. On the test dataset, AOPM showed higher performance than the state-of-the-art methods. The sensitivity, specificity, accuracy, Matthews Correlation Coefficient and Area Under the Curve reached 87.3%, 94.2%, 92.0%, 0.815 and 0.972, respectively. In addition, AOPM also performs well in predicting the catalytic enzymes of antioxidant drugs. This work demonstrated the feasibility of virtual drug screening based on sequence information and provided new ideas and solutions for drug development.
Affiliation(s)
- Yixiao Zhai
- College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
| | - Jingyu Zhang
- Department of Neurology, the Fourth Affiliated Hospital of Harbin Medical University, Harbin, China
| | - Tianjiao Zhang
- College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
| | - Yue Gong
- College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
| | - Zixiao Zhang
- College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
| | - Dandan Zhang
- Department of Obstetrics and Gynecology, the First Affiliated Hospital of Harbin Medical University, Harbin, China
- *Correspondence: Dandan Zhang; Yuming Zhao
| | - Yuming Zhao
- College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
- *Correspondence: Dandan Zhang; Yuming Zhao
| |
|
16
|
High Throughput Identification of the Potential Antioxidant Peptides in Ophiocordyceps sinensis. MOLECULES (BASEL, SWITZERLAND) 2022; 27:molecules27020438. [PMID: 35056752 PMCID: PMC8780859 DOI: 10.3390/molecules27020438] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 12/15/2021] [Revised: 01/05/2022] [Accepted: 01/06/2022] [Indexed: 11/17/2022]
Abstract
Ophiocordyceps sinensis, an ascomycete caterpillar fungus, has been used as a Traditional Chinese Medicine owing to its bioactive properties. However, until now the bioactive peptides have not been identified in this fungus. Here, the raw RNA sequences of three crucial growth stages of the artificially cultivated O. sinensis and the wild-grown mature fruit-body were aligned to the genome of O. sinensis. Both homology-based prediction and de novo-based prediction methods were used to identify 8541 putative antioxidant peptides (pAOPs). The expression profiles of the cultivated mature fruiting body were similar to those found in the wild specimens. The differential expression of 1008 pAOP-matched genes was greatest between ST and MF, suggesting that the pAOPs were primarily induced and play important roles in the process of fruit-body maturation. Gene ontology analysis showed that most of the pAOP-matched genes were enriched in the terms ‘cell redox homeostasis’, ‘response to oxidative stresses’, ‘catalase activity’, and ‘integral component of cell membrane’. A total of 1655 pAOPs were identified in our protein-seqs, and some crucial pAOPs were selected, including catalase, peroxiredoxin, and SOD [Cu–Zn]. Our findings offer the first identification of the active peptide ingredients in O. sinensis, facilitating the discovery of anti-infectious bioactivity and the understanding of the roles of AOPs in fungal pathogenicity and the high-altitude adaptation in this medicinal fungus.
|
17
|
Bhosale H, Ramakrishnan V, Jayaraman VK. Support vector machine-based prediction of pore-forming toxins (PFT) using distributed representation of reduced alphabets. J Bioinform Comput Biol 2021; 19:2150028. [PMID: 34693886 DOI: 10.1142/s0219720021500281] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Indexed: 12/20/2022]
Abstract
Bacterial virulence can be attributed to a wide variety of factors including toxins that harm the host. Pore-forming toxins are one class of toxins that confer virulence to the bacteria and are one of the promising targets for therapeutic intervention. In this work, we develop a sequence-based machine learning framework for the prediction of pore-forming toxins. For this, we have used a distributed representation of the protein sequence encoded by reduced alphabet schemes based on conformational similarity and hydropathy index as input features to Support Vector Machines (SVMs). The choice of conformational similarity and hydropathy indices is based on the functional mechanism of pore-forming toxins. Our methodology achieves about 81% accuracy, indicating that conformational similarity, an indicator of the flexibility of amino acids, along with the hydropathy index can capture the intrinsic features of pore-forming toxins that distinguish them from other types of transporter proteins. Increased understanding of the mechanisms of pore-forming toxins can further contribute to the use of such "mechanism-informed" features that may increase the prediction accuracy further.
Affiliation(s)
- Hrushikesh Bhosale
- Department of Computer Science, FLAME University, Pune, Maharashtra, India
| | - Vigneshwar Ramakrishnan
- School of Chemical & Biotechnology, SASTRA Deemed-to-be University, Thanjavur, Tamilnadu, India
| | - Valadi K Jayaraman
- Department of Computer Science, FLAME University, Pune, Maharashtra, India
| |
|
18
|
Sohrawordi M, Hossain MA. Prediction of lysine formylation sites using support vector machine based on the sample selection from majority classes and synthetic minority over-sampling techniques. Biochimie 2021; 192:125-135. [PMID: 34627982 DOI: 10.1016/j.biochi.2021.10.001] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 05/15/2021] [Revised: 10/03/2021] [Accepted: 10/05/2021] [Indexed: 12/22/2022]
Abstract
Lysine formylation is a recently discovered post-translational modification (PTM) of considerable interest that is generally found on core and linker histone proteins of prokaryotes and eukaryotes and plays important roles in the regulation of various cellular mechanisms. Hence, accurate identification of formylation sites in proteins is urgently needed to understand the molecular mechanism of formylation and to design drugs for relevant diseases. As experimental identification of formylation sites using traditional processes is expensive and time-consuming, a simple and fast computational model for accurately predicting lysine formylation sites is highly desired. A computational model named PLF_SVM is designed and proposed in this study, using binary encoding (BE), amino acid composition (AAC), reverse position relative incidence matrix (RPRIM), position relative incidence matrix (PRIM), and position specific amino acid propensity (PSAAP) feature generation methods for predicting formylated and non-formylated lysine sites. In addition, the Synthetic Minority Oversampling Technique (SMOTE) and a proposed sample selection strategy named EnSVM are applied to handle the imbalanced training dataset problem. Thereafter, the optimal number of features is selected by the F-score method to train the model. Finally, PLF_SVM outperforms the state-of-the-art approaches in validation and independent tests, with accuracies of 98.61% and 98.77%, respectively. A user-friendly web tool for identifying formylation sites is also available at https://plf-svm.herokuapp.com/. Therefore, the proposed method may be a helpful guideline for the analysis and prediction of formylated lysine and for understanding the process of cellular regulation.
Affiliation(s)
- Md Sohrawordi
- Dept. of Computer Science and Engineering, Rajshahi University of Engineering and Technology, Rajshahi, Bangladesh; Dept. of Computer Science and Engineering, Hajee Mohammad Danesh Science and Technology University, Dinajpur, Bangladesh.
| | - Md Ali Hossain
- Dept. of Computer Science and Engineering, Rajshahi University of Engineering and Technology, Rajshahi, Bangladesh
| |
|
19
|
Liu T, Chen J, Zhang Q, Hippe K, Hunt C, Le T, Cao R, Tang H. The Development of Machine Learning Methods in discriminating Secretory Proteins of Malaria Parasite. Curr Med Chem 2021; 29:807-821. [PMID: 34636289 DOI: 10.2174/0929867328666211005140625] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 03/24/2021] [Revised: 07/28/2021] [Accepted: 08/15/2021] [Indexed: 11/22/2022]
Abstract
Malaria caused by Plasmodium falciparum is one of the major infectious diseases in the world. An effective method to predict secretory proteins of malaria parasites is essential for developing effective cures and treatments. Biochemical assays can provide details for accurate identification of the secretory proteins, but these methods are expensive and time-consuming. In this paper, we summarized the machine learning-based identification algorithms and compared the construction strategies of different computational methods. We also discussed the use of machine learning to improve the ability of algorithms to identify proteins secreted by malaria parasites.
Affiliation(s)
- Ting Liu
- School of Basic Medical Sciences, Southwest Medical University, Luzhou, China
| | - Jiamao Chen
- School of Basic Medical Sciences, Southwest Medical University, Luzhou, China
| | - Qian Zhang
- School of Basic Medical Sciences, Southwest Medical University, Luzhou, China
| | - Kyle Hippe
- Department of Computer Science, Pacific Lutheran University, United States
| | - Cassandra Hunt
- Department of Computer Science, Pacific Lutheran University, United States
| | - Thu Le
- Department of Computer Science, Pacific Lutheran University, United States
| | - Renzhi Cao
- Department of Computer Science, Pacific Lutheran University, United States
| | - Hua Tang
- School of Basic Medical Sciences, Southwest Medical University, Luzhou, China
| |
|
20
|
Shen Z, Liu T, Xu T. Accurate Identification of Antioxidant Proteins Based on a Combination of Machine Learning Techniques and Hidden Markov Model Profiles. COMPUTATIONAL AND MATHEMATICAL METHODS IN MEDICINE 2021; 2021:5770981. [PMID: 34413898 PMCID: PMC8369162 DOI: 10.1155/2021/5770981] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 05/21/2021] [Revised: 07/15/2021] [Accepted: 07/26/2021] [Indexed: 01/19/2023]
Abstract
Antioxidant proteins (AOPs) play important roles in the management and prevention of several human diseases due to their ability to neutralize excess free radicals. However, the identification of AOPs by using wet-lab experimental techniques is often time-consuming and expensive. In this study, we proposed an accurate computational model, called AOP-HMM, to predict AOPs by extracting discriminatory evolutionary features from hidden Markov model (HMM) profiles. First, auto cross-covariance (ACC) variables were applied to transform the HMM profiles into fixed-length feature vectors. Then, we performed the analysis of variance (ANOVA) method to reduce the dimensionality of the raw feature space. Finally, a support vector machine (SVM) classifier was adopted to conduct the prediction of AOPs. To comprehensively evaluate the performance of the proposed AOP-HMM model, the 10-fold cross-validation (CV), the jackknife CV, and the independent test were carried out on two widely used benchmark datasets. The experimental results demonstrated that AOP-HMM outperformed most of the existing methods and could be used to quickly annotate AOPs and guide the experimental process.
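The auto cross-covariance (ACC) step mentioned in this abstract can be illustrated with a minimal sketch. It assumes the profile is given as a list of per-residue rows of numeric column scores; the function name `acc_features` and the `max_lag` parameter are hypothetical and not taken from AOP-HMM, and the normalization shown is one common convention rather than the authors' confirmed implementation.

```python
def acc_features(profile, max_lag=2):
    """Turn an L x C profile into a fixed-length ACC vector.

    profile: list of L rows (one per residue), each a list of C column scores.
    Returns max_lag * C * C values: for each lag, the covariance between
    column j1 at position i and column j2 at position i + lag
    (j1 == j2 gives the auto-covariance terms, j1 != j2 the cross terms).
    Assumes L > max_lag (hypothetical helper, not the published code).
    """
    length = len(profile)
    n_cols = len(profile[0])
    # Column means over the whole sequence
    means = [sum(row[j] for row in profile) / length for j in range(n_cols)]
    features = []
    for lag in range(1, max_lag + 1):
        for j1 in range(n_cols):
            for j2 in range(n_cols):
                total = sum(
                    (profile[i][j1] - means[j1]) * (profile[i + lag][j2] - means[j2])
                    for i in range(length - lag)
                )
                features.append(total / (length - lag))
    return features
```

Because the output length depends only on the number of columns and `max_lag`, sequences of different lengths map to vectors of the same size, which is what makes the result usable as input to a classifier such as an SVM.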
Affiliation(s)
- Zhehan Shen
- College of Information Technology, Shanghai Ocean University, Shanghai 201306, China
| | - Taigang Liu
- College of Information Technology, Shanghai Ocean University, Shanghai 201306, China
| | - Ting Xu
- College of Information Technology, Shanghai Ocean University, Shanghai 201306, China
| |
|
21
|
Carpio LE, Sanz Y, Gozalbes R, Barigye SJ. Computational strategies for the discovery of biological functions of health foods, nutraceuticals and cosmeceuticals: a review. Mol Divers 2021; 25:1425-1438. [PMID: 34258685 PMCID: PMC8277569 DOI: 10.1007/s11030-021-10277-5] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 04/27/2021] [Accepted: 07/07/2021] [Indexed: 11/29/2022]
Abstract
Scientific and consumer interest in healthy foods (also known as functional foods), nutraceuticals and cosmeceuticals has increased in recent years, leading to an increased presence of these products in the market. However, the regulations across different countries that define the type of claims that may be made, and the degree of evidence required to support these claims, are rather inconsistent. Moreover, there is also controversy about the effectiveness and biological mode of action of many of these products, which should undergo an exhaustive approval process to guarantee consumer rights. Computational approaches constitute invaluable tools to facilitate the discovery of bioactive molecules and provide biological plausibility for the mode of action of these products. Indeed, methodologies like QSAR, docking or molecular dynamics have been used in drug discovery protocols for decades and can now aid in the discovery of bioactive food components. Thanks to these approaches, it is possible to search for new functions in food constituents, which may be part of our daily diet, and help to prevent disorders like diabetes, hypercholesterolemia or obesity. In the present manuscript, computational studies applied to this field are reviewed to illustrate the potential of these approaches to guide the first screening steps and the mechanistic studies of nutraceuticals, cosmeceuticals and functional foods.
Affiliation(s)
- Laureano E Carpio
- ProtoQSAR SL, CEEI (Centro Europeo de Empresas Innovadoras), Parque Tecnológico de Valencia, Valencia, Spain
| | - Yolanda Sanz
- Microbial Ecology, Nutrition and Health Research Unit, Institute of Agrochemistry and Food Technology, National Research Council (IATA-CSIC), Valencia, Spain
| | - Rafael Gozalbes
- ProtoQSAR SL, CEEI (Centro Europeo de Empresas Innovadoras), Parque Tecnológico de Valencia, Valencia, Spain
| | - Stephen J Barigye
- ProtoQSAR SL, CEEI (Centro Europeo de Empresas Innovadoras), Parque Tecnológico de Valencia, Valencia, Spain.
- MolDrug AI Systems SL, Valencia, Spain.
| |
|
22
|
ANOX: A robust computational model for predicting the antioxidant proteins based on multiple features. Anal Biochem 2021; 631:114257. [PMID: 34043981 DOI: 10.1016/j.ab.2021.114257] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/15/2021] [Revised: 05/12/2021] [Accepted: 05/14/2021] [Indexed: 11/20/2022]
Abstract
As an indispensable component of various living organisms, the antioxidant proteins have been studied for anti-aging and prevention of various diseases, such as altitude sickness, coronary heart disease, and even cancer. However, the traditional experimental methods for identifying the antioxidant proteins are very expensive and time-consuming. Thus, to address the challenge, a new predictor, named ANOX, was developed in this study. Multiple features, such as frequency matrix features (FRE), amino acid and dipeptide composition (AADP), evolutionary difference formula features (EEDP), k-separated bigrams (KSB), and PSI-PRED secondary structure (PRED), were extracted to generate the original feature space. To find the optimized feature subset, the Max-Relevance-Max-Distance (MRMD) algorithm was implemented for feature ranking and our model received the best performance with the top 1170 features. Rigorous tests were performed to evaluate the performance of ANOX, and the results showed that ANOX achieved a major improvement in the prediction accuracy of the antioxidant proteins (AUC:0.930 and 0.935 using 5-fold cross-validation or the jackknife test) compared to the state-of-the-art predictor AOPs-SVM (AUC:0.869 and 0.885). The dataset used in this study and the source code of ANOX are all available at https://github.com/NWAFU-LiuLab/ANOX.
|
23
|
ANPrAod: Identify Antioxidant Proteins by Fusing Amino Acid Clustering Strategy and N-Peptide Combination. COMPUTATIONAL AND MATHEMATICAL METHODS IN MEDICINE 2021; 2021:5518209. [PMID: 33927782 PMCID: PMC8049822 DOI: 10.1155/2021/5518209] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 02/02/2021] [Revised: 03/02/2021] [Accepted: 03/10/2021] [Indexed: 11/18/2022]
Abstract
Antioxidant proteins perform significant functions in disease control and in delaying aging, as they can prevent free radicals from damaging organisms. Accurate identification of antioxidant proteins has important implications for the development of new drugs and the treatment of related diseases, as they play a critical role in the control or prevention of cancer and aging-related conditions. Since experimental identification techniques are time-consuming and expensive, many computational methods have been proposed to identify antioxidant proteins. Although the accuracy of these methods is acceptable, there are still some challenges. In this study, we developed a computational model called ANPrAod to identify antioxidant proteins based on a support vector machine. In order to eliminate potentially redundant features and improve prediction accuracy, we evaluated 673 reduced amino acid alphabets to find the optimal feature representation scheme. The final model produced an overall accuracy of 87.53% with an ROC of 0.7266 in five-fold cross-validation, which was better than the existing methods. The results on the independent dataset also demonstrated the robustness and reliability of ANPrAod, which could be a promising tool for antioxidant protein identification and contribute to hypothesis-driven experimental design.
|
24
|
Niu M, Lin Y, Zou Q. sgRNACNN: identifying sgRNA on-target activity in four crops using ensembles of convolutional neural networks. PLANT MOLECULAR BIOLOGY 2021; 105:483-495. [PMID: 33385273 DOI: 10.1007/s11103-020-01102-y] [Citation(s) in RCA: 65] [Impact Index Per Article: 16.3] [Received: 05/17/2020] [Accepted: 12/01/2020] [Indexed: 06/12/2023]
Abstract
KEY MESSAGE: We proposed an ensemble convolutional neural network model to identify sgRNA high on-target activity in four crops, and we used one-hot encoding and k-mers for sequence encoding. As a key component of the CRISPR/Cas9 system, single-guide RNA (sgRNA) plays an important role in gene redirection and editing. sgRNA has contributed substantially to the improvement of agronomic species, but there is a lack of effective bioinformatics tools to identify the activity of sgRNA in agronomic species. Therefore, it is necessary to develop a method based on machine learning to identify sgRNA high on-target activity. In this work, we proposed a simple convolutional neural network method to identify sgRNA high on-target activity. Our study used one-hot encoding and k-mers for sequence data conversion and a voting algorithm for constructing the convolutional neural network ensemble model sgRNACNN for the prediction of sgRNA activity. The ensemble model sgRNACNN was used for predictions in four crops: Glycine max, Zea mays, Sorghum bicolor and Triticum aestivum. The accuracy rates of the sgRNACNN model for the four crops were 82.43%, 80.33%, 78.25% and 87.49%, respectively. The experimental results showed that sgRNACNN can identify high on-target activity sgRNAs in agronomic data and can meet the demands of sgRNA activity prediction in agronomy to a certain extent. These results have certain significance for guiding crop gene editing and academic research. The source code and relevant dataset can be found at the following link: https://github.com/nmt315320/sgRNACNN.git.
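The two sequence encodings named in this abstract, one-hot encoding and k-mers, can be sketched as follows. This is a minimal illustration, assuming guide sequences are written in the DNA alphabet `ACGT`; the function names `one_hot` and `kmer_counts` are hypothetical and do not come from the sgRNACNN repository.

```python
from collections import Counter

NUCLEOTIDES = "ACGT"  # assumed alphabet; sequences written as DNA

def one_hot(seq):
    """Encode each base as a 4-element 0/1 vector in A, C, G, T order.

    Bases outside the assumed alphabet (for example N) become all zeros.
    """
    return [[1 if base == n else 0 for n in NUCLEOTIDES] for base in seq]

def kmer_counts(seq, k=3):
    """Count every overlapping substring of length k in the sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
```

One-hot encoding preserves the position of every base, which suits convolutional layers, while k-mer counts summarize local composition independent of position; a model can be trained on either representation or on both.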
Affiliation(s)
- Mengting Niu
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
| | - Yuan Lin
- Department of System Integration, Sparebanken Vest, Bergen, Norway.
| | - Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China.
| |
|
25
|
Ao C, Zhou W, Gao L, Dong B, Yu L. Prediction of antioxidant proteins using hybrid feature representation method and random forest. Genomics 2020; 112:4666-4674. [DOI: 10.1016/j.ygeno.2020.08.016] [Citation(s) in RCA: 27] [Impact Index Per Article: 5.4] [Received: 07/23/2020] [Revised: 08/10/2020] [Accepted: 08/13/2020] [Indexed: 12/19/2022]
|
26
|
Meng C, Wu J, Guo F, Dong B, Xu L. CWLy-pred: A novel cell wall lytic enzyme identifier based on an improved MRMD feature selection method. Genomics 2020; 112:4715-4721. [DOI: 10.1016/j.ygeno.2020.08.015] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 07/10/2020] [Revised: 08/04/2020] [Accepted: 08/13/2020] [Indexed: 10/25/2022]
|
27
|
Zhai Y, Chen Y, Teng Z, Zhao Y. Identifying Antioxidant Proteins by Using Amino Acid Composition and Protein-Protein Interactions. Front Cell Dev Biol 2020; 8:591487. [PMID: 33195258 PMCID: PMC7658297 DOI: 10.3389/fcell.2020.591487] [Citation(s) in RCA: 29] [Impact Index Per Article: 5.8] [Received: 08/04/2020] [Accepted: 09/18/2020] [Indexed: 12/13/2022]
Abstract
Excessive oxidative stress responses can threaten our health, and thus it is essential to produce antioxidant proteins to regulate the body’s oxidative responses. The low number of antioxidant proteins makes it difficult to extract their representative features. Our experimental method did not use structural information but instead studied antioxidant proteins from a sequence perspective while focusing on the impact of data imbalance on sensitivity, thus greatly improving the model’s sensitivity for antioxidant protein recognition. We developed a method based on the Composition of k-spaced Amino Acid Pairs (CKSAAP) and the Conjoint Triad (CT) features derived from the amino acid composition and protein-protein interactions. SMOTE and the Max-Relevance-Max-Distance algorithm (MRMD) were utilized to balance the training data and select the optimal feature subset, respectively. Classification on the selected feature subset used a random forest algorithm evaluated with 10-fold cross-validation. The sensitivity was 0.792, the specificity was 0.808, and the average accuracy was 0.8.
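The CKSAAP descriptor mentioned above counts residue pairs separated by a fixed gap. The following is a minimal sketch under the usual definition (pair counts divided by the number of pairs available for that gap); the function name and the handling of non-standard residues are assumptions, not details taken from the paper.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def cksaap(seq, k):
    """Frequencies of the 400 residue pairs whose members are separated by k residues."""
    pairs = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]
    counts = dict.fromkeys(pairs, 0)
    total = len(seq) - k - 1  # number of (i, i + k + 1) positions in the sequence
    for i in range(max(total, 0)):
        pair = seq[i] + seq[i + k + 1]
        if pair in counts:  # pairs with non-standard residues are skipped (assumption)
            counts[pair] += 1
    return [counts[p] / total if total > 0 else 0.0 for p in pairs]
```

Concatenating the vectors for several gap values (for example k = 0 to 5) gives the full descriptor; the gap range used by the authors is not stated in the abstract.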
Affiliation(s)
- Yixiao Zhai
- Information and Computer Engineering College, Northeast Forestry University, Harbin, China
- Yu Chen
- Information and Computer Engineering College, Northeast Forestry University, Harbin, China
- Zhixia Teng
- Information and Computer Engineering College, Northeast Forestry University, Harbin, China
- Yuming Zhao
- Information and Computer Engineering College, Northeast Forestry University, Harbin, China
|
28
|
Zhang J, Lv L, Lu D, Kong D, Al-Alashaari MAA, Zhao X. Variable selection from a feature representing protein sequences: a case of classification on bacterial type IV secreted effectors. BMC Bioinformatics 2020; 21:480. [PMID: 33109082 PMCID: PMC7590791 DOI: 10.1186/s12859-020-03826-6] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 07/21/2020] [Accepted: 10/19/2020] [Indexed: 12/13/2022]
Abstract
Background: Classification of proteins with specific functions is important for biological research. Encoding approaches of protein sequences for feature extraction play an important role in protein classification. Many computational methods (namely classifiers) are used for classification on protein sequences according to various encoding approaches. Commonly, protein sequences keep certain labels corresponding to different categories of biological functions (e.g., bacterial type IV secreted effectors or not), which makes protein prediction a fantasy. For protein prediction, a kernel set of protein sequences whose labels are certified by biological experiments must exist in advance. However, this is rarely seen in current research. Therefore, unsupervised learning rather than supervised learning (e.g., classification) should be considered. For protein classification, various classifiers may help to evaluate the effectiveness of different encoding approaches. Besides, variable selection from an encoded feature representing protein sequences is an important issue that also needs to be considered. Results: Focusing on the latter problem, we propose a new method for variable selection from an encoded feature representing protein sequences. Taking a benchmark dataset of 1947 protein sequences (399 T4SE and 1548 non-T4SE) as a case, experiments are made to identify bacterial type IV secreted effectors (T4SE) from protein sequences. Comparable and quantified results are obtained using only certain components of the encoded feature, i.e., the position-specific scoring matrix, which indicates the effectiveness of our method. Conclusions: Certain variables, rather than the whole encoded feature they belong to, do work for discrimination between different types of proteins. In addition, ensemble classifiers with an automatic assignment of different base classifiers do achieve a better classification result.
Affiliation(s)
- Jian Zhang
- College of Artificial Intelligence, Wuxi Vocational College of Science and Technology, No. 8 Xinxi Road, Wuxi, 214028, China
- Lixin Lv
- College of Artificial Intelligence, Wuxi Vocational College of Science and Technology, No. 8 Xinxi Road, Wuxi, 214028, China
- Donglei Lu
- College of Artificial Intelligence, Wuxi Vocational College of Science and Technology, No. 8 Xinxi Road, Wuxi, 214028, China
- Denan Kong
- College of Information and Computer Engineering, Northeast Forestry University, No. 26 Hexing Road, Harbin, 150040, China
- Xudong Zhao
- College of Information and Computer Engineering, Northeast Forestry University, No. 26 Hexing Road, Harbin, 150040, China
|
29
|
Li Q, Xu L, Li Q, Zhang L. Identification and Classification of Enhancers Using Dimension Reduction Technique and Recurrent Neural Network. Comput Math Methods Med 2020; 2020:8852258. [PMID: 33133227 PMCID: PMC7591959 DOI: 10.1155/2020/8852258] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 08/25/2020] [Revised: 09/16/2020] [Accepted: 09/30/2020] [Indexed: 12/21/2022]
Abstract
Enhancers are noncoding fragments in DNA sequences, which play an important role in gene transcription and translation. However, due to their high free scattering and positional variability, the identification and classification of enhancers are more complex than those of coding genes. To solve this problem, many computational studies have been carried out in this field, but the existing prediction models still have deficiencies. In this paper, we use various feature extraction strategies, dimension reduction technology, and a combination of a machine learning model and a recurrent neural network model to predict enhancer identification and classification with accuracies of 76.7% and 84.9%, respectively. The model proposed in this paper is superior to previous methods in performance index or feature dimension, which provides inspiration for the computational prediction of enhancers in the future.
Affiliation(s)
- Qingwen Li
- College of Animal Science and Technology, Northeast Agricultural University, Harbin, China
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
- Lei Xu
- School of Electronic and Communication Engineering, Shenzhen Polytechnic, Shenzhen, China
- Qingyuan Li
- Forestry and Fruit Tree Research Institute, Wuhan Academy of Agricultural Sciences, Wuhan, China
- Lichao Zhang
- School of Intelligent Manufacturing and Equipment, Shenzhen Institute of Information Technology, Shenzhen, China
|
30
|
Machine Learning Model for Identifying Antioxidant Proteins Using Features Calculated from Primary Sequences. Biology (Basel) 2020; 9:biology9100325. [PMID: 33036150 PMCID: PMC7599600 DOI: 10.3390/biology9100325] [Citation(s) in RCA: 39] [Impact Index Per Article: 7.8] [Received: 08/27/2020] [Revised: 10/03/2020] [Accepted: 10/04/2020] [Indexed: 12/15/2022]
Abstract
Antioxidant proteins play important roles in many aspects of cellular life. They protect the cell and DNA from oxidative substances (such as peroxide, nitric oxide, and oxygen free radicals), which are known as reactive oxygen species (ROS). Free radical generation and antioxidant defenses are opposing factors in the human body, and the balance between them is necessary to maintain a healthy body. An unhealthy lifestyle or age-related degeneration can break this balance, leading to more ROS than antioxidants and causing damage to health. In general, the antioxidant mechanism is the combination of antioxidant molecules and ROS in a one-electron reaction. Creating computational models to promptly identify antioxidant candidates is essential in supporting antioxidant detection experiments in the laboratory. In this study, we proposed a machine learning-based model for this prediction purpose from a benchmark set of sequencing data. The experiments were conducted using 10-fold cross-validation during training and validated on three different independent datasets. Different machine learning and deep learning algorithms were evaluated on an optimal set of sequence features. Among them, Random Forest was identified as the best model for identifying antioxidant proteins. Our optimal model achieved a high accuracy of 84.6%, as well as a balance between sensitivity (81.5%) and specificity (85.1%) for antioxidant protein identification on the training dataset. The performance results on the independent datasets also showed the significance of our model compared to previously published works on antioxidant protein identification.
|
31
|
Bi Y, Xiang D, Ge Z, Li F, Jia C, Song J. An Interpretable Prediction Model for Identifying N7-Methylguanosine Sites Based on XGBoost and SHAP. Mol Ther Nucleic Acids 2020; 22:362-372. [PMID: 33230441 PMCID: PMC7533297 DOI: 10.1016/j.omtn.2020.08.022] [Citation(s) in RCA: 78] [Impact Index Per Article: 15.6] [Received: 07/14/2020] [Accepted: 08/20/2020] [Indexed: 12/19/2022]
Abstract
Recent studies have increasingly shown that the chemical modification of mRNA plays an important role in the regulation of gene expression. N7-methylguanosine (m7G) is a type of positively-charged mRNA modification that plays an essential role for efficient gene expression and cell viability. However, the research on m7G has received little attention to date. Bioinformatics tools can be applied as auxiliary methods to identify m7G sites in transcriptomes. In this study, we develop a novel interpretable machine learning-based approach termed XG-m7G for the differentiation of m7G sites using the XGBoost algorithm and six different types of sequence-encoding schemes. Both 10-fold and jackknife cross-validation tests indicate that XG-m7G outperforms iRNA-m7G. Moreover, using the powerful SHAP algorithm, this new framework also provides desirable interpretations of the model performance and highlights the most important features for identifying m7G sites. XG-m7G is anticipated to serve as a useful tool and guide for researchers in their future studies of mRNA modification sites.
Affiliation(s)
- Yue Bi
- School of Science, Dalian Maritime University, Dalian 116026, China
- Dongxu Xiang
- Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia
- Zongyuan Ge
- Monash e-Research Centre and Faculty of Engineering, Monash University, Melbourne, VIC 3800, Australia
- Fuyi Li
- Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia
- Cangzhi Jia
- School of Science, Dalian Maritime University, Dalian 116026, China
- Jiangning Song
- Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia; Monash Centre for Data Science, Faculty of Information Technology, Monash University, Melbourne, VIC 3800, Australia
|
32
|
Li FM, Gao XW. Predicting Gram-Positive Bacterial Protein Subcellular Location by Using Combined Features. Biomed Res Int 2020; 2020:9701734. [PMID: 32802888 PMCID: PMC7421015 DOI: 10.1155/2020/9701734] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 05/23/2020] [Revised: 06/30/2020] [Accepted: 07/13/2020] [Indexed: 12/14/2022]
Abstract
There are many bacteria in the environment, and Gram-positive bacteria are the most common ones. Some Gram-positive bacteria are very harmful to the human body, so predicting the subcellular location of Gram-positive bacterial proteins is important, in particular for developing effective drugs. In this paper, a new Gram-positive bacterial protein subcellular location dataset was established. The amino acid composition, the gene ontology annotation information, the hydropathy dipeptide composition information, the amino acid dipeptide composition information, and the autocovariance average chemical shift information were selected as characteristic parameters and then combined. The locations of Gram-positive bacterial proteins were predicted by the Support Vector Machine (SVM) algorithm, and the overall accuracy (OA) reached 86.1% under the jackknife test. The OA of our predictive model was higher than that of existing methods. This improved method may be helpful for protein function prediction.
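The jackknife test reported here is leave-one-out validation: each sample is held out once and predicted by a model built from all the others. A minimal sketch follows; a nearest-neighbor rule stands in for the SVM purely to keep the example dependency-free, and all names are illustrative assumptions.

```python
def nearest_neighbor_predict(train_x, train_y, query):
    """Stand-in classifier: return the label of the closest training vector."""
    best = min(
        range(len(train_x)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], query)),
    )
    return train_y[best]

def jackknife_accuracy(features, labels, predict=nearest_neighbor_predict):
    """Hold out each sample in turn, predict it from the rest, and average the hits."""
    correct = 0
    for i in range(len(features)):
        train_x = features[:i] + features[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if predict(train_x, train_y, features[i]) == labels[i]:
            correct += 1
    return correct / len(features)
```

Any classifier with the same `predict(train_x, train_y, query)` shape could be passed in instead of the stand-in.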
Affiliation(s)
- Feng-Min Li
- College of Science, Inner Mongolia Agricultural University, Hohhot 010018, China
- Xiao-Wei Gao
- College of Science, Inner Mongolia Agricultural University, Hohhot 010018, China
|
33
|
Li X, Tang Q, Tang H, Chen W. Identifying Antioxidant Proteins by Combining Multiple Methods. Front Bioeng Biotechnol 2020; 8:858. [PMID: 32793581 PMCID: PMC7391787 DOI: 10.3389/fbioe.2020.00858] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Received: 05/30/2020] [Accepted: 07/03/2020] [Indexed: 11/13/2022]
Abstract
Antioxidant proteins play important roles in preventing free radical oxidation from damaging cells and DNA. They have become ideal candidates for disease prevention and treatment. Therefore, it is urgent to identify antioxidants from natural compounds. Since experimental methods are still not cost-effective, a series of computational methods have been proposed to identify antioxidant proteins. However, the performance of the current methods is still not satisfactory. In this study, a support vector machine-based method, called Vote9, was proposed to identify antioxidants, in which the sequences were encoded using the features generated from 9 optimal individual models. Results from the jackknife test demonstrated that Vote9 is comparable with the best of the existing predictors for this task. We hope that Vote9 will become a useful tool or at least play a complementary role to the existing methods for identifying antioxidants.
Affiliation(s)
- Xianhai Li
- School of Pharmacy, Chengdu University of Traditional Chinese Medicine, Chengdu, China; Innovative Institute of Chinese Medicine and Pharmacy, Chengdu University of Traditional Chinese Medicine, Chengdu, China
- Qiang Tang
- Innovative Institute of Chinese Medicine and Pharmacy, Chengdu University of Traditional Chinese Medicine, Chengdu, China
- Hua Tang
- Innovative Institute of Chinese Medicine and Pharmacy, Chengdu University of Traditional Chinese Medicine, Chengdu, China
- Wei Chen
- School of Pharmacy, Chengdu University of Traditional Chinese Medicine, Chengdu, China; Innovative Institute of Chinese Medicine and Pharmacy, Chengdu University of Traditional Chinese Medicine, Chengdu, China; School of Life Sciences, Center for Genomics and Computational Biology, North China University of Science and Technology, Tangshan, China
|
34
|
Feng P, Feng L. Recent Advances on Antioxidant Identification Based on Machine Learning Methods. Curr Drug Metab 2020; 21:804-809. [PMID: 32682368 DOI: 10.2174/1389200221666200719001449] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 02/01/2020] [Revised: 03/17/2020] [Accepted: 05/13/2020] [Indexed: 11/22/2022]
Abstract
Antioxidants are molecules that can prevent damage to cells caused by free radicals. Recent studies have also demonstrated that antioxidants play roles in preventing diseases. However, the number of known molecules with antioxidant activity is very small. Therefore, it is necessary to identify antioxidants from various resources. In the past several years, a series of computational methods have been proposed to identify antioxidants. In this review, we briefly summarize recent advances in computationally identifying antioxidants. The challenges and future perspectives for identifying antioxidants are also discussed. We hope this review will provide insights into research on antioxidant identification.
Affiliation(s)
- Pengmian Feng
- School of Basic Medical Sciences, Chengdu University of Traditional Chinese Medicine, Chengdu 611730, China
- Lijing Feng
- School of Sciences, North China University of Science and Technology, Tangshan 063000, China
|
35
|
Missing Value Estimation Methods Research for Arrhythmia Classification Using the Modified Kernel Difference-Weighted KNN Algorithms. Biomed Res Int 2020; 2020:7141725. [PMID: 32685521 PMCID: PMC7327608 DOI: 10.1155/2020/7141725] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 03/27/2020] [Revised: 05/07/2020] [Accepted: 05/09/2020] [Indexed: 12/12/2022]
Abstract
The electrocardiogram (ECG) signal is critical to the classification of cardiac arrhythmia using machine learning methods. In practice, ECG datasets usually contain multiple missing values due to faults or distortion. Unfortunately, many established classification algorithms require a complete matrix as input. It is therefore necessary to impute the missing data to increase the effectiveness of classification for datasets with a few missing values. In this paper, we compare the main methods for estimating the missing values in electrocardiogram data, e.g., the “Zero method”, “Mean method”, “PCA-based method”, and “RPCA-based method”, and then propose a novel KNN-based classification algorithm, i.e., a modified kernel Difference-Weighted KNN classifier (MKDF-WKNN), which is suited to the classification of imbalanced datasets. The experimental results on the UCI database indicate that the “RPCA-based method” can successfully handle missing values in the arrhythmia dataset no matter how many values are missing, and that our proposed classification algorithm, MKDF-WKNN, is superior to other state-of-the-art algorithms such as KNN, DS-WKNN, DF-WKNN, and KDF-WKNN on uneven datasets, whose imbalance affects classification accuracy.
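The two simplest imputation baselines named above, the “Zero method” and the “Mean method”, can be sketched as follows. This is an illustrative reading, assuming missing entries are marked with `None` and that means are taken per column over observed values; the paper's exact procedure is not given in the abstract.

```python
def impute(rows, method="mean"):
    """Fill None entries column-wise with 0.0 or with the mean of the observed values."""
    n_cols = len(rows[0])
    fills = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        if method == "zero" or not observed:
            fills.append(0.0)  # also the fallback for a column with no observed values
        else:
            fills.append(sum(observed) / len(observed))
    return [[fills[j] if v is None else v for j, v in enumerate(r)] for r in rows]
```

The PCA-based and RPCA-based methods need a linear algebra library and are outside the scope of this sketch.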
|
36
|
Meng C, Guo F, Zou Q. CWLy-SVM: A support vector machine-based tool for identifying cell wall lytic enzymes. Comput Biol Chem 2020; 87:107304. [PMID: 32580129 DOI: 10.1016/j.compbiolchem.2020.107304] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.8] [Received: 11/27/2019] [Revised: 06/07/2020] [Accepted: 06/08/2020] [Indexed: 12/21/2022]
Abstract
Cell wall lytic enzymes, an important biotechnical tool in drug development, agriculture and the food industry, have attracted growing research attention. The accurate identification of cell wall lytic enzymes is one of the key and fundamental tasks in this research. In this study, to avoid the inefficiency of in vitro experiments, a support vector machine-based cell wall lytic enzyme identification model was constructed using bioinformatics. This machine learning process includes feature extraction, feature selection, model training and optimization. According to the jackknife cross-validation test, this model obtained a sensitivity of 0.853, a specificity of 0.977, an MCC of 0.845 and an AUC of 0.915. These benchmark results demonstrate that the proposed model outperforms the state-of-the-art method and that it has powerful cell wall lytic enzyme identification ability. Furthermore, we comprehensively analyzed the selected optimal features and used the proposed model to construct a user-friendly web server called CWLy-SVM to identify cell wall lytic enzymes, which is available at http://server.malab.cn/CWLy-SVM/index.jsp.
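Sensitivity, specificity and the Matthews correlation coefficient (MCC) quoted above all derive from the four cells of a binary confusion matrix. The sketch below uses the standard formulas; the function name and the zero-denominator convention for MCC are assumptions.

```python
import math

def binary_metrics(tp, fn, tn, fp):
    """Sensitivity, specificity, accuracy and MCC from confusion-matrix counts."""
    sn = tp / (tp + fn)                   # true positive rate
    sp = tn / (tn + fp)                   # true negative rate
    acc = (tp + tn) / (tp + fn + tn + fp)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0  # 0.0 when undefined (assumption)
    return {"SN": sn, "SP": sp, "ACC": acc, "MCC": mcc}
```

AUC is not included because it requires ranked prediction scores rather than counts.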
Affiliation(s)
- Chaolu Meng
- College of Intelligence and Computing, Tianjin University, Tianjin, China; College of Computer and Information Engineering, Inner Mongolia Agricultural University, Hohhot, China
- Fei Guo
- College of Intelligence and Computing, Tianjin University, Tianjin, China
- Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China; Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
|
37
|
Dao FY, Lv H, Yang YH, Zulfiqar H, Gao H, Lin H. Computational identification of N6-methyladenosine sites in multiple tissues of mammals. Comput Struct Biotechnol J 2020; 18:1084-1091. [PMID: 32435427 PMCID: PMC7229270 DOI: 10.1016/j.csbj.2020.04.015] [Citation(s) in RCA: 72] [Impact Index Per Article: 14.4] [Received: 02/23/2020] [Revised: 04/20/2020] [Accepted: 04/21/2020] [Indexed: 12/12/2022]
Abstract
N6-methyladenosine (m6A) is the methylation of adenosine at the nitrogen-6 position; it is the most abundant RNA methylation modification and is involved in a series of important biological processes. Accurate genome-wide identification of m6A sites is invaluable for better understanding their biological functions. In this work, an ensemble predictor named iRNA-m6A was established to identify m6A sites in multiple tissues of human, mouse and rat based on data from high-throughput sequencing techniques. In the proposed predictor, RNA sequences were encoded by a physical-chemical property matrix, mono-nucleotide binary encoding and nucleotide chemical property. Subsequently, these features were optimized using the minimum Redundancy Maximum Relevance (mRMR) feature selection method. Based on the optimal feature subset, the best m6A classification models were trained by Support Vector Machine (SVM) with a 5-fold cross-validation test. Prediction results on an independent dataset showed that the proposed method has excellent generalization ability. We also established a user-friendly webserver called iRNA-m6A, which is freely accessible at http://lin-group.cn/server/iRNA-m6A. This tool will make it more convenient for users to study m6A modification in different tissues.
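Two of the encodings listed above are simple per-nucleotide lookups. The sketch below assumes the nucleotide chemical property (NCP) mapping commonly used in the RNA modification literature (ring structure, functional group, hydrogen-bond strength); the abstract does not spell the mapping out, so treat the table and the function names as assumptions.

```python
# Assumed NCP table: (purine?, amino group?, weak hydrogen bonding?)
NCP = {"A": (1, 1, 1), "C": (0, 1, 0), "G": (1, 0, 0), "U": (0, 0, 1)}
BINARY = {"A": (1, 0, 0, 0), "C": (0, 1, 0, 0), "G": (0, 0, 1, 0), "U": (0, 0, 0, 1)}

def encode(seq, table):
    """Concatenate the per-nucleotide vectors; unknown symbols become all zeros."""
    width = len(next(iter(table.values())))
    seq = seq.upper().replace("T", "U")  # accept DNA-style input as RNA
    out = []
    for base in seq:
        out.extend(table.get(base, (0,) * width))
    return out
```

For a window of length L, `encode(seq, NCP)` yields 3L values and `encode(seq, BINARY)` yields 4L values, which a feature selector such as mRMR could then prune.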
Affiliation(s)
- Yu-He Yang
- Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
- Hasan Zulfiqar
- Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
- Hui Gao
- Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
- Hao Lin
- Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
|
38
|
PredDBP-Stack: Prediction of DNA-Binding Proteins from HMM Profiles using a Stacked Ensemble Method. Biomed Res Int 2020; 2020:7297631. [PMID: 32352006 PMCID: PMC7174956 DOI: 10.1155/2020/7297631] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 03/02/2020] [Accepted: 04/01/2020] [Indexed: 12/02/2022]
Abstract
DNA-binding proteins (DBPs) play vital roles in all aspects of genetic activities. However, the identification of DBPs by wet-lab experimental approaches is often time-consuming and laborious. In this study, we develop a novel computational method, called PredDBP-Stack, to predict DBPs solely based on protein sequences. First, amino acid composition (AAC) and transition probability composition (TPC) extracted from the hidden Markov model (HMM) profile are adopted to represent a protein. Next, we establish a stacked ensemble model to identify DBPs, which involves two stages of learning. In the first stage, the four base classifiers are trained with the features of HMM-based compositions. In the second stage, the prediction probabilities of these base classifiers are used as inputs to the meta-classifier to perform the final prediction of DBPs. Based on the PDB1075 benchmark dataset, we conduct a jackknife cross-validation with the proposed PredDBP-Stack predictor and obtain a balanced sensitivity and specificity of 92.47% and 92.36%, respectively. This outcome outperforms most of the existing classifiers. Furthermore, our method also achieves superior performance and model robustness on the PDB186 independent dataset. This demonstrates that PredDBP-Stack is an effective classifier for accurately identifying DBPs based on protein sequence information alone.
|
39
|
Meng C, Hu Y, Zhang Y, Guo F. PSBP-SVM: A Machine Learning-Based Computational Identifier for Predicting Polystyrene Binding Peptides. Front Bioeng Biotechnol 2020; 8:245. [PMID: 32296690 PMCID: PMC7137786 DOI: 10.3389/fbioe.2020.00245] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.6] [Received: 01/31/2020] [Accepted: 03/09/2020] [Indexed: 12/11/2022]
Abstract
Polystyrene binding peptides (PSBPs) play a key role in the immobilization process. The correct identification of PSBPs is the first step of all related work. In this paper, we propose a novel support vector machine-based bioinformatic identification model. This model contains four machine learning steps: feature extraction, feature selection, model training and optimization. In a five-fold cross-validation test, this model achieves an SN of 90.38%, an SP of 84.62%, an ACC of 87.50%, and an AUC of 0.90. The model outperforms the state-of-the-art identifier in terms of SN and ACC with a smaller feature set. Furthermore, we constructed a web server that includes the proposed model, which is freely accessible at http://server.malab.cn/PSBP-SVM/index.jsp.
Affiliation(s)
- Chaolu Meng
- College of Intelligence and Computing, Tianjin University, Tianjin, China; College of Computer and Information Engineering, Inner Mongolia Agricultural University, Hohhot, China
- Yang Hu
- School of Life Sciences and Technology, Harbin Institute of Technology, Harbin, China
- Ying Zhang
- Department of Pharmacy, Heilongjiang Province Land Reclamation Headquarters General Hospital, Harbin, China
- Fei Guo
- College of Intelligence and Computing, Tianjin University, Tianjin, China
|
40
|
Zhang ZM, Tan JX, Wang F, Dao FY, Zhang ZY, Lin H. Early Diagnosis of Hepatocellular Carcinoma Using Machine Learning Method. Front Bioeng Biotechnol 2020; 8:254. [PMID: 32292778 PMCID: PMC7122481 DOI: 10.3389/fbioe.2020.00254] [Citation(s) in RCA: 56] [Impact Index Per Article: 11.2] [Received: 01/02/2020] [Accepted: 03/12/2020] [Indexed: 12/18/2022]
Abstract
Hepatocellular carcinoma (HCC) is a serious cancer that ranks fourth in cancer-related deaths worldwide. Hence, more accurate diagnostic models are urgently needed to aid early HCC diagnosis in clinical scenarios and thus improve HCC treatment and survival. Several conventional methods have been used for discriminating HCC from cirrhosis tissues in patients without HCC (CwoHCC). However, their recognition success rates are still far from satisfactory. In this study, we applied a computational approach based on machine learning to a set of microarray data generated from 1091 HCC samples and 242 CwoHCC samples. The within-sample relative expression orderings (REOs) method was used to extract numerical descriptors from gene expression profile datasets. After removing unrelated features using minimum redundancy maximum relevance (mRMR) with incremental feature selection, we obtained an “11-gene-pair” signature that produced outstanding results. We further investigated the discriminative capability of the “11-gene-pair” for HCC recognition on several independent datasets. Good results were obtained, demonstrating that the selected gene pairs can serve as a signature for HCC. The proposed computational model can discriminate HCC and adjacent non-cancerous tissues from CwoHCC even for minimum biopsy specimens and inaccurately sampled specimens, which can be practical and effective for aiding early HCC diagnosis at the individual level.
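A within-sample relative expression ordering turns each gene pair into a binary feature that depends only on which of the two genes is more highly expressed in that one sample. The sketch below illustrates the idea; the gene identifiers and the function name are hypothetical, and the actual 11 gene pairs are not listed in the abstract.

```python
def reo_features(expression, gene_pairs):
    """Return 1 for each pair (a, b) where gene a is expressed above gene b, else 0."""
    return [1 if expression[a] > expression[b] else 0 for a, b in gene_pairs]

# Hypothetical sample and pairs, for illustration only.
sample = {"GENE_A": 5.0, "GENE_B": 2.0, "GENE_C": 7.0}
pairs = [("GENE_A", "GENE_B"), ("GENE_A", "GENE_C")]
features = reo_features(sample, pairs)
```

Because only the ordering matters, multiplying every value in a sample by the same positive factor leaves the features unchanged, which is one reason such descriptors are considered robust to per-sample measurement scale.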
Affiliation(s)
- Zi-Mei Zhang
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Sciences and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
- Jiu-Xin Tan
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Sciences and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
- Fang Wang
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Sciences and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
- Fu-Ying Dao
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Sciences and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
- Zhao-Yue Zhang
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Sciences and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
- Hao Lin
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Sciences and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
|
41
|
Sun S, Ding H, Wang D, Han S. Identifying Antifreeze Proteins Based on Key Evolutionary Information. Front Bioeng Biotechnol 2020; 8:244. [PMID: 32274383 PMCID: PMC7113384 DOI: 10.3389/fbioe.2020.00244] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 01/14/2020] [Accepted: 03/09/2020] [Indexed: 01/08/2023] Open
Abstract
Antifreeze proteins are important antifreeze materials that have been widely used in industry, including in cryopreservation, de-icing, and food storage applications. However, the quantity of some commercially produced antifreeze proteins is insufficient for large-scale industrial applications. Further, many antifreeze proteins have properties such as cytotoxicity that severely hinder their applications. It is therefore urgent to understand the mechanisms underlying protein-ice interactions and to identify novel antifreeze proteins. In this study, to uncover the mechanisms underlying protein-ice interactions and provide an efficient and accurate tool for identifying antifreeze proteins, we assessed various evolutionary features based on position-specific scoring matrices (PSSMs) and evaluated their importance for discriminating between antifreeze and non-antifreeze proteins. We then parsimoniously selected seven key features with the highest importance. We found that the selected features showed opposite tendencies (regarding the conservation of certain amino acids) between antifreeze and non-antifreeze proteins. Five out of the seven features had relatively high contributions to the discrimination of antifreeze and non-antifreeze proteins, as revealed by a principal component analysis, i.e., the conservation of the replacement of Cys, Trp, and Gly in antifreeze proteins by Ala, Met, and Ala, respectively, in the related proteins, and the conservation of the replacement of Arg in non-antifreeze proteins by Ser and Arg in the related proteins. Based on the seven parsimoniously selected key features, we established a classifier using a support vector machine, which outperformed the state-of-the-art tools. These results suggest that understanding evolutionary information is crucial to designing accurate automated methods for discriminating antifreeze and non-antifreeze proteins. Our classifier, therefore, is an efficient tool for annotating new proteins with antifreeze functions based on sequence information and can facilitate their application in industry.
Affiliation(s)
- Shanwen Sun
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
- Hui Ding
  - Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
- Donghua Wang
  - Department of General Surgery, Heilongjiang Province Land Reclamation Headquarters General Hospital, Harbin, China
- Shuguang Han
  - Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
|
42
|
Hou R, Wang L, Wu YJ. Predicting ATP-Binding Cassette Transporters Using the Random Forest Method. Front Genet 2020; 11:156. [PMID: 32269586 PMCID: PMC7109328 DOI: 10.3389/fgene.2020.00156] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 12/26/2019] [Accepted: 02/11/2020] [Indexed: 12/21/2022] Open
Abstract
ATP-binding cassette (ABC) proteins play important roles in a wide variety of species. These proteins are involved in absorbing nutrients, exporting toxic substances, and regulating potassium channels, and they contribute to drug resistance in cancer cells. Therefore, the identification of ABC transporters is an urgent task. The present study used 188D as the feature extraction method, which is based on sequence information and physicochemical properties. We also visualized the extracted features using t-Distributed Stochastic Neighbor Embedding (t-SNE), which indicated that the samples may be separable based on the 188D features. Further, random forest (RF) is an efficient classifier for identifying proteins. Under 10-fold cross-validation of the proposed model on the training set, the average accuracy across the 10 folds was 89.54%. We obtained values of 0.87 for specificity, 0.92 for sensitivity, and 0.79 for MCC. On the testing set, the accuracy achieved was 89%. These results suggest that the model combining 188D with RF is an effective tool to identify ABC transporters.
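The sensitivity, specificity, accuracy, and MCC values quoted in this abstract all derive from the four confusion-matrix counts. The sketch below shows the standard formulas; the counts are hypothetical (a balanced set of 100 positives and 100 negatives, chosen only so that sensitivity and specificity match the rounded values above) and are not the study's data.

```python
import math
from typing import Dict


def binary_metrics(tp: int, fn: int, tn: int, fp: int) -> Dict[str, float]:
    """Standard binary classification metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)          # recall on positives (Sn)
    specificity = tn / (tn + fp)          # recall on negatives (Sp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"sn": sensitivity, "sp": specificity, "acc": accuracy, "mcc": mcc}


# Hypothetical counts for illustration: 100 positives, 100 negatives.
metrics = binary_metrics(tp=92, fn=8, tn=87, fp=13)
print({name: round(value, 3) for name, value in metrics.items()})
```

With these assumed counts the accuracy is 0.895 and the MCC rounds to 0.79, which is consistent with the figures in the abstract if the classes are roughly balanced.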
Affiliation(s)
- Ruiyan Hou
  - Laboratory of Molecular Toxicology, State Key Laboratory of Integrated Management of Pest Insects and Rodents, Institute of Zoology, Chinese Academy of Sciences, Beijing, China
  - College of Life Science, University of Chinese Academy of Sciences, Beijing, China
- Lida Wang
  - Department of Scientific Research, General Hospital of Heilongjiang Province Land Reclamation Bureau, Harbin, China
- Yi-Jun Wu
  - Laboratory of Molecular Toxicology, State Key Laboratory of Integrated Management of Pest Insects and Rodents, Institute of Zoology, Chinese Academy of Sciences, Beijing, China
|
43
|
Fang C, Jia Y, Hu L, Lu Y, Wang H. IMPContact: An Interhelical Residue Contact Prediction Method. Biomed Res Int 2020; 2020:4569037. [PMID: 32309431 PMCID: PMC7140131 DOI: 10.1155/2020/4569037] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/01/2020] [Accepted: 03/09/2020] [Indexed: 11/17/2022]
Abstract
As an important category of proteins, alpha-helix transmembrane proteins (αTMPs) play an important role in various biological activities. Because few αTMP structures have been solved, predicting the residue contacts among the transmembrane segments of an αTMP helps reveal the basis of the protein fold, which can be used to discover more protein functions. A few efforts have been devoted to predicting interhelical residue contacts (IHRCs) using machine learning methods based on prior knowledge of transmembrane protein structure. However, improving the prediction accuracy remains a challenge, and deep learning provides an opportunity to use the structural knowledge from a different perspective. For this purpose, we proposed a novel αTMP residue-residue contact prediction method, IMPContact, in which a convolutional neural network (CNN) was applied to recognize interhelical contacts in a TMP using its specific structural features. Four sequence-based TMP-specific features were selected to describe a pair of residues, namely, evolutionary covariation, predicted topology structure, residue relative position, and evolutionary conservation. An up-to-date dataset was used to train and test IMPContact; our method achieved better performance than peer methods. In the case studies, IHRCs in the regular transmembrane helices were better predicted than those in the irregular ones.
Affiliation(s)
- Chao Fang
  - School of Information Science and Technology, Northeast Normal University, Changchun 130117, China
- Yajie Jia
  - School of Information Science and Technology, Northeast Normal University, Changchun 130117, China
  - Institute of Computational Biology, Northeast Normal University, Changchun 130117, China
- Lihong Hu
  - School of Information Science and Technology, Northeast Normal University, Changchun 130117, China
- Yinghua Lu
  - School of Information Science and Technology, Northeast Normal University, Changchun 130117, China
  - Department of Computer Science, College of Humanities & Sciences of Northeast Normal University, Changchun 130117, China
- Han Wang
  - School of Information Science and Technology, Northeast Normal University, Changchun 130117, China
  - Institute of Computational Biology, Northeast Normal University, Changchun 130117, China
  - Department of Computer Science, College of Humanities & Sciences of Northeast Normal University, Changchun 130117, China
|
44
|
Li HF, Wang XF, Tang H. Predicting Bacteriophage Enzymes and Hydrolases by Using Combined Features. Front Bioeng Biotechnol 2020; 8:183. [PMID: 32266225 PMCID: PMC7105632 DOI: 10.3389/fbioe.2020.00183] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 02/02/2020] [Accepted: 02/24/2020] [Indexed: 12/19/2022] Open
Abstract
Bacteriophages are viruses that infect host bacteria. They have been applied in the treatment of pathogenic bacterial infections. Phage enzymes and hydrolases play the most important role in the destruction of bacterial cells. Correctly identifying the hydrolases encoded by phages is not only beneficial to the study of their function, but also conducive to antibacterial drug discovery. Thus, this work aims to recognize the enzymes and hydrolases in phages. A combination of different features was used to represent samples of phage enzymes and hydrolases. A feature selection technique based on analysis of variance was applied to optimize the features. The classification was performed by using a support vector machine (SVM). The prediction process includes two steps. The first step is to identify phage enzymes. The second step is to determine whether a phage enzyme is a hydrolase or not. The jackknife cross-validated results showed that our method could produce overall accuracies of 85.1% and 94.3%, respectively, for the two predictions, demonstrating that the proposed method is promising.
Affiliation(s)
- Hong-Fei Li
  - Department of Pathophysiology, Key Laboratory of Medical Electrophysiology, Ministry of Education, Southwest Medical University, Luzhou, China
  - School of Computer and Information Engineering, Henan Normal University, Henan, China
- Xian-Fang Wang
  - School of Computer and Information Engineering, Henan Normal University, Henan, China
- Hua Tang
  - Department of Pathophysiology, Key Laboratory of Medical Electrophysiology, Ministry of Education, Southwest Medical University, Luzhou, China
|
45
|
Xu Y, Wen Y, Han G. Antioxidant Proteins' Identification Based on Support Vector Machine. Comb Chem High Throughput Screen 2020; 23:319-325. [PMID: 32141416 DOI: 10.2174/1386207323666200306125538] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 10/14/2019] [Revised: 12/23/2019] [Accepted: 01/13/2020] [Indexed: 12/26/2022]
Abstract
BACKGROUND: Evidence has increasingly indicated that, in human disease, cell metabolism is deeply associated with proteins. Structural mutations and dysregulation of these proteins contribute to the development of complex diseases. Free radicals are unstable molecules that seek electrons from surrounding atoms for stability. Once a free radical binds to an atom in the body, a chain reaction occurs, which causes damage to cells and DNA. An antioxidant protein is a substance that protects cells from free radical damage. Accurate identification of antioxidant proteins is important for understanding their role in delaying aging and in preventing and treating related diseases. Therefore, computational methods to identify antioxidant proteins have become an effective way to pinpoint candidates before experimental verification. METHODS: In this study, a support vector machine was used to identify antioxidant proteins, with amino acid composition and 9-gap dipeptide composition as features and Principal Component Analysis for feature reduction. RESULTS: The prediction accuracy (Acc) reached 98.38%, the recall rate of the positive samples (Sn) was 99.27%, the recall rate of the negative samples (Sp) reached 97.54%, and the MCC value was 0.9678. To evaluate the proposed method, its predictive performance on 20 antioxidant proteins from the National Center for Biotechnology Information (NCBI) was studied. All 20 antioxidant proteins were correctly predicted by our method. Experimental results demonstrate that the performance of our method is better than that of the state-of-the-art methods for identification of antioxidant proteins. CONCLUSION: We collected experimental protein data from UniProt, including 253 antioxidant proteins and 1552 non-antioxidant proteins. The optimal feature set used in this paper is composed of amino acid composition and 9-gap dipeptide composition. The proteins are identified by a support vector machine, and the model evaluation indices are obtained based on 5-fold cross-validation. Comparison with existing classification models further shows that the SVM recognition model constructed in this paper is helpful for the recognition of antioxidant proteins.
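A g-gap dipeptide composition of the kind named in this abstract can be sketched as follows. The sketch assumes the common definition in which a "9-gap dipeptide" is a pair of residues separated by nine intervening positions, and that each of the 400 pair frequencies is normalized by the number of such pairs in the sequence; the function name and the example sequence are hypothetical, and the study's exact encoding may differ.

```python
from itertools import product
from typing import List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def g_gap_dipeptide_composition(seq: str, g: int) -> List[float]:
    """Return a 400-dimensional vector of frequencies of residue pairs
    (seq[i], seq[i + g + 1]), i.e. pairs separated by g positions."""
    pairs = [a + b for a, b in product(AMINO_ACIDS, repeat=2)]
    counts = dict.fromkeys(pairs, 0)
    total = len(seq) - g - 1  # number of g-gap pairs in the sequence
    if total <= 0:
        raise ValueError("sequence is too short for the requested gap")
    for i in range(total):
        pair = seq[i] + seq[i + g + 1]
        if pair in counts:  # pairs with non-standard residues are not counted
            counts[pair] += 1
    return [counts[p] / total for p in pairs]


# Hypothetical 20-residue sequence, for illustration only.
vector = g_gap_dipeptide_composition("ACDEFGHIKLMNPQRSTVWY", g=9)
print(len(vector), sum(1 for v in vector if v > 0))
```

Setting `g=0` gives the ordinary adjacent dipeptide composition, so the same function covers both cases.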
Affiliation(s)
- Yuanke Xu
  - School of Mathematics and Computational Science, Xiangtan University, Hunan, China
- Yaping Wen
  - School of Mathematics and Computational Science, Xiangtan University, Hunan, China
- Guosheng Han
  - School of Mathematics and Computational Science, Xiangtan University, Hunan, China
|
46
|
Lv Z, Zhang J, Ding H, Zou Q. RF-PseU: A Random Forest Predictor for RNA Pseudouridine Sites. Front Bioeng Biotechnol 2020; 8:134. [PMID: 32175316 PMCID: PMC7054385 DOI: 10.3389/fbioe.2020.00134] [Citation(s) in RCA: 66] [Impact Index Per Article: 13.2] [Received: 01/17/2020] [Accepted: 02/10/2020] [Indexed: 12/21/2022] Open
Abstract
One of the ubiquitous chemical modifications in RNA, pseudouridine modification is crucial for various cellular biological and physiological processes. To gain more insight into the functional mechanisms involved, it is of fundamental importance to precisely identify pseudouridine sites in RNA. Several useful machine learning approaches have become available recently, with the increasing progress of next-generation sequencing technology; however, existing methods cannot predict sites with high accuracy. Thus, a more accurate predictor is required. In this study, a random forest-based predictor named RF-PseU is proposed for prediction of pseudouridylation sites. To optimize feature representation and obtain a better model, the light gradient boosting machine algorithm and incremental feature selection strategy were used to select the optimum feature space vector for training the random forest model RF-PseU. Compared with previous state-of-the-art predictors, the results on the same benchmark data sets of three species demonstrate that RF-PseU performs better overall. The integrated average leave-one-out cross-validation and independent testing accuracy scores were 71.4% and 74.7%, respectively, representing increments of 3.63% and 4.77% versus the best existing predictor. Moreover, the final RF-PseU model for prediction was built on leave-one-out cross-validation and provides a reliable and robust tool for identifying pseudouridine sites. A web server with a user-friendly interface is accessible at http://148.70.81.170:10228/rfpseu.
Affiliation(s)
- Zhibin Lv
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
- Jun Zhang
  - Rehabilitation Department, Heilongjiang Province Land Reclamation Headquarters General Hospital, Harbin, China
- Hui Ding
  - Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
- Quan Zou
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
  - Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
|
47
|
Zhang ZY, Yang YH, Ding H, Wang D, Chen W, Lin H. Design powerful predictor for mRNA subcellular location prediction in Homo sapiens. Brief Bioinform 2020; 22:526-535. [PMID: 31994694 DOI: 10.1093/bib/bbz177] [Citation(s) in RCA: 87] [Impact Index Per Article: 17.4] [Received: 09/17/2019] [Revised: 11/05/2019] [Accepted: 11/21/2019] [Indexed: 12/14/2022] Open
Abstract
Messenger RNAs (mRNAs) carry genetic information from DNA to discrete locations in the cytoplasm. The localization of mRNA may provide spatial and temporal regulation of mRNA and protein functions. In situ hybridization and quantitative transcriptomics analysis can provide detailed information about mRNA subcellular localization; however, they are time consuming and expensive. It is therefore highly desirable to develop computational tools for predicting mRNA subcellular location quickly and effectively. In this work, by using binomial distribution and one-way analysis of variance, the optimal nonamer composition was obtained to represent mRNA sequences. Subsequently, a predictor based on a support vector machine was developed to identify mRNA subcellular localization. In 5-fold cross-validation, the accuracy was 90.12% for Homo sapiens (H. sapiens). The predictor may provide a reference for the study of mRNA localization mechanisms and mRNA translocation strategies. An online web server was established based on our models, which is available at http://lin-group.cn/server/iLoc-mRNA/.
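The nonamer composition mentioned in this abstract is a k-mer frequency representation with k = 9. The sketch below covers only the counting step, under the assumption that overlapping windows are used and frequencies are normalized by the number of windows; the feature selection by binomial distribution and analysis of variance is not shown, and the function name and example sequence are hypothetical.

```python
from collections import Counter
from typing import Dict


def kmer_composition(seq: str, k: int = 9) -> Dict[str, float]:
    """Return the frequency of every k-mer observed in seq, using
    overlapping windows. Only observed k-mers are stored, since the
    full nonamer space has 4**9 = 262144 entries."""
    n_windows = len(seq) - k + 1
    if n_windows <= 0:
        raise ValueError("sequence is shorter than k")
    counts = Counter(seq[i:i + k] for i in range(n_windows))
    return {kmer: count / n_windows for kmer, count in counts.items()}


# Hypothetical 12-nucleotide sequence, for illustration only.
composition = kmer_composition("ACGUACGUACGU", k=9)
print(composition)
```

Storing only observed k-mers keeps the representation sparse; a dense vector would be built only for the subset of nonamers retained after feature selection.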
Affiliation(s)
- Zhao-Yue Zhang
  - Center for Informational Biology at University of Electronic Science and Technology of China
- Yu-He Yang
  - Center for Informational Biology at University of Electronic Science and Technology of China
- Hui Ding
  - Center for Informational Biology at University of Electronic Science and Technology of China
- Dong Wang
  - Department of Bioinformatics at Southern Medical University
- Wei Chen
  - Innovative Institute of Chinese Medicine and Pharmacy at Chengdu University of Traditional Chinese Medicine
- Hao Lin
  - Center for Informational Biology at University of Electronic Science and Technology of China
|
48
|
Miao YY, Zhao W, Li GP, Gao Y, Du PF. Predicting Endoplasmic Reticulum Resident Proteins Using Auto-Cross Covariance Transformation With a U-Shaped Residue Weight-Transfer Function. Front Genet 2020; 10:1231. [PMID: 31921288 PMCID: PMC6932965 DOI: 10.3389/fgene.2019.01231] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 10/09/2019] [Accepted: 11/06/2019] [Indexed: 11/13/2022] Open
Abstract
Background: The endoplasmic reticulum (ER) is an important organelle in eukaryotic cells. It is involved in many important biological processes, such as cell metabolism, protein synthesis, and post-translational modification. The proteins that reside within the ER are called ER-resident proteins. These proteins are closely related to the biological functions of the ER. The difference between ER-resident proteins and non-resident proteins should be carefully studied. Methods: We developed a support vector machine (SVM)-based method. We developed a U-shaped weight-transfer function and used it, along with the positional-specific physicochemical properties (PSPCP), to integrate sequence order information, signal peptide information, and evolutionary information. Result: Our method achieved over 86% accuracy in a jackknife test. We also achieved roughly 86% sensitivity and 67% specificity in an independent dataset test. Our method is capable of identifying ER-resident proteins.
Affiliation(s)
- Yang-Yang Miao
  - College of Intelligence and Computing, Tianjin University, Tianjin, China
  - School of Chemical Engineering, Tianjin University, Tianjin, China
- Wei Zhao
  - College of Intelligence and Computing, Tianjin University, Tianjin, China
- Guang-Ping Li
  - College of Intelligence and Computing, Tianjin University, Tianjin, China
- Yang Gao
  - School of Medicine, Nankai University, Tianjin, China
- Pu-Feng Du
  - College of Intelligence and Computing, Tianjin University, Tianjin, China
|