1
Li J, Zhang F, Wen Z, Fang C. AFP-MCDF: Multi and cross-dimensional feature fusion methods for antifreeze protein prediction. Anal Biochem 2025; 704:115881. [PMID: 40348048 DOI: 10.1016/j.ab.2025.115881] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/12/2025] [Revised: 04/22/2025] [Accepted: 04/23/2025] [Indexed: 05/14/2025]
Abstract
Antifreeze proteins can effectively inhibit the formation of ice crystals and enhance cell survival in low-temperature environments. They protect the texture and prolong the shelf life of food, and they maintain cell and tissue integrity in medical treatments, thereby improving the success rate of surgery and transplantation. Accurate prediction of antifreeze proteins is important for advancing these fields. Traditional wet-lab methods provide reliable validation results but are usually time-consuming and costly, and existing computational methods still have room for improvement in predictive performance. In this study, a novel antifreeze protein prediction method, AFP-MCDF, is proposed. The AFP-MCDF method first extracts one- and two-dimensional feature representations of antifreeze protein sequences using the pre-trained protein language models ProtBERT and ESM-2. Subsequently, these features are fused multidimensionally via BiLSTM and TextCNN to capture long-term dependencies and local features. Finally, the method predicts the frost resistance of antifreeze protein sequences by cross-dimensional fusion and a linear mapping from N to 2 dimensions. Experimental results show that AFP-MCDF performs well in the antifreeze protein prediction task, outperforming traditional computational methods and reaching the current state of the art.
Affiliation(s)
- Jinfeng Li
  - Beijing Institute of Petrochemical Technology, Beijing, 102617, China
- Fan Zhang
  - Beijing Institute of Petrochemical Technology, Beijing, 102617, China
- Zhenguo Wen
  - Beijing Institute of Petrochemical Technology, Beijing, 102617, China
- Chun Fang
  - Beijing Institute of Petrochemical Technology, Beijing, 102617, China.
2
Chen S, Zheng P, Zheng L, Yao Q, Meng Z, Lin L, Chen X, Liu R. BERT-DomainAFP: Antifreeze protein recognition and classification model based on BERT and structural domain annotation. iScience 2025; 28:112077. [PMID: 40241758 PMCID: PMC12002629 DOI: 10.1016/j.isci.2025.112077] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/23/2024] [Revised: 01/03/2025] [Accepted: 02/17/2025] [Indexed: 04/18/2025]
Abstract
Antifreeze proteins (AFPs) are crucial for organisms to adapt to low temperatures, with applications in medicine, food storage, aquaculture, and agriculture. Accurate AFP identification is challenging due to structural and sequence diversity. To improve prediction and classification, we propose BERT-DomainAFP, a deep learning model trained on the AntiFreezeDomains dataset created with a novel annotation strategy. The model uses pre-trained ProteinBERT and incorporates oversampling and undersampling techniques to handle unbalanced data, ensuring high predictive ability. BERT-DomainAFP achieves 98.48% accuracy, the highest among existing models, and can classify different AFP types based on structural domain features. This model outperforms current tools, offering a promising solution for AFP recognition and classification in research and applications.
Affiliation(s)
- Shengzhen Chen
  - State Key Laboratory of Mariculture Breeding, Key Laboratory of Marine Biotechnology of Fujian Province, Institute of Oceanology, College of Marine Sciences, Haixia Institute of Science and Technology, Fujian Agriculture and Forestry University, Fuzhou 350002, China
- Ping Zheng
  - State Key Laboratory of Mariculture Breeding, Key Laboratory of Marine Biotechnology of Fujian Province, Institute of Oceanology, College of Marine Sciences, Haixia Institute of Science and Technology, Fujian Agriculture and Forestry University, Fuzhou 350002, China
- Lele Zheng
  - State Key Laboratory of Mariculture Breeding, Key Laboratory of Marine Biotechnology of Fujian Province, Institute of Oceanology, College of Marine Sciences, Haixia Institute of Science and Technology, Fujian Agriculture and Forestry University, Fuzhou 350002, China
- Qinglong Yao
  - State Key Laboratory of Mariculture Breeding, Key Laboratory of Marine Biotechnology of Fujian Province, Institute of Oceanology, College of Marine Sciences, Haixia Institute of Science and Technology, Fujian Agriculture and Forestry University, Fuzhou 350002, China
- Ziyu Meng
  - State Key Laboratory of Mariculture Breeding, Key Laboratory of Marine Biotechnology of Fujian Province, Institute of Oceanology, College of Marine Sciences, Haixia Institute of Science and Technology, Fujian Agriculture and Forestry University, Fuzhou 350002, China
- Longshan Lin
  - Laboratory of Marine Biodiversity Research, Third Institute of Oceanography, Ministry of Natural Resources, Xiamen 361005, China
- Xinhua Chen
  - State Key Laboratory of Mariculture Breeding, Key Laboratory of Marine Biotechnology of Fujian Province, Institute of Oceanology, College of Marine Sciences, Haixia Institute of Science and Technology, Fujian Agriculture and Forestry University, Fuzhou 350002, China
- Ruoyu Liu
  - State Key Laboratory of Mariculture Breeding, Key Laboratory of Marine Biotechnology of Fujian Province, Institute of Oceanology, College of Marine Sciences, Haixia Institute of Science and Technology, Fujian Agriculture and Forestry University, Fuzhou 350002, China
3
Lv Z, Wei M, Pei H, Peng S, Li M, Jiang L. PTSP-BERT: Predict the thermal stability of proteins using sequence-based bidirectional representations from transformer-embedded features. Comput Biol Med 2025; 185:109598. [PMID: 39708499 DOI: 10.1016/j.compbiomed.2024.109598] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/16/2024] [Revised: 12/16/2024] [Accepted: 12/17/2024] [Indexed: 12/23/2024]
Abstract
Thermophilic, mesophilic, and psychrophilic proteins have wide industrial applications, as enzymes with different optimal temperatures are often needed for different purposes. Convenient methods are needed to determine the optimal temperatures for proteins; however, laboratory methods for this purpose are time-consuming and laborious, and existing machine learning methods can only perform binary classification of thermophilic and non-thermophilic proteins, or psychrophilic and non-psychrophilic proteins. Here, we developed a deep learning model, PTSP-BERT, based on protein sequences that can directly perform three-class identification of thermophilic, mesophilic, and psychrophilic proteins. By comparing BERT-bfd with other deep learning models using five-fold cross-validation, we found that BERT-bfd-extracted features achieved the highest accuracy under six classifiers. Furthermore, to improve the model's accuracy, we used SMOTE (synthetic minority oversampling technique) to balance the dataset and a light gradient-boosting machine to rank BERT-bfd-extracted features according to their weights. We obtained the best-performing model with a five-fold cross-validation accuracy of 89.59% and an independent test accuracy of 85.42%. The performance of PTSP-BERT is significantly better than that of existing models in the three-class identification task. To compare with previous binary classification models, we used PTSP-BERT to perform binary classification of thermophilic versus non-thermophilic proteins and of psychrophilic versus non-psychrophilic proteins on an independent test set. PTSP-BERT achieved the highest accuracy on both binary classification tasks, with an accuracy of 93.33% for thermophilic protein classification and 88.33% for psychrophilic protein classification. The independent test accuracy of the model reaches between 89.8% and 92.9% after training and optimization on training sets with different sequence similarities, and the prediction accuracy on new data can exceed 97%. For the convenience of future researchers, we have uploaded the source code of PTSP-BERT to GitHub.
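The SMOTE step mentioned in this abstract can be illustrated with a minimal sketch. It assumes the textbook idea only (a synthetic minority sample is an interpolation between a minority sample and one of its nearest minority neighbours); the function name `smote_sample`, the toy data, and the parameter `k` are hypothetical and are not taken from the paper or from any library.

```python
import random

def smote_sample(minority, k=2, rng=random):
    """Return one synthetic point between a minority sample and a near neighbour.

    Illustrative only: real SMOTE implementations differ in detail.
    """
    base = rng.choice(minority)
    # Rank the other minority points by squared Euclidean distance to `base`.
    others = [p for p in minority if p is not base]
    others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
    neighbour = rng.choice(others[:k])
    gap = rng.random()  # interpolation factor in [0, 1)
    return [a + gap * (b - a) for a, b in zip(base, neighbour)]

# Hypothetical 2-D minority-class feature vectors.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
synthetic = smote_sample(minority, k=2, rng=random.Random(0))
```

Because each synthetic point lies on a segment between two existing minority samples, it stays inside the region the minority class already occupies.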
Affiliation(s)
- Zhibin Lv
  - College of Biomedical Engineering, Sichuan University, Chengdu, 610065, China.
- Mingxuan Wei
  - College of Biomedical Engineering, Sichuan University, Chengdu, 610065, China
- Hongdi Pei
  - Department of Biomedical Engineering, Johns Hopkins University, MD, 21218, USA
- Shiyu Peng
  - College of Biomedical Engineering, Sichuan University, Chengdu, 610065, China
- Mingxin Li
  - College of Biomedical Engineering, Sichuan University, Chengdu, 610065, China
- Liangzhen Jiang
  - College of Food and Biological Engineering, Chengdu University, Chengdu, 610106, China; Country Key Laboratory of Coarse Cereal Processing, Ministry of Agriculture and Rural Affairs, Chengdu, 610106, China
4
Kumar N, Choudhury S, Bajiya N, Patiyal S, Raghava GPS. Prediction of Anti-Freezing Proteins From Their Evolutionary Profile. Proteomics 2025; 25:e202400157. [PMID: 39305039 DOI: 10.1002/pmic.202400157] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/14/2024] [Revised: 08/29/2024] [Accepted: 08/29/2024] [Indexed: 02/06/2025]
Abstract
Prediction of antifreeze proteins (AFPs) holds significant importance due to their diverse applications in healthcare. An inherent limitation of current AFP prediction methods is their reliance on unreviewed proteins for evaluation. This study evaluates proposed and existing methods on an independent dataset containing 80 AFPs and 73 non-AFPs obtained from UniProt, all of which have already been reviewed by experts. Initially, we constructed machine learning models for AFP prediction using selected composition-based protein features and achieved a peak AUROC of 0.90 with an MCC of 0.69 on the independent dataset. Subsequently, we observed a notable enhancement in model performance, with the AUROC increasing from 0.90 to 0.93 upon incorporating evolutionary information instead of relying solely on the primary sequence of proteins. Furthermore, we explored hybrid models integrating our machine learning approaches with BLAST-based similarity and motif-based methods. However, the performance of these hybrid models either matched or was inferior to that of our best machine learning model. Our best model based on evolutionary information outperforms all existing methods on the independent/validation dataset. For users' convenience, a web server with a standalone package named "AFPropred" was developed (https://webs.iiitd.edu.in/raghava/afpropred).
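For readers unfamiliar with the MCC reported above, the following sketch computes the Matthews correlation coefficient from a binary confusion matrix. The confusion counts are invented for illustration on a set of 80 positives and 73 negatives; they are not the paper's results.

```python
import math

def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient for a binary confusion matrix."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when any marginal total is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Hypothetical counts: 80 AFPs (72 found, 8 missed), 73 non-AFPs (65 correct, 8 false alarms).
score = mcc(tp=72, tn=65, fp=8, fn=8)
```

Unlike plain accuracy, MCC uses all four cells of the confusion matrix, so it stays informative when the two classes differ in size.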
Affiliation(s)
- Nishant Kumar
  - Department of Computational Biology, Indraprastha Institute of Information Technology, New Delhi, India
- Shubham Choudhury
  - Department of Computational Biology, Indraprastha Institute of Information Technology, New Delhi, India
- Nisha Bajiya
  - Department of Computational Biology, Indraprastha Institute of Information Technology, New Delhi, India
- Sumeet Patiyal
  - Department of Computational Biology, Indraprastha Institute of Information Technology, New Delhi, India
  - Cancer Data Science Laboratory, National Cancer Institute, National Institutes of Health, Bethesda, Maryland, USA
- Gajendra P S Raghava
  - Department of Computational Biology, Indraprastha Institute of Information Technology, New Delhi, India
5
Qi D, Liu T. VotePLMs-AFP: Identification of antifreeze proteins using transformer-embedding features and ensemble learning. Biochim Biophys Acta Gen Subj 2024; 1868:130721. [PMID: 39426757 DOI: 10.1016/j.bbagen.2024.130721] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/26/2024] [Revised: 09/24/2024] [Accepted: 10/11/2024] [Indexed: 10/21/2024]
Abstract
Antifreeze proteins (AFPs) are a unique class of biomolecules capable of protecting other proteins, cell membranes, and cellular structures within organisms from damage caused by freezing conditions. Given the significance of AFPs in various domains such as biotechnology, agriculture, and medicine, several machine learning methods have been developed to identify AFPs. However, due to the complexity and diversity of AFPs, the predictive performance of existing methods is limited. Therefore, there is an urgent need to develop an efficient and rapid computational method for accurately predicting AFPs. In this study, we proposed a novel predictor based on transformer-embedding features and ensemble learning for the identification of AFPs, termed VotePLMs-AFP. Firstly, three types of feature descriptors were extracted from pre-trained protein language models (PLMs) during the feature extraction process. Subsequently, we analyzed six combinations generated by these three embeddings to explore the optimal feature set, which was input into the soft voting-based ensemble learning classifier for the identification of AFPs. Finally, we evaluated the model on the two benchmark datasets. The experimental results show that our model achieves high prediction accuracy in 10-fold cross-validation (CV) and independent set testing, outperforming existing state-of-the-art methods. Therefore, our model could serve as an effective tool for predicting AFPs.
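The soft-voting idea named in this abstract can be sketched as follows. This is a generic illustration (average the class probabilities of several classifiers, then take the most probable class); the function `soft_vote` and the probability values are hypothetical and do not reproduce the VotePLMs-AFP classifier.

```python
def soft_vote(probabilities):
    """Average per-model class probabilities and return (label, mean probabilities).

    `probabilities` holds one [p_non_afp, p_afp] list per classifier.
    """
    n_models = len(probabilities)
    n_classes = len(probabilities[0])
    mean = [sum(p[c] for p in probabilities) / n_models for c in range(n_classes)]
    label = max(range(n_classes), key=mean.__getitem__)
    return label, mean

# Hypothetical outputs of three classifiers for one sequence.
per_model = [[0.45, 0.55], [0.45, 0.55], [0.95, 0.05]]
label, mean = soft_vote(per_model)

# Hard (majority) voting on the same outputs, for comparison.
hard_votes = [max(range(2), key=p.__getitem__) for p in per_model]
hard_label = max(set(hard_votes), key=hard_votes.count)
```

In this made-up case two weakly confident models vote AFP while one very confident model votes non-AFP, so majority voting and soft voting disagree; soft voting lets confidence, and not only the vote count, decide.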
Affiliation(s)
- Dawei Qi
  - College of Information Technology, Shanghai Ocean University, Shanghai 201306, China
- Taigang Liu
  - College of Information Technology, Shanghai Ocean University, Shanghai 201306, China.
6
Wu J, Liu Y, Zhu Y, Yu DJ. Improving Antifreeze Proteins Prediction With Protein Language Models and Hybrid Feature Extraction Networks. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2024; 21:2349-2358. [PMID: 39316498 DOI: 10.1109/tcbb.2024.3467261] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/26/2024]
Abstract
Accurate identification of antifreeze proteins (AFPs) is crucial in developing biomimetic synthetic anti-icing materials and low-temperature organ preservation materials. Although numerous machine learning-based methods have been proposed for AFP prediction, the complex and diverse nature of AFPs limits the prediction performance of existing methods. In this study, we propose AFP-Deep, a new deep learning method that predicts antifreeze proteins by integrating embeddings of protein sequences from pre-trained protein language models with evolutionary contexts, using hybrid feature extraction networks. The experimental results demonstrated that the main advantage of AFP-Deep is its utilization of pre-trained protein language models, which can extract discriminative global contextual features from protein sequences. Additionally, the hybrid deep neural networks designed for protein language model and evolutionary context feature extraction enhance the correlation between embeddings and antifreeze patterns. The performance evaluation results show that AFP-Deep outperforms state-of-the-art models on benchmark datasets, with AUPRC values of 0.724 and 0.924, respectively.
7
Feng C, Wei H, Li X, Feng B, Xu C, Zhu X, Liu R. A stacking-based algorithm for antifreeze protein identification using combined physicochemical, pseudo amino acid composition, and reduction property features. Comput Biol Med 2024; 176:108534. [PMID: 38754217 DOI: 10.1016/j.compbiomed.2024.108534] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/04/2024] [Revised: 04/03/2024] [Accepted: 04/28/2024] [Indexed: 05/18/2024]
Abstract
Antifreeze proteins have wide applications in the medical and food industries. In this study, we propose a stacking-based classifier that can effectively identify antifreeze proteins. Initially, feature extraction was performed in three aspects: reduction properties, scalable pseudo amino acid composition, and physicochemical properties. A hybrid feature set combining the information from these three categories was obtained. Subsequently, we trained models on the training set with the LightGBM, XGBoost, and RandomForest algorithms, and their outputs were passed to the Logistic algorithm for matching, thereby establishing a stacking algorithm. The proposed algorithm was tested on the test set and an independent validation set. Experimental data indicate that the algorithm achieved a recognition accuracy of 98.3% on the test set and an accuracy of 98.5% on the validation set. Lastly, we analyzed from multiple aspects why the numerical features achieved high recognition capability. Data dimensionality reduction and analysis from two-dimensional and three-dimensional views revealed separability between positive and negative samples, and the protein three-dimensional structure further demonstrated significant differences in related features between the two sample classes. Analysis of the classifier revealed that Hr*Hr, HrHr, and Sc-PseAAC_1, 188D(152,116,57,183) were among the seven most important numerical features affecting algorithm recognition. For Hr*Hr and HrHr, supportive sequence-level evidence for the reduction dictionary was found in terms of conservation area analysis, multiple sequence alignment, and amino acid conservative substitution. Moreover, a comparative analysis of feature importance before and after the reduction confirmed the effectiveness of the reduction dictionary in improving feature importance. A decision tree model was used to discern the distinctions between dipeptides associated with the physical and chemical properties of His(H), Iso(I), Leu(L), and Lys(K) and other dipeptides. Finally, we analyzed the other seven important features, and data analysis confirmed that hydrophobicity, secondary structure, charge properties, van der Waals forces, and solvent accessibility are also factors affecting the antifreeze capability of proteins.
Affiliation(s)
- Changli Feng
  - Department of Information Science and Technology, Taishan University, Taian, 271000, China.
- Haiyan Wei
  - Department of Information Science and Technology, Taishan University, Taian, 271000, China.
- Xin Li
  - Department of Information Science and Technology, Taishan University, Taian, 271000, China.
- Bin Feng
  - Department of Information Science and Technology, Taishan University, Taian, 271000, China.
- Chugui Xu
  - Department of Information Science and Technology, Taishan University, Taian, 271000, China.
- Xiaorong Zhu
  - Department of Information Science and Technology, Taishan University, Taian, 271000, China.
- Ruijun Liu
  - School of Software, Beihang University, Beijing, 100191, China.
8
Murmu S, Sinha D, Chaurasia H, Sharma S, Das R, Jha GK, Archak S. A review of artificial intelligence-assisted omics techniques in plant defense: current trends and future directions. Frontiers in Plant Science 2024; 15:1292054. [PMID: 38504888 PMCID: PMC10948452 DOI: 10.3389/fpls.2024.1292054] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/10/2023] [Accepted: 01/24/2024] [Indexed: 03/21/2024]
Abstract
Plants intricately deploy defense systems to counter diverse biotic and abiotic stresses. Omics technologies, spanning genomics, transcriptomics, proteomics, and metabolomics, have revolutionized the exploration of plant defense mechanisms, unraveling molecular intricacies in response to various stressors. However, the complexity and scale of omics data necessitate sophisticated analytical tools for meaningful insights. This review delves into the application of artificial intelligence algorithms, particularly machine learning and deep learning, as promising approaches for deciphering complex omics data in plant defense research. The overview encompasses key omics techniques and addresses the challenges and limitations inherent in current AI-assisted omics approaches. Moreover, it contemplates potential future directions in this dynamic field. In summary, AI-assisted omics techniques present a robust toolkit, enabling a profound understanding of the molecular foundations of plant defense and paving the way for more effective crop protection strategies amidst climate change and emerging diseases.
Affiliation(s)
- Sneha Murmu
  - Indian Agricultural Statistics Research Institute, Indian Council of Agricultural Research (ICAR), New Delhi, India
- Dipro Sinha
  - Indian Agricultural Statistics Research Institute, Indian Council of Agricultural Research (ICAR), New Delhi, India
- Himanshushekhar Chaurasia
  - Central Institute for Research on Cotton Technology, Indian Council of Agricultural Research (ICAR), Mumbai, India
- Soumya Sharma
  - Indian Agricultural Statistics Research Institute, Indian Council of Agricultural Research (ICAR), New Delhi, India
- Ritwika Das
  - Indian Agricultural Statistics Research Institute, Indian Council of Agricultural Research (ICAR), New Delhi, India
- Girish Kumar Jha
  - Indian Agricultural Statistics Research Institute, Indian Council of Agricultural Research (ICAR), New Delhi, India
- Sunil Archak
  - National Bureau of Plant Genetic Resources, Indian Council of Agricultural Research (ICAR), New Delhi, India
9
Box ICH, van der Burg KRL, Marshall KE. Analysis of Ice-Binding Protein Evolution. Methods Mol Biol 2024; 2730:219-229. [PMID: 37943462 DOI: 10.1007/978-1-0716-3503-2_16] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/10/2023]
Abstract
Discovering novel ice-binding proteins (IBPs) is important for understanding the evolution of IBPs but it is difficult to determine where resources should be directed in the search for novel IBPs. For this reason, we developed a simple bioinformatic approach for aiding in the determination of where to direct efforts in the search for IBPs. First, BLAST is used to obtain a candidate list of putative IBPs. Next, phylogenetic trees are constructed to map the candidate list of putative IBPs to determine if any patterns are forming. These candidate putative IBPs and their patterns are then assessed through the production of ancestral sequences and reverse BLAST searches, in addition to the use of IBP calculators, to determine which sequences should be cut to produce the final putative IBP list. Finally, we explain an avenue to investigate these putative IBPs further for the development of hypotheses on their evolution.
Affiliation(s)
- Isaiah C H Box
  - Department of Zoology, University of British Columbia, Vancouver, BC, Canada
- Katie E Marshall
  - Department of Zoology, University of British Columbia, Vancouver, BC, Canada.
10
Dhibar S, Jana B. Accurate Prediction of Antifreeze Protein from Sequences through Natural Language Text Processing and Interpretable Machine Learning Approaches. J Phys Chem Lett 2023; 14:10727-10735. [PMID: 38009833 DOI: 10.1021/acs.jpclett.3c02817] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/29/2023]
Abstract
Antifreeze proteins (AFPs) bind to growing ice planes owing to their structural complementarity, thereby inhibiting ice-crystal growth by thermal hysteresis. Classification of AFPs from sequence is a difficult task due to their low sequence similarity, and therefore the usual sequence similarity algorithms, like BLAST and PSI-BLAST, are not efficient. Here, a method combining n-gram feature vectors and machine learning models to accelerate the identification of potential AFPs from sequences is proposed. All these n-gram features are extracted with the k-mer counting method. The comparative analysis reveals that, among different machine learning models, XGBoost outperforms the others in predicting AFPs from sequence when penta-mers are used as a feature vector. When tested on an independent dataset, our method performed better than other existing ones, with a sensitivity of 97.50%, recall of 98.30%, and F1 score of 99.10%. Further, we used the SHAP method, which provides important insight into the functional activity of AFPs.
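The k-mer counting step described in this abstract can be sketched in a few lines. This is a generic illustration of overlapping k-mer (n-gram) counting with k = 5; the function names and the toy sequences are hypothetical, and the paper's actual preprocessing and vocabulary handling may differ.

```python
from collections import Counter

def kmer_counts(sequence: str, k: int = 5) -> Counter:
    """Count every overlapping k-mer in a protein sequence."""
    sequence = sequence.upper()
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))

def kmer_vector(sequence: str, vocabulary: list, k: int = 5) -> list:
    """Turn a sequence into a fixed-length count vector over `vocabulary`."""
    counts = kmer_counts(sequence, k)
    return [counts[kmer] for kmer in vocabulary]

# Toy sequences (not real AFPs); the vocabulary is the set of penta-mers seen in them.
sequences = ["MKTAAAAAKT", "AAAAAGG"]
vocabulary = sorted(set().union(*(kmer_counts(s) for s in sequences)))
features = [kmer_vector(s, vocabulary) for s in sequences]
```

Each row of `features` could then be passed to a classifier such as gradient-boosted trees. With 20 amino acids there are 20^5 possible penta-mers, so building the vocabulary from the observed k-mers keeps the vectors manageable.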
Affiliation(s)
- Saikat Dhibar
  - School of Chemical Sciences, Indian Association for the Cultivation of Science, Jadavpur, Kolkata 700032, India
- Biman Jana
  - School of Chemical Sciences, Indian Association for the Cultivation of Science, Jadavpur, Kolkata 700032, India
11
Butt AH, Alkhalifah T, Alturise F, Khan YD. Ensemble Learning for Hormone Binding Protein Prediction: A Promising Approach for Early Diagnosis of Thyroid Hormone Disorders in Serum. Diagnostics (Basel) 2023; 13:1940. [PMID: 37296792 DOI: 10.3390/diagnostics13111940] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/31/2023] [Revised: 05/20/2023] [Accepted: 05/22/2023] [Indexed: 06/12/2023]
Abstract
Hormone-binding proteins (HBPs) are specific carrier proteins that bind to a given hormone. A soluble carrier HBP, which can interact non-covalently and specifically with growth hormone, modulates or inhibits hormone signaling. HBPs are essential for growth, yet they remain poorly understood. According to some data, several diseases are caused by abnormally expressed HBPs. Accurate identification of these molecules is the first step in investigating the roles of HBPs and understanding their biological mechanisms. For a better understanding of cell development and cellular mechanisms, accurate HBP determination from a given protein sequence is essential. Using traditional biochemical experiments, it is difficult to correctly separate HBPs from an increasing number of proteins because of the high experimental costs and lengthy experiment periods. The abundance of protein sequence data gathered in the post-genomic era necessitates an automated computational method that enables quick and accurate identification of putative HBPs within a large number of candidate proteins. A new machine-learning-based predictor is proposed as the HBP identification method. To produce the feature set for the proposed method, statistical moment-based features and amino acids were combined, and a random forest was trained on the feature set. In 5-fold cross-validation experiments, the proposed method achieved 94.37% accuracy and an F1-score of 0.9438, demonstrating the importance of the Hahn moment-based features.
Affiliation(s)
- Ahmad Hassan Butt
  - Department of Computer Science, Faculty of Computing & Information Technology, University of the Punjab, Lahore 54000, Pakistan
- Tamim Alkhalifah
  - Department of Computer, College of Science and Arts in Ar Rass, Qassim University, Ar Rass 51921, Qassim, Saudi Arabia
- Fahad Alturise
  - Department of Computer, College of Science and Arts in Ar Rass, Qassim University, Ar Rass 51921, Qassim, Saudi Arabia
- Yaser Daanial Khan
  - Department of Computer Science, School of Systems and Technology, University of Management and Technology, Lahore 54770, Pakistan
12
Khan A, Uddin J, Ali F, Kumar H, Alghamdi W, Ahmad A. AFP-SPTS: An Accurate Prediction of Antifreeze Proteins Using Sequential and Pseudo-Tri-Slicing Evolutionary Features with an Extremely Randomized Tree. J Chem Inf Model 2023; 63:826-834. [PMID: 36649569 DOI: 10.1021/acs.jcim.2c01417] [Citation(s) in RCA: 20] [Impact Index Per Article: 10.0] [Indexed: 01/19/2023]
Abstract
The development of intracellular ice in the bodies of cold-blooded living organisms may cause them to die. These species yield antifreeze proteins (AFPs) to live in subzero temperature environments. Additionally, AFPs are implemented in biotechnological, industrial, agricultural, and medical fields. Machine learning-based predictors were presented for AFP identification. However, more accurate predictors are still highly desirable for boosting the AFP prediction. This work presents a novel approach, named AFP-SPTS, for the correct prediction of AFPs. We explored the discriminative features with four schemes, namely, dipeptide deviation from the expected mean (DDE), reduced amino acid alphabet (RAAA), grouped dipeptide composition (GDPC), and a novel representative method, called pseudo-position-specific scoring matrix tri-slicing (PseTS-PSSM). Considering the advantages of ensemble learning strategy, we fused each feature vector into different combinations and trained the models with five machine learning algorithms, i.e., multilayer perceptron (MLP), extremely randomized tree (ERT), decision tree (DT), random forest (RF), and AdaBoost. Among all models, PseTS-PSSM + RAAA with an extremely randomized tree attained the best outcomes. The proposed predictor (AFP-SPTS) boosted the accuracies of AFPs in the literature by 1.82 and 4.1%.
Affiliation(s)
- Adnan Khan
  - Qurtuba University of Science and Information Technology, Peshawar 5000, Khyber Pakhtunkhwa, Pakistan
- Jamal Uddin
  - Qurtuba University of Science and Information Technology, Peshawar 5000, Khyber Pakhtunkhwa, Pakistan
- Farman Ali
  - Sarhad University of Science and Information Technology, Mardan Campus, Peshawar 23200, Pakistan
  - Department of Elementary and Secondary Education Department, Government of Khyber Pakhtunkhwa, Peshawar 5000, Khyber Pakhtunkhwa, Pakistan
- Harish Kumar
  - Department of Computer Science, College of Computer Science, King Khalid University, Abha 61421, Saudi Arabia
- Wajdi Alghamdi
  - Department of Information Technology, Faculty of Computing and Information Technology, King AbdulAziz University, Jeddah 21589, Saudi Arabia
- Aftab Ahmad
  - Department of Computer Science, Abdul Wali Khan University Mardan, Mardan 23200, Pakistan
13
Liu H, Zheng G, Chen Z, Ding X, Wu J, Zhang H, Jia S. Psychrophilic Yeasts: Insights into Their Adaptability to Extremely Cold Environments. Genes (Basel) 2023; 14:158. [PMID: 36672901 PMCID: PMC9859383 DOI: 10.3390/genes14010158] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 11/10/2022] [Revised: 12/27/2022] [Accepted: 01/03/2023] [Indexed: 01/11/2023]
Abstract
Psychrophilic yeasts are distributed widely on Earth and have developed adaptation strategies to overcome the effect of low temperatures. They can adapt to low temperatures better than bacteriophyta. However, to date, whole-genome sequence analyses have been limited to single strains of psychrophilic yeasts, which cannot accurately and comprehensively reveal the mechanisms by which they adapt to low temperatures. This study aimed to compare psychrophilic yeasts from different sources at the genomic level and investigate their cold-adaptability mechanisms in a comprehensive manner. Nine genomes of known psychrophilic yeasts and three representative genomes of mesophilic yeasts were collected and annotated. Comparative genomic analysis was performed to compare the differences in their signaling pathways, metabolic regulation, evolution, and psychrophilic genes. The results showed that fatty acid desaturase coding genes are universal and diverse in psychrophilic yeasts, and different numbers of these genes (delta 6, delta 9, delta 12, and delta 15) exist in the genomes of various psychrophilic yeasts. Therefore, they can synthesize polyunsaturated fatty acids (PUFAs) in a variety of ways and may be able to enhance the fluidity of cell membranes at low temperatures by synthesizing C18:3 or C18:4 PUFAs, thereby ensuring their ability to adapt to low-temperature environments. However, mesophilic yeasts have lost most of these genes. In this study, psychrophilic yeasts were found to adapt to low temperatures primarily by synthesizing PUFAs and diverse antifreeze proteins. A comparison of more psychrophilic yeast genomes will be useful for the study of their psychrophilic mechanisms, given the presence of additional potential psychrophilic-related genes in these genomes. This study provides a reference for the study of the psychrophilic mechanisms of psychrophilic yeasts.
Affiliation(s)
- Haisheng Liu
- College of Agriculture and Bioengineering, Heze University, Heze 274000, China
| | - Guiliang Zheng
- College of Marine Life Science, Ocean University of China, Qingdao 266100, China
| | - Zhongwei Chen
- Nantong Ocean Centre of the Ministry of Natural Resources, Nantong 226002, China
| | - Xiaoya Ding
- College of Marine Life Science, Ocean University of China, Qingdao 266100, China
| | - Jinran Wu
- College of Agriculture and Bioengineering, Heze University, Heze 274000, China
| | - Haili Zhang
- College of Agriculture and Bioengineering, Heze University, Heze 274000, China
| | - Shulei Jia
- Institute of Microbiology, Chinese Academy of Sciences, Beijing 100101, China
|
14
|
Erickson E, Gado JE, Avilán L, Bratti F, Brizendine RK, Cox PA, Gill R, Graham R, Kim DJ, König G, Michener WE, Poudel S, Ramirez KJ, Shakespeare TJ, Zahn M, Boyd ES, Payne CM, DuBois JL, Pickford AR, Beckham GT, McGeehan JE. Sourcing thermotolerant poly(ethylene terephthalate) hydrolase scaffolds from natural diversity. Nat Commun 2022; 13:7850. [PMID: 36543766 PMCID: PMC9772341 DOI: 10.1038/s41467-022-35237-x] [Citation(s) in RCA: 64] [Impact Index Per Article: 21.3] [Received: 10/10/2022] [Accepted: 11/21/2022] [Indexed: 12/24/2022]
Abstract
Enzymatic deconstruction of poly(ethylene terephthalate) (PET) is under intense investigation, given the ability of hydrolase enzymes to depolymerize PET to its constituent monomers near the polymer glass transition temperature. To date, reported PET hydrolases have been sourced from a relatively narrow sequence space. Here, we identify additional PET-active biocatalysts from natural diversity by using bioinformatics and machine learning to mine 74 putative thermotolerant PET hydrolases. We successfully express, purify, and assay 51 enzymes from seven distinct phylogenetic groups, observing PET hydrolysis activity on amorphous PET film from 37 enzymes in reactions spanning pH from 4.5-9.0 and temperatures from 30-70 °C. We conduct PET hydrolysis time-course reactions with the best-performing enzymes, where we observe differences in substrate selectivity as a function of PET morphology. We employ X-ray crystallography and AlphaFold to examine the enzyme architectures of all 74 candidates, revealing protein folds and accessory domains not previously associated with PET deconstruction. Overall, this study expands the number and diversity of thermotolerant scaffolds for enzymatic PET deconstruction.
Affiliation(s)
- Erika Erickson
- Renewable Resources and Enabling Sciences Center, National Renewable Energy Laboratory, Golden, CO, USA
- BOTTLE Consortium, Golden, CO, USA
| | - Japheth E Gado
- Renewable Resources and Enabling Sciences Center, National Renewable Energy Laboratory, Golden, CO, USA
- BOTTLE Consortium, Golden, CO, USA
| | - Luisana Avilán
- Centre for Enzyme Innovation, School of Biological Sciences, University of Portsmouth, Portsmouth, UK
| | - Felicia Bratti
- Renewable Resources and Enabling Sciences Center, National Renewable Energy Laboratory, Golden, CO, USA
- BOTTLE Consortium, Golden, CO, USA
| | - Richard K Brizendine
- Renewable Resources and Enabling Sciences Center, National Renewable Energy Laboratory, Golden, CO, USA
- BOTTLE Consortium, Golden, CO, USA
| | - Paul A Cox
- Centre for Enzyme Innovation, School of Biological Sciences, University of Portsmouth, Portsmouth, UK
| | - Raj Gill
- Centre for Enzyme Innovation, School of Biological Sciences, University of Portsmouth, Portsmouth, UK
| | - Rosie Graham
- Centre for Enzyme Innovation, School of Biological Sciences, University of Portsmouth, Portsmouth, UK
| | - Dong-Jin Kim
- BOTTLE Consortium, Golden, CO, USA
- Department of Biochemistry, Montana State University, Bozeman, MT, USA
| | - Gerhard König
- Centre for Enzyme Innovation, School of Biological Sciences, University of Portsmouth, Portsmouth, UK
| | - William E Michener
- Renewable Resources and Enabling Sciences Center, National Renewable Energy Laboratory, Golden, CO, USA
- BOTTLE Consortium, Golden, CO, USA
| | - Saroj Poudel
- Department of Microbiology and Cell Biology, Montana State University, Bozeman, MT, USA
| | - Kelsey J Ramirez
- Renewable Resources and Enabling Sciences Center, National Renewable Energy Laboratory, Golden, CO, USA
- BOTTLE Consortium, Golden, CO, USA
| | - Thomas J Shakespeare
- Centre for Enzyme Innovation, School of Biological Sciences, University of Portsmouth, Portsmouth, UK
| | - Michael Zahn
- Centre for Enzyme Innovation, School of Biological Sciences, University of Portsmouth, Portsmouth, UK
| | - Eric S Boyd
- Department of Microbiology and Cell Biology, Montana State University, Bozeman, MT, USA
| | | | - Jennifer L DuBois
- BOTTLE Consortium, Golden, CO, USA
- Department of Biochemistry, Montana State University, Bozeman, MT, USA
| | - Andrew R Pickford
- BOTTLE Consortium, Golden, CO, USA
- Centre for Enzyme Innovation, School of Biological Sciences, University of Portsmouth, Portsmouth, UK
| | - Gregg T Beckham
- Renewable Resources and Enabling Sciences Center, National Renewable Energy Laboratory, Golden, CO, USA.
- BOTTLE Consortium, Golden, CO, USA.
| | - John E McGeehan
- BOTTLE Consortium, Golden, CO, USA.
- Centre for Enzyme Innovation, School of Biological Sciences, University of Portsmouth, Portsmouth, UK.
- World Plastics Association, Fontvieille, Monaco.
|
15
|
Khan A, Uddin J, Ali F, Ahmad A, Alghushairy O, Banjar A, Daud A. Prediction of antifreeze proteins using machine learning. Sci Rep 2022; 12:20672. [PMID: 36450775 PMCID: PMC9712683 DOI: 10.1038/s41598-022-24501-1] [Citation(s) in RCA: 16] [Impact Index Per Article: 5.3] [Received: 05/16/2022] [Accepted: 11/16/2022] [Indexed: 12/03/2022]
Abstract
Living organisms including fishes, microbes, and animals can live in extremely cold weather. To stay alive in cold environments, these species generate antifreeze proteins (AFPs), also referred to as ice-binding proteins. Moreover, AFPs are extensively utilized in many important fields, including medicine, agriculture, industry, and biotechnology. Several predictors have been constructed to identify AFPs. However, due to the sequence and structural heterogeneity of AFPs, correct identification is still a challenging task, and a more accurate predictor is highly desirable. In this research, a novel computational method, named AFP-LXGB, is proposed to predict AFPs more precisely. Sequence information is extracted with Dipeptide Composition (DPC), Grouped Amino Acid Composition (GAAC), Position Specific Scoring Matrix-Segmentation-Autocorrelation Transformation (Sg-PSSM-ACT), and Pseudo Position Specific Scoring Matrix Tri-Slicing (PseTS-PSSM). To retain the benefits of ensemble learning, these feature sets are concatenated into different combinations. The best feature set is selected by Extremely Randomized Tree-Recursive Feature Elimination (ERT-RFE). The models are trained with Light eXtreme Gradient Boosting (LXGB), Random Forest (RF), and Extremely Randomized Tree (ERT). Among the classifiers, LXGB obtained the best prediction results. AFP-LXGB improved accuracy by 3.70% and 4.09% over the best existing methods. These results indicate that AFP-LXGB can predict AFPs more accurately and can play a significant role in the medical, agricultural, industrial, and biotechnological fields.
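As an illustration of the simplest descriptor named in this abstract, Dipeptide Composition is conventionally the normalized frequency of each of the 400 ordered amino-acid pairs in a sequence. The sketch below follows that common definition and is not taken from the AFP-LXGB code; the function name, the handling of non-standard residues, and the example sequence are assumptions made for illustration.

```python
from itertools import product

# The 20 standard amino acids (one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def dipeptide_composition(sequence: str) -> dict[str, float]:
    """Return the frequency of each of the 400 ordered dipeptides.

    Assumption: pairs containing a non-standard residue (e.g. 'X')
    are skipped and do not count toward the total.
    """
    seq = sequence.upper()
    counts = {a + b: 0 for a, b in product(AMINO_ACIDS, repeat=2)}
    total = 0
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:
            counts[pair] += 1
            total += 1
    if total == 0:
        # Sequence too short or no valid pairs: return an all-zero vector.
        return {pair: 0.0 for pair in counts}
    return {pair: n / total for pair, n in counts.items()}


# Hypothetical toy sequence, used only to show the call.
features = dipeptide_composition("MKTAYIAKQR")
print(len(features), features["MK"])
```

The resulting 400-dimensional vector has a fixed length regardless of sequence length, which is what allows it to be concatenated with the other feature sets before feature selection.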
Affiliation(s)
- Adnan Khan
- Qurtuba University of Science and Technology, Peshawar, Khyber Pakhtunkhwa, Pakistan
| | - Jamal Uddin
- Qurtuba University of Science and Technology, Peshawar, Khyber Pakhtunkhwa, Pakistan
| | - Farman Ali
- Department of Elementary and Secondary Education, Peshawar, Khyber Pakhtunkhwa, Pakistan
- Sarhad University of Science and Information Technology, Mardan, Pakistan
| | - Ashfaq Ahmad
- Department of Computer Science, Abdul Wali Khan University Mardan, Mardan, Pakistan
| | - Omar Alghushairy
- Department of Information Systems and Technology, College of Computer Science and Engineering, University of Jeddah, Jeddah, Saudi Arabia
| | - Ameen Banjar
- Department of Information Systems and Technology, College of Computer Science and Engineering, University of Jeddah, Jeddah, Saudi Arabia
| | - Ali Daud
- Abu Dhabi School of Management, Abu Dhabi, United Arab Emirates
- Department of Computer Science and Artificial Intelligence, University of Jeddah, Jeddah, Saudi Arabia
|
16
|
Satyakam, Zinta G, Singh RK, Kumar R. Cold adaptation strategies in plants—An emerging role of epigenetics and antifreeze proteins to engineer cold resilient plants. Front Genet 2022; 13:909007. [PMID: 36092945 PMCID: PMC9459425 DOI: 10.3389/fgene.2022.909007] [Citation(s) in RCA: 20] [Impact Index Per Article: 6.7] [Received: 03/31/2022] [Accepted: 07/21/2022] [Indexed: 11/13/2022]
Abstract
Cold stress adversely affects plant growth, development, and yield. Also, the spatial and geographical distribution of plant species is influenced by low temperatures. Cold stress includes chilling and/or freezing temperatures, which trigger entirely different plant responses. Freezing tolerance is acquired via the cold acclimation process, which involves prior exposure to non-lethal low temperatures followed by profound alterations in cell membrane rigidity, transcriptome, compatible solutes, pigments and cold-responsive proteins such as antifreeze proteins. Moreover, epigenetic mechanisms such as DNA methylation, histone modifications, chromatin dynamics and small non-coding RNAs play a crucial role in cold stress adaptation. Here, we provide a recent update on cold-induced signaling and regulatory mechanisms. Emphasis is given to the role of epigenetic mechanisms and antifreeze proteins in imparting cold stress tolerance in plants. Lastly, we discuss genetic manipulation strategies to improve cold tolerance and develop cold-resistant plants.
|
17
|
Zhang YH, Li ZD, Zeng T, Chen L, Huang T, Cai YD. Screening gene signatures for clinical response subtypes of lung transplantation. Mol Genet Genomics 2022; 297:1301-1313. [PMID: 35780439 DOI: 10.1007/s00438-022-01918-x] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 09/21/2021] [Accepted: 06/12/2022] [Indexed: 11/30/2022]
Abstract
The lung is the most important organ in the human respiratory system, and its normal function is essential for human beings. Under certain pathological conditions, normal lung function can no longer be maintained, and lung transplantation is generally applied to ease patients' breathing and prolong their lives. However, several risk factors exist during and after lung transplantation, including bleeding, infection, and transplant rejections. In particular, transplant rejections are difficult to predict or prevent, leading to the most dangerous complications and severe status in patients undergoing lung transplantation. Given that most common monitoring and validation methods for lung transplantation rejections may take quite a long time and have low reproducibility, new technologies and methods are required to improve the efficacy and accuracy of rejection monitoring after lung transplantation. Recently, one previous study set up the gene expression profiles of patients who underwent lung transplantation. However, it did not provide a tool to predict lung transplantation responses. Here, a further in-depth investigation was conducted on such profiling data. A computational framework, incorporating several machine learning algorithms, such as feature selection methods and classification algorithms, was built to establish an effective prediction model distinguishing patients into different clinical subgroups, corresponding to different rejection responses after lung transplantation. Furthermore, the framework also screened essential genes with functional enrichments and created quantitative rules for the distinction of patients with different rejection responses to lung transplantation. The outcome of this contribution could provide guidelines for clinical treatment of each rejection subtype and contribute to the revealing of complicated rejection mechanisms of lung transplantation.
Affiliation(s)
- Yu-Hang Zhang
- School of Life Sciences, Shanghai University, Shanghai, 200444, China
- Channing Division of Network Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
| | - Zhan Dong Li
- College of Food Engineering, Jilin Engineering Normal University, Changchun, 130052, China
| | - Tao Zeng
- Bio-Med Big Data Center, CAS Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, 200031, China
| | - Lei Chen
- College of Information Engineering, Shanghai Maritime University, Shanghai, 201306, China
| | - Tao Huang
- Bio-Med Big Data Center, CAS Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, 200031, China.
- CAS Key Laboratory of Tissue Microenvironment and Tumor, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, China.
| | - Yu-Dong Cai
- School of Life Sciences, Shanghai University, Shanghai, 200444, China.
|
18
|
Li H, Zhang S, Chen L, Pan X, Li Z, Huang T, Cai YD. Identifying Functions of Proteins in Mice With Functional Embedding Features. Front Genet 2022; 13:909040. [PMID: 35651937 PMCID: PMC9149260 DOI: 10.3389/fgene.2022.909040] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 03/31/2022] [Accepted: 04/28/2022] [Indexed: 12/02/2022]
Abstract
In current biology, exploring the biological functions of proteins is important. Given the large number of proteins in some organisms, exploring their functions one by one through traditional experiments is impossible. Therefore, developing quick and reliable methods for identifying protein functions is necessary. Considerable accumulation of protein knowledge and recent developments in computer science provide an alternative way to complete this task, that is, designing computational methods. Several efforts have been made in this field. Most previous methods have adopted the protein sequence features or directly used the linkage from a protein–protein interaction (PPI) network. In this study, we proposed some novel multi-label classifiers, which adopted new embedding features to represent proteins. These features were derived from functional domains and a PPI network via word embedding and network embedding, respectively. The minimum redundancy maximum relevance method was used to assess the features, generating a feature list. Incremental feature selection, incorporating RAndom k-labELsets to construct multi-label classifiers, used this list to construct two optimum classifiers, corresponding to two key measurements: accuracy and exact match. These two classifiers had good performance, and they were superior to classifiers that used features extracted by traditional methods.
Affiliation(s)
- Hao Li
- College of Biological and Food Engineering, Jilin Engineering Normal University, Changchun, China
| | - ShiQi Zhang
- Department of Biostatistics, University of Copenhagen, Copenhagen, Denmark
| | - Lei Chen
- College of Information Engineering, Shanghai Maritime University, Shanghai, China
| | - Xiaoyong Pan
- Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, and Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai, China
| | - ZhanDong Li
- College of Biological and Food Engineering, Jilin Engineering Normal University, Changchun, China
| | - Tao Huang
- Bio-Med Big Data Center, CAS Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, China
- CAS Key Laboratory of Tissue Microenvironment and Tumor, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, China
| | - Yu-Dong Cai
- School of Life Sciences, Shanghai University, Shanghai, China
|
19
|
Li Z, Pan X, Cai YD. Identification of Type 2 Diabetes Biomarkers From Mixed Single-Cell Sequencing Data With Feature Selection Methods. Front Bioeng Biotechnol 2022; 10:890901. [PMID: 35721855 PMCID: PMC9201257 DOI: 10.3389/fbioe.2022.890901] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/07/2022] [Accepted: 04/04/2022] [Indexed: 11/18/2022]
Abstract
Diabetes is the most common disease and a major threat to human health. Type 2 diabetes (T2D) makes up about 90% of all cases. With the development of high-throughput sequencing technologies, more and more fundamental pathogenesis of T2D at genetic and transcriptomic levels has been revealed. The recent single-cell sequencing can further reveal the cellular heterogenicity of complex diseases in an unprecedented way. With the expectation on the molecular essence of T2D across multiple cell types, we investigated the expression profiling of more than 1,600 single cells (949 cells from T2D patients and 651 cells from normal controls) and identified the differential expression profiling and characteristics at the transcriptomics level that can distinguish such two groups of cells at the single-cell level. The expression profile was analyzed by several machine learning algorithms, including Monte Carlo feature selection, support vector machine, and repeated incremental pruning to produce error reduction (RIPPER). On one hand, some T2D-associated genes (MTND4P24, MTND2P28, and LOC100128906) were discovered. On the other hand, we revealed novel potential pathogenic mechanisms in a rule manner. They are induced by newly recognized genes and neglected by traditional bulk sequencing techniques. Particularly, the newly identified T2D genes were shown to follow specific quantitative rules with diabetes prediction potentials, and such rules further indicated several potential functional crosstalks involved in T2D.
Affiliation(s)
- Zhandong Li
- College of Biological and Food Engineering, Jilin Engineering Normal University, Changchun, China
| | - Xiaoyong Pan
- Key Laboratory of System Control and Information Processing, Institute of Image Processing and Pattern Recognition, Ministry of Education of China, Shanghai Jiao Tong University, Shanghai, China
| | - Yu-Dong Cai
- School of Life Sciences, Shanghai University, Shanghai, China
- *Correspondence: Yu-Dong Cai
|
20
|
Bereded NK, Abebe GB, Fanta SW, Curto M, Waidbacher H, Meimberg H, Domig KJ. The gut bacterial microbiome of Nile tilapia (Oreochromis niloticus) from lakes across an altitudinal gradient. BMC Microbiol 2022; 22:87. [PMID: 35379180 PMCID: PMC8978401 DOI: 10.1186/s12866-022-02496-z] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Received: 10/11/2021] [Accepted: 03/17/2022] [Indexed: 12/27/2022]
Abstract
Background: Microorganisms inhabiting the gut play a significant role in supporting fundamental physiological processes of the host, which contributes to their survival in varied environments. Several studies have shown that altitude affects the composition and diversity of intestinal microbial communities in terrestrial animals. However, little is known about the impact of altitude on the gut microbiota of aquatic animals. The current study examined the variations in the gut microbiota of Nile tilapia (Oreochromis niloticus) from four lakes along an altitudinal gradient in Ethiopia by using 16S rDNA Illumina MiSeq high-throughput sequencing.
Results: The results indicated that low-altitude samples typically displayed greater alpha diversity. The results of principal coordinate analysis (PCoA) showed significant differences across samples from different lakes. Firmicutes was the most abundant phylum in the Lake Awassa and Lake Chamo samples, whereas Fusobacteriota was the dominant phylum in samples from Lake Hashengie and Lake Tana. The ratio of Firmicutes to Bacteroidota in the high-altitude sample (Lake Hashengie, altitude 2440 m) was much higher than that in the low-altitude population (Lake Chamo, altitude 1235 m). We found that the relative abundances of Actinobacteriota, Chloroflexi, Cyanobacteria, and Firmicutes were negatively correlated with altitude, while Fusobacteriota showed a positive association with altitude. Despite variability in the abundance of the gut microbiota across the lakes, some shared bacterial communities were detected.
Conclusions: In summary, this study showed the indirect influence of altitude on gut microbiota. Altitude has the potential to modulate the gut microbiota composition and diversity of Nile tilapia. Future work will be needed to elucidate the functional significance of gut microbiota variations based on the geographical environment.
Significance and impact of the study: Our study determined the composition and diversity of the gut microbiota in Nile tilapia collected from lakes across an altitude gradient. Our findings greatly extend the baseline knowledge of fish gut microbiota in Ethiopian lakes, which plays an important role in this species' sustainable aquaculture activities and conservation.
Affiliation(s)
- Negash Kabtimer Bereded
- University of Natural Resources and Life Sciences, Vienna, Austria
- Department of Food Science and Technology, Institute of Food Science, Muthgasse 18, 1190, Vienna, Austria
- Department of Biology, Bahir Dar University, Post Code 79, Bahir Dar, Ethiopia
| | | | - Solomon Workneh Fanta
- Faculty of Chemical and Food Engineering, Bahir Dar Institute of Technology, Bahir Dar University, Post Code 26, Bahir Dar, Ethiopia
| | - Manuel Curto
- Department of Integrative Biology and Biodiversity Research, Institute for Integrative Nature Conservation Research, Gregor Mendel Strasse 33, 1180, Vienna, Austria
- MARE-Marine and Environmental Sciences Centre, Faculdade de Ciências, Universidade de Lisboa, Campo Grande, 1049-001, Lisboa, Portugal
| | - Herwig Waidbacher
- Department of Water, Atmosphere and Environment, Institute of Hydrobiology and Aquatic Ecosystem Management, Gregor Mendel Strasse 33, 1180, Vienna, Austria
| | - Harald Meimberg
- Department of Integrative Biology and Biodiversity Research, Institute for Integrative Nature Conservation Research, Gregor Mendel Strasse 33, 1180, Vienna, Austria
| | - Konrad J Domig
- University of Natural Resources and Life Sciences, Vienna, Austria
- Department of Food Science and Technology, Institute of Food Science, Muthgasse 18, 1190, Vienna, Austria
|
21
|
Box ICH, Matthews BJ, Marshall KE. Molecular evidence of intertidal habitats selecting for repeated ice-binding protein evolution in invertebrates. J Exp Biol 2022; 225:274373. [PMID: 35258616 DOI: 10.1242/jeb.243409] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 08/30/2021] [Accepted: 12/20/2021] [Indexed: 12/21/2022]
Abstract
Ice-binding proteins (IBPs) have evolved independently in multiple taxonomic groups to improve their survival at sub-zero temperatures. Intertidal invertebrates in temperate and polar regions frequently encounter sub-zero temperatures, yet there is little information on IBPs in these organisms. We hypothesized that there are far more IBPs than are currently known and that the occurrence of freezing in the intertidal zone selects for these proteins. We compiled a list of genome-sequenced invertebrates across multiple habitats and a list of known IBP sequences and used BLAST to identify a wide array of putative IBPs in those invertebrates. We found that the probability of an invertebrate species having an IBP was significantly greater in intertidal species than in those primarily found in open ocean or freshwater habitats. These intertidal IBPs had high sequence similarity to fish and tick antifreeze glycoproteins and fish type II antifreeze proteins. Previously established classifiers based on machine learning techniques further predicted ice-binding activity in the majority of our newly identified putative IBPs. We investigated the potential evolutionary origin of one putative IBP from the hard-shelled mussel Mytilus coruscus and suggest that it arose through gene duplication and neofunctionalization. We show that IBPs likely readily evolve in response to freezing risk and that there is an array of uncharacterized IBPs, and highlight the need for broader laboratory-based surveys of the diversity of ice-binding activity across diverse taxonomic and ecological groups.
Affiliation(s)
- Isaiah C H Box
- Department of Zoology, University of British Columbia, 6270 University Blvd, Vancouver, BC, Canada V6T 1Z4
| | - Benjamin J Matthews
- Department of Zoology, University of British Columbia, 6270 University Blvd, Vancouver, BC, Canada V6T 1Z4
| | - Katie E Marshall
- Department of Zoology, University of British Columbia, 6270 University Blvd, Vancouver, BC, Canada V6T 1Z4
|
22
|
Li X, Lu L, Chen L. Identification of protein functions in mouse with a label space partition method. MATHEMATICAL BIOSCIENCES AND ENGINEERING : MBE 2022; 19:3820-3842. [PMID: 35341276 DOI: 10.3934/mbe.2022176] [Citation(s) in RCA: 25] [Impact Index Per Article: 8.3] [Indexed: 06/14/2023]
Abstract
Proteins are very important for almost all living creatures because they participate in most complicated and essential biological processes. Determining the functions of given proteins is one of the most essential problems in protein science. Such determination can be conducted through traditional experiments. However, the experimental methods are time-consuming and costly. In recent years, computational methods have provided useful aid for the identification of protein functions. This study presented a new multi-label classifier for identifying functions of mouse proteins. Owing to the number of functional types, which were termed labels in the classification procedure, a label space partition method was employed to divide the labels into partitions. On each partition, a multi-label classifier was constructed. The classifiers based on all partitions were integrated in the proposed classifier. The cross-validation results showed that the proposed classifier performed well. Classifiers with label partition were superior to those without label partition or with random label partition.
Affiliation(s)
- Xuan Li
- College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China
| | - Lin Lu
- Department of Radiology, Columbia University Medical Center, New York 10032, USA
| | - Lei Chen
- College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China
|
23
|
Usman M, Khan S, Park S, Wahab A. AFP-SRC: identification of antifreeze proteins using sparse representation classifier. Neural Comput Appl 2022. [DOI: 10.1007/s00521-021-06558-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/20/2022]
|
24
|
Bhukya R, Kumari A, Amilpur S, Dasari CM. PPred-PCKSM: A multi-layer predictor for identifying promoter and its variants using position based features. Comput Biol Chem 2022; 97:107623. [DOI: 10.1016/j.compbiolchem.2022.107623] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 07/31/2021] [Revised: 01/02/2022] [Accepted: 01/05/2022] [Indexed: 11/03/2022]
|
25
|
Ali F, Akbar S, Ghulam A, Maher ZA, Unar A, Talpur DB. AFP-CMBPred: Computational identification of antifreeze proteins by extending consensus sequences into multi-blocks evolutionary information. Comput Biol Med 2021; 139:105006. [PMID: 34749096 DOI: 10.1016/j.compbiomed.2021.105006] [Citation(s) in RCA: 46] [Impact Index Per Article: 11.5] [Received: 09/13/2021] [Revised: 10/29/2021] [Accepted: 10/29/2021] [Indexed: 11/30/2022]
Abstract
In extremely cold environments, living organisms like plants, animals, fishes, and microbes can die due to intracellular ice formation in their bodies. To sustain life in such cold environments, some cold-blooded species produce antifreeze proteins (AFPs), also called ice-binding proteins. AFPs are not limited to the medical field but also have diverse significance in biotechnology, agriculture, and the food industry. Different AFPs exhibit high heterogeneity in their structures and sequences. Given the significance of AFPs, several machine-learning-based models have been developed for their prediction. However, due to the complex and diverse nature of AFPs, the prediction performance of the existing methods is limited. Therefore, a reliable computational model that can accurately predict AFPs is needed. In this connection, this study presents a novel predictor for AFPs, named AFP-CMBPred. The sequences of AFPs are formulated via four different feature representation methods, namely Amphiphilic pseudo amino acid composition (Amp-PseAAC), Dipeptide Deviation from Expected Mean (DDE), Multi-Blocks Position Specific Scoring Matrix (MB-PSSM), and Consensus Sequence-based on Multi-Blocks Position Specific Scoring Matrix (CS-MB-PSSM), to collect local and global descriptors. In the next step, the extracted feature vectors are evaluated via Support Vector Machine (SVM) and Random Forest (RF) based classification learners. The prediction performance of both classifiers is further assessed using three validation methods, i.e., the jackknife test, the 10-fold cross-validation test, and the independent test. After examining the prediction rates of all validation tests, it was found that our proposed model achieved higher prediction accuracies, by ∼2.65%, ∼2.84%, and ∼3.37%, using the jackknife, K-fold, and independent tests, respectively. The experimental outcomes validate that our proposed "AFP-CMBPred" predictor secured higher prediction results than the existing models for the identification of AFPs. It is further anticipated that our proposed AFP-CMBPred model will be considered a valuable tool in research academia and drug development.
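The validation schemes named in this abstract can be sketched in a few lines. The example below shows only how sample indices might be partitioned for K-fold cross-validation and for the jackknife test, which in this literature usually means leave-one-out validation; that reading, the function names, and the fixed random seed are assumptions for illustration, and no classifier or result from the paper is reproduced.

```python
import random
from typing import Iterator


def k_fold_indices(n_samples: int, k: int = 10,
                   seed: int = 0) -> Iterator[tuple[list[int], list[int]]]:
    """Yield (train, test) index lists for K-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle once, reproducibly
    # Deal indices round-robin so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def jackknife_indices(n_samples: int) -> Iterator[tuple[list[int], list[int]]]:
    """Yield (train, test) index lists for leave-one-out validation."""
    for i in range(n_samples):
        yield [j for j in range(n_samples) if j != i], [i]


# Hypothetical dataset size, used only to show the call.
for train, test in k_fold_indices(25, k=10):
    print(len(train), len(test))
```

Each sample is held out exactly once in both schemes; the independent test differs in that its samples are never seen during training at all.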
Affiliation(s)
- Farman Ali
  - School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China.
- Shahid Akbar
  - Department of Computer Science, Abdul Wali Khan University Mardan, Pakistan
- Ali Ghulam
  - Computerization and Network Section, Sindh Agriculture University, Tandojam, Pakistan
- Ahsanullah Unar
  - School of Life Science, University of Science and Technology, China
- Dhani Bux Talpur
  - School of Information and Communication Engineering, Guilin University of Electronic Technology, Guilin, China
|
26
|
iMPT-FDNPL: Identification of Membrane Protein Types with Functional Domains and a Natural Language Processing Approach. Computational and Mathematical Methods in Medicine 2021; 2021:7681497. [PMID: 34671418 PMCID: PMC8523280 DOI: 10.1155/2021/7681497] [Citation(s) in RCA: 29] [Impact Index Per Article: 7.3] [Received: 08/22/2021] [Revised: 09/15/2021] [Accepted: 09/27/2021] [Indexed: 12/20/2022]
Abstract
Membrane proteins are an important class of proteins that play essential roles in several cellular processes. Based on their intramolecular arrangements and positions in a cell, membrane proteins can be divided into several types. The type of a membrane protein is reported to be highly related to its functions. Determining membrane protein types has been a hot topic in recent years, and many computational methods have been proposed so far. Some of them used functional domain information to encode proteins. However, this procedure was still crude. In this study, we designed a novel feature extraction scheme to obtain informative features of proteins from their functional domain information. The scheme treats domains as words and proteins, each represented by its domains, as sentences. The natural language processing approach word2vector was applied to obtain the features of domains, which were further refined into protein features. Based on these features, RAndom k-labELsets with random forest as the base classifier was employed to build the multilabel classifier, namely, iMPT-FDNPL. The tenfold cross-validation results indicated the good performance of this classifier. Furthermore, the classifier was superior to other classifiers based on features derived from functional domains via a one-hot scheme or derived from other properties of proteins, suggesting the effectiveness of the protein features generated by the proposed scheme.
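The "domains as words, proteins as sentences" idea can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not say how domain vectors are refined into protein features, so plain averaging is assumed here, and the domain identifiers and 2-dimensional vectors are hypothetical stand-ins for trained embeddings.

```python
from typing import Dict, List, Sequence


def protein_vector(domains: Sequence[str],
                   domain_vectors: Dict[str, List[float]],
                   dim: int) -> List[float]:
    """Average the vectors of a protein's known domains (zeros if none is known)."""
    known = [domain_vectors[d] for d in domains if d in domain_vectors]
    if not known:
        return [0.0] * dim
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


# Hypothetical toy corpus: each protein "sentence" is its list of domain "words".
corpus = {"P1": ["DOM_A", "DOM_B"], "P2": ["DOM_B"], "P3": []}

# Hypothetical embeddings standing in for the output of a trained word-embedding model.
domain_vectors = {"DOM_A": [1.0, 0.0], "DOM_B": [0.0, 1.0]}

features = {name: protein_vector(doms, domain_vectors, dim=2)
            for name, doms in corpus.items()}
```

Unlike a one-hot encoding, whose length grows with the number of distinct domains, the resulting protein vectors keep the fixed dimension of the embedding.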
|
27
|
Al-Saggaf UM, Usman M, Naseem I, Moinuddin M, Jiman AA, Alsaggaf MU, Alshoubaki HK, Khan S. ECM-LSE: Prediction of Extracellular Matrix Proteins Using Deep Latent Space Encoding of k-Spaced Amino Acid Pairs. Front Bioeng Biotechnol 2021; 9:752658. [PMID: 34722479 PMCID: PMC8552119 DOI: 10.3389/fbioe.2021.752658] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 08/03/2021] [Accepted: 09/13/2021] [Indexed: 12/26/2022] Open
Abstract
Extracellular matrix (ECM) proteins create complex networks of macromolecules that fill the extracellular spaces of living tissues. They provide structural support and play an important role in maintaining cellular functions. Identification of ECM proteins can play a vital role in studying various types of diseases. Conventional wet-lab methods are reliable; however, they are expensive and time-consuming and are, therefore, not scalable. In this research, we propose a novel sequence-based machine learning approach for the prediction of ECM proteins. In the proposed method, composition of k-spaced amino acid pair (CKSAAP) features are encoded into a classifiable latent space (LS) with the help of deep latent space encoding (LSE). A comprehensive ablation analysis is conducted for performance evaluation of the proposed method. Results are compared with other state-of-the-art methods on the benchmark dataset, and the proposed ECM-LSE approach is shown to comprehensively outperform the contemporary methods.
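The CKSAAP descriptor mentioned above counts residue pairs separated by a fixed gap. The following is a minimal sketch of the commonly used definition, not the authors' implementation; normalizing by the number of counted pairs (rather than by sequence length) and skipping non-standard residues are assumptions made here.

```python
from itertools import product
from typing import Dict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def cksaap(sequence: str, k: int) -> Dict[str, float]:
    """Frequency of each residue pair separated by exactly k residues.

    Returns 400 values, one per ordered pair of standard amino acids.
    """
    counts = {a + b: 0 for a, b in product(AMINO_ACIDS, repeat=2)}
    n_pairs = 0
    for i in range(len(sequence) - k - 1):
        pair = sequence[i] + sequence[i + k + 1]
        if pair in counts:  # skip pairs that contain a non-standard residue
            counts[pair] += 1
            n_pairs += 1
    if n_pairs == 0:
        return {pair: 0.0 for pair in counts}
    return {pair: count / n_pairs for pair, count in counts.items()}
```

Concatenating the 400-value vectors for several gaps (for example k = 0, 1, 2) gives the kind of high-dimensional input that a latent space encoder would then compress.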
Affiliation(s)
- Ubaid M. Al-Saggaf
  - Center of Excellence in Intelligent Engineering Systems, King Abdulaziz University, Jeddah, Saudi Arabia
  - Electrical and Computer Engineering Department, King Abdulaziz University, Jeddah, Saudi Arabia
- Muhammad Usman
  - Department of Computer Engineering, Chosun University, Gwangju, South Korea
- Imran Naseem
  - Research and Development, Love For Data, Karachi, Pakistan
  - School of Electrical, Electronic and Computer Engineering, The University of Western Australia, Perth, WA, Australia
  - College of Engineering, Karachi Institute of Economics and Technology, Korangi Creek, Karachi, Pakistan
- Muhammad Moinuddin
  - Center of Excellence in Intelligent Engineering Systems, King Abdulaziz University, Jeddah, Saudi Arabia
  - Electrical and Computer Engineering Department, King Abdulaziz University, Jeddah, Saudi Arabia
- Ahmad A. Jiman
  - Electrical and Computer Engineering Department, King Abdulaziz University, Jeddah, Saudi Arabia
- Mohammed U. Alsaggaf
  - Center of Excellence in Intelligent Engineering Systems, King Abdulaziz University, Jeddah, Saudi Arabia
  - Department of Radiology, Faculty of Medicine, King Abdulaziz University, Jeddah, Saudi Arabia
- Hitham K. Alshoubaki
  - Center of Excellence in Intelligent Engineering Systems, King Abdulaziz University, Jeddah, Saudi Arabia
  - Electrical and Computer Engineering Department, King Abdulaziz University, Jeddah, Saudi Arabia
- Shujaat Khan
  - Department of Bio and Brain Engineering, Daejeon, South Korea
|
28
|
Prediction and analysis of antifreeze proteins. Heliyon 2021; 7:e07953. [PMID: 34604556 PMCID: PMC8473546 DOI: 10.1016/j.heliyon.2021.e07953] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 05/19/2021] [Revised: 07/28/2021] [Accepted: 09/03/2021] [Indexed: 11/20/2022] Open
Abstract
Antifreeze proteins (AFPs) are proteins that protect cellular fluids and body fluids from freezing by inhibiting the nucleation and growth of ice crystals and preventing ice recrystallization, thereby contributing to the maintenance of life in living organisms. They exist in fish, insects, microorganisms, and fungi. However, the number of known AFPs is currently limited, and it is essential to construct a reliable dataset of AFPs and develop a bioinformatics tool to predict AFPs. In this work, we first collected AFPs sequences from UniProtKB considering the reliability of annotations and, based on these datasets, developed a prediction system using random forest. We achieved accuracies of 0.961 and 0.947 for non-redundant sequences with less than 90% and 30% identities and achieved the accuracy of 0.953 for representative sequences for each species. Using the ability of random forest, we identified the sequence features that contributed to the prediction. Some sequence features were common to AFPs from different species. These features include the Cys content, Ala-Ala content, Trp-Gly content, and the amino acids' distribution related to the disorder propensity. The computer program and the dataset developed in this work are available from the GitHub site: https://github.com/ryomiya/Prediction-and-analysis-of-antifreeze-proteins.
|
29
|
Wang S, Deng L, Xia X, Cao Z, Fei Y. Predicting antifreeze proteins with weighted generalized dipeptide composition and multi-regression feature selection ensemble. BMC Bioinformatics 2021; 22:340. [PMID: 34162327 PMCID: PMC8220696 DOI: 10.1186/s12859-021-04251-z] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 05/04/2021] [Accepted: 06/09/2021] [Indexed: 12/15/2022] Open
Abstract
BACKGROUND Antifreeze proteins (AFPs) are a group of proteins that inhibit the growth of ice crystals in body fluids and thus improve biological antifreeze ability. They are vital to the survival of living organisms in extremely cold environments. However, little research has been performed on sequence feature extraction and selection for antifreeze protein classification, which is of great significance in structure and function prediction. RESULTS In this paper, to predict antifreeze proteins, a feature representation based on weighted generalized dipeptide composition (W-GDipC) and an ensemble feature selection method based on a two-stage, multi-regression scheme (LRMR-Ri) are proposed. Specifically, four feature selection algorithms (Lasso regression, Ridge regression, maximal information coefficient, and Relief) are each used to select a feature set, which is the first stage of the LRMR-Ri method. If a common feature subset exists among the four sets, it is taken as the optimal subset; otherwise, Ridge regression is used to select the optimal subset from the set pooled from the four sets, which is the second stage of LRMR-Ri. The LRMR-Ri method combined with W-GDipC was applied both to the antifreeze protein dataset (binary classification) and to the membrane protein dataset (multi-class classification). Experimental results show that this method performs well with support vector machine (SVM), decision tree (DT), and stochastic gradient descent (SGD) classifiers. With the antifreeze protein dataset and the SVM classifier, the ACC, RE, and MCC of LRMR-Ri with W-GDipC reached 95.56%, 97.06%, and 0.9105, respectively, much higher than those of each single method (Lasso, Ridge, MIC, and Relief), and nearly 13% higher in ACC than Lasso alone. CONCLUSION The experimental results show that the proposed LRMR-Ri and W-GDipC method can significantly improve the accuracy of antifreeze protein prediction compared with other similar single-feature methods. In addition, our method has also achieved good results in the classification and prediction of membrane proteins, which verifies its wide reliability to a certain extent.
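The two-stage decision rule of LRMR-Ri described above can be sketched with plain set operations. This is an illustrative sketch only: the four selectors are represented by their already-computed feature sets, and `top_n_by_score` is a hypothetical stand-in for the Ridge-based second stage, with made-up scores.

```python
from typing import Callable, Dict, Iterable, Set


def two_stage_select(candidate_sets: Iterable[Set[str]],
                     fallback: Callable[[Set[str]], Set[str]]) -> Set[str]:
    """Stage 1: return the features shared by all selectors, if any.
    Stage 2: otherwise let `fallback` pick from the pooled (union) set."""
    sets = [set(s) for s in candidate_sets]
    if not sets:
        raise ValueError("need at least one candidate feature set")
    common = set.intersection(*sets)
    if common:
        return common
    return fallback(set.union(*sets))


def top_n_by_score(scores: Dict[str, float], n: int) -> Callable[[Set[str]], Set[str]]:
    """Hypothetical stage-two selector: keep the n features with the largest
    absolute score (an assumed proxy for Ridge coefficient magnitude)."""
    def select(pooled: Set[str]) -> Set[str]:
        ranked = sorted(pooled, key=lambda f: (-abs(scores.get(f, 0.0)), f))
        return set(ranked[:n])
    return select
```

For example, with `pick = top_n_by_score({"f1": 0.9, "f2": -0.7, "f3": 0.1, "f4": 0.2}, 2)`, four sets that all contain `"f2"` yield `{"f2"}` directly, while four disjoint singleton sets fall through to the second stage and yield `{"f1", "f2"}`.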
Affiliation(s)
- Shunfang Wang
  - Department of Computer Science and Engineering, School of Information Science and Engineering, Yunnan University, Kunming, 650504, China.
- Lin Deng
  - Department of Computer Science and Engineering, School of Information Science and Engineering, Yunnan University, Kunming, 650504, China
- Xinnan Xia
  - Department of Computer Science and Engineering, School of Information Science and Engineering, Yunnan University, Kunming, 650504, China.
- Zicheng Cao
  - School of Public Health (Shenzhen), Sun Yat-Sen University, Guangzhou, 510006, China
- Yu Fei
  - School of Statistics and Mathematics, Yunnan University of Finance and Economics, Kunming, 650221, China.
|
30
|
Analysis of the Sequence Characteristics of Antifreeze Protein. Life (Basel) 2021; 11:life11060520. [PMID: 34204983 PMCID: PMC8226703 DOI: 10.3390/life11060520] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 04/09/2021] [Revised: 05/27/2021] [Accepted: 05/31/2021] [Indexed: 12/31/2022] Open
Abstract
Antifreeze protein (AFP) is a proteinaceous compound that improves antifreeze ability and binds to ice to prevent its growth. As surface-active materials, small amounts of AFPs have a tremendous influence on the growth of ice. Therefore, identifying novel AFPs is important for understanding protein–ice interactions and creating novel ice-binding domains. To date, predicting AFPs has been difficult because of the low sequence similarity of their ice-binding domains and the lack of common features among different AFPs. Here, a computational engine was developed to predict the features of AFPs and reveal the 39 most important features for AFP identification, such as antifreeze-like/N-acetylneuraminic acid synthase C-terminal, insect AFP motif, C-type lectin-like, and EGF-like domain. With this newly presented computational method, a group of previously confirmed functional AFP motifs was screened out. This study has identified some potential new AFP motifs and contributes to understanding biological antifreeze mechanisms.
|
31
|
Alim A, Rafay A, Naseem I. PoGB-pred: Prediction of Antifreeze Proteins Sequences Using Amino Acid Composition with Feature Selection Followed by a Sequential-based Ensemble Approach. Curr Bioinform 2021. [DOI: 10.2174/1574893615999200707141926] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Indexed: 11/22/2022]
Abstract
Background:
Proteins contribute significantly to every task of cellular life. Their functions encompass the building and repairing of tissues in humans and other organisms; hence they are the building blocks of bones, muscles, cartilage, skin, and blood. Similarly, antifreeze proteins (AFPs) are of prime significance for organisms that live in very cold areas. With the help of these proteins, cold-water organisms can survive at sub-zero temperatures and resist water crystallization, which may rupture internal cells and tissues. AFPs have also attracted attention and interest in the food industry and in cryopreservation.
Objective:
With the increasing availability of genomic sequence data for proteins, an automated and sophisticated tool for AFP recognition and identification is urgently needed. The sequences and structures of AFPs are highly distinct; therefore, most of the proposed methods fail to show promising results on different structures. A consolidated method is proposed to deliver competitive performance on highly distinct AFP structures.
Methods:
In this study, machine-learning algorithms, namely Principal Component Analysis (PCA) followed by Gradient Boosting (GB), are proposed for antifreeze protein identification. To analyze and validate the performance of the proposed model, various combinations of two-segment amino acid and dipeptide compositions are used. PCA is used for dimension reduction while retaining high variance in the data, and is followed by the gradient boosting ensemble method for modeling and classification.
Results:
The proposed method obtained superior performance on the PDB, Pfam, and UniProt datasets compared with the RAFP-Pred method. In experiment 3, using only 150 PCA components, an accuracy of 89.63 was achieved, which is superior to the 87.41 reported for the RAFP-Pred method using 300 significant features. Experiment 2 was conducted using two different datasets: non-AFPs from the PISCES server and AFPs from the Protein Data Bank. In this experiment, the proposed method attained a sensitivity of 79.16, which is 12.50 better than the state-of-the-art RAFP-Pred method.
Conclusion:
AFPs have a common function but distinct structures. Therefore, a single model developed for different sequences often fails to identify AFPs. The proposed model has shown robust results across diverse training and testing datasets and outperformed the previous AFP prediction method RAFP-Pred. The model consists of PCA for dimension reduction followed by gradient boosting for classification. Owing to its simplicity, scalability, and high performance, the model can be easily extended to the analysis of proteomic and genomic datasets.
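A two-segment amino acid composition of the kind mentioned in the Methods can be sketched as below. The abstract does not state where the sequence is split, so a midpoint split is assumed here, and the function names are hypothetical; the PCA and gradient boosting steps are not shown.

```python
from typing import List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aac(segment: str) -> List[float]:
    """Amino acid composition: fraction of each of the 20 standard residues."""
    total = sum(segment.count(a) for a in AMINO_ACIDS)
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [segment.count(a) / total for a in AMINO_ACIDS]


def two_segment_aac(sequence: str) -> List[float]:
    """Concatenate the composition of the first and second halves (40 values).

    The midpoint split is an assumption made for illustration.
    """
    mid = len(sequence) // 2
    return aac(sequence[:mid]) + aac(sequence[mid:])
```

Computing the composition per segment keeps some positional information that a whole-sequence composition would average away; a dipeptide version would follow the same pattern with 400 values per segment.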
Affiliation(s)
- Affan Alim
  - College of Computing and Information Sciences, Karachi Institute of Economics and Technology (KIET), Karachi 75190, Pakistan
- Abdul Rafay
  - College of Computing and Information Sciences, Karachi Institute of Economics and Technology (KIET), Karachi 75190, Pakistan
- Imran Naseem
  - School of Electrical, Electronic and Computer Engineering, the University of Western Australia, 35 Stirling Highway, Crawley, Western Australia 6009, Australia
|
32
|
Zhang S, Duan Z, Yang W, Qian C, You Y. iDHS-DASTS: identifying DNase I hypersensitive sites based on LASSO and stacking learning. Mol Omics 2021; 17:130-141. [PMID: 33295914 DOI: 10.1039/d0mo00115e] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Indexed: 11/21/2022]
Abstract
The DNase I hypersensitive site (DHS) is an important marker of DNA regulatory regions, and its identification in the DNA sequence is of great significance for biomedical research. However, traditional identification methods are extremely time-consuming and cannot obtain an accurate result. In this paper, we propose a predictor called iDHS-DASTS to identify DHSs based on benchmark datasets. First, we adopt a feature extraction method called PseDNC, which can incorporate the original DNA properties and spatial information of the DNA sequence. Then we use LASSO to reduce the dimensionality of the original data. Finally, we use stacking learning as the classifier, which combines AdaBoost, random forest, gradient boosting, extra trees, and SVM. Before training the classifier, we use SMOTE-Tomek to overcome the imbalance of the datasets. In the experiments, iDHS-DASTS achieves remarkable performance on three benchmark datasets, with state-of-the-art accuracies of over 92.06%, 91.06%, and 90.72% for datasets 𝕊1, 𝕊2, and 𝕊3, respectively. To verify the validity and transferability of our model, we established another independent dataset, 𝕊4, on which the accuracy reaches 90.31%. Furthermore, we used the proposed model to construct a user-friendly web server called iDHS-DASTS, which is available at http://www.xdu-duan.cn/.
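SMOTE-Tomek combines synthetic oversampling of the minority class (SMOTE) with removal of Tomek links. The sketch below illustrates only the Tomek-link half, using the standard definition (two samples of different classes that are each other's nearest neighbour); it is a brute-force illustration with a hypothetical function name, not the pipeline used in the paper.

```python
from typing import List, Sequence, Tuple


def tomek_links(points: Sequence[Sequence[float]],
                labels: Sequence[int]) -> List[Tuple[int, int]]:
    """Return index pairs (i, j), i < j, that form Tomek links.

    Uses squared Euclidean distance and an O(n^2) nearest-neighbour search;
    requires at least two points.
    """
    def nearest(i: int) -> int:
        return min((j for j in range(len(points)) if j != i),
                   key=lambda j: sum((a - b) ** 2
                                     for a, b in zip(points[i], points[j])))

    nn = [nearest(i) for i in range(len(points))]
    return [(i, j) for i, j in enumerate(nn)
            if i < j and nn[j] == i and labels[i] != labels[j]]
```

In a cleaning step, one or both members of each link would be dropped so that the class boundary becomes less noisy before the classifier is trained.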
Affiliation(s)
- Shengli Zhang
  - School of Mathematics and Statistics, Xidian University, Xi'an 710071, P. R. China.
- Zhengpeng Duan
  - School of Electronic Engineering, Xidian University, Xi'an 710071, P. R. China
- Wenhao Yang
  - School of Electronic Engineering, Xidian University, Xi'an 710071, P. R. China
- Chenlai Qian
  - School of Electronic Engineering, Xidian University, Xi'an 710071, P. R. China
- Yiwei You
  - International Business School, Shanghai University of International Business and Economics, Shanghai, 201620, P. R. China
|
33
|
Eskandari A, Leow TC, Rahman MBA, Oslan SN. Antifreeze Proteins and Their Practical Utilization in Industry, Medicine, and Agriculture. Biomolecules 2020; 10:biom10121649. [PMID: 33317024 PMCID: PMC7764015 DOI: 10.3390/biom10121649] [Citation(s) in RCA: 37] [Impact Index Per Article: 7.4] [Received: 10/08/2020] [Revised: 11/28/2020] [Accepted: 11/30/2020] [Indexed: 12/15/2022] Open
Abstract
Antifreeze proteins (AFPs) are specific proteins, glycopeptides, and peptides made by different organisms to allow cells to survive in sub-zero conditions. AFPs function by lowering the freezing point of water and preventing ice-crystal growth in the frozen state. Their ability to modify ice growth stabilizes ice crystals within a given temperature range and inhibits ice recrystallization, which decreases drip loss during thawing. This review presents the potential applications of AFPs from different sources and of different types. AFPs are found in diverse sources such as fish, yeast, plants, bacteria, and insects, and AFPs from different sources show different α-helix and β-sheet structures. Recently, AFPs have been analyzed with bioinformatics tools to characterize their functions in a reasonable time. AFPs can be used in a wide range of applications and have significant industrial functions, encompassing the enhancement of foods' freezing and liquefying properties, protection of plants against frost, enhancement of ice cream's texture, cryosurgery, and cryopreservation of cells and tissues. In conclusion, these applications and physical properties of AFPs can be further explored to serve other industrial users. Peptide-based AFPs can also be designed to improve their function.
Affiliation(s)
- Azadeh Eskandari
  - Enzyme and Microbial Technology Research Centre, Universiti Putra Malaysia, UPM, Serdang 43400, Selangor, Malaysia; (A.E.); (T.C.L.)
  - Department of Biochemistry, Faculty of Biotechnology and Biomolecular Sciences, Universiti Putra Malaysia, UPM, Serdang 43400, Selangor, Malaysia
- Thean Chor Leow
  - Enzyme and Microbial Technology Research Centre, Universiti Putra Malaysia, UPM, Serdang 43400, Selangor, Malaysia; (A.E.); (T.C.L.)
  - Department of Cell and Molecular Biology, Faculty of Biotechnology and Biomolecular Sciences, Universiti Putra Malaysia, UPM, Serdang 43400, Selangor, Malaysia
  - Enzyme Technology Laboratory, Institute of Bioscience, Universiti Putra Malaysia, UPM, Serdang 43400, Selangor, Malaysia
- Siti Nurbaya Oslan
  - Enzyme and Microbial Technology Research Centre, Universiti Putra Malaysia, UPM, Serdang 43400, Selangor, Malaysia; (A.E.); (T.C.L.)
  - Department of Biochemistry, Faculty of Biotechnology and Biomolecular Sciences, Universiti Putra Malaysia, UPM, Serdang 43400, Selangor, Malaysia
  - Enzyme Technology Laboratory, Institute of Bioscience, Universiti Putra Malaysia, UPM, Serdang 43400, Selangor, Malaysia
  - Correspondence: ; Tel.: +60-39769-6710; Fax: +60-39769-7590
|
34
|
Liu L, Hu X, Feng Z, Wang S, Sun K, Xu S. Recognizing Ion Ligand-Binding Residues by Random Forest Algorithm Based on Optimized Dihedral Angle. Front Bioeng Biotechnol 2020; 8:493. [PMID: 32596216 PMCID: PMC7303464 DOI: 10.3389/fbioe.2020.00493] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 12/20/2019] [Accepted: 04/28/2020] [Indexed: 11/26/2022] Open
Abstract
The prediction of ion ligand–binding residues in protein sequences is a challenging task that contributes to understanding the specific functions of proteins in life processes. In this article, we selected the binding residues of 14 ion ligands as research objects, including four acid radical ion ligands and 10 metal ion ligands. Based on the amino acid sequence information, we selected the composition and position conservation information of amino acids, the predicted structural information, and the physicochemical properties of amino acids as basic feature parameters. We then performed a statistical analysis and reclassification of dihedral angles and proposed new methods for extracting feature parameters. The methods mainly included applying information entropy to extract the polarization charge and hydrophilic–hydrophobic information of amino acids and using position weight matrices to extract position conservation information. In the prediction model, we used the random forest algorithm and obtained better prediction results than previous works. In the independent test, the Matthews correlation coefficient and accuracy for the 10 metal ion ligand–binding residues were larger than 0.07 and 52%, respectively; the corresponding values for the four acid radical ion ligand–binding residues were larger than 0.15 and 86%, respectively. Further, we classified and combined the phi and psi angles and optimized the prediction model for each ion ligand–binding residue.
Affiliation(s)
- Liu Liu
  - College of Sciences, Inner Mongolia University of Technology, Hohhot, China
- Xiuzhen Hu
  - College of Sciences, Inner Mongolia University of Technology, Hohhot, China
- Zhenxing Feng
  - College of Sciences, Inner Mongolia University of Technology, Hohhot, China
- Shan Wang
  - College of Sciences, Inner Mongolia University of Technology, Hohhot, China
- Kai Sun
  - College of Sciences, Inner Mongolia University of Technology, Hohhot, China
- Shuang Xu
  - College of Sciences, Inner Mongolia University of Technology, Hohhot, China
|
35
|
Chou KC. An Insightful 10-year Recollection Since the Emergence of the 5-steps Rule. Curr Pharm Des 2020; 25:4223-4234. [PMID: 31782354 DOI: 10.2174/1381612825666191129164042] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Received: 09/18/2019] [Accepted: 11/25/2019] [Indexed: 11/22/2022]
Abstract
OBJECTIVE One of the most challenging problems is how to formulate a biological sequence as a vector while still retaining considerable sequence-order information. METHODS To address this problem, the approach of Pseudo Amino Acid Components, or PseAAC, has been developed. RESULTS AND CONCLUSION It has become increasingly clear from this 10-year recollection that the aforementioned proposal has indeed been very powerful.
Affiliation(s)
- Kuo-Chen Chou
  - Gordon Life Science Institute, Boston, Massachusetts 02478, United States
  - Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
|
36
|
Che J, Chen L, Guo ZH, Wang S, Aorigele. Drug Target Group Prediction with Multiple Drug Networks. Comb Chem High Throughput Screen 2020; 23:274-284. [DOI: 10.2174/1386207322666190702103927] [Citation(s) in RCA: 24] [Impact Index Per Article: 4.8] [Received: 12/29/2018] [Revised: 03/11/2019] [Accepted: 04/15/2019] [Indexed: 02/07/2023]
Abstract
Background:
Identification of drug-target interaction is essential in drug discovery. It is beneficial to predict unexpected therapeutic or adverse side effects of drugs. To date, several computational methods have been proposed to predict drug-target interactions because they are prompt and low-cost compared with traditional wet experiments.
Methods:
In this study, we investigated this problem in a different way. According to KEGG, drugs were classified into several groups based on their target proteins. A multi-label classification model was presented to assign drugs into correct target groups. To make full use of the known drug properties, five networks were constructed, each of which represented drug associations in one property. A powerful network embedding method, Mashup, was adopted to extract drug features from above-mentioned networks, based on which several machine learning algorithms, including RAndom k-labELsets (RAKEL) algorithm, Label Powerset (LP) algorithm and Support Vector Machine (SVM), were used to build the classification model.
Results and Conclusion:
Tenfold cross-validation yielded the accuracy of 0.839, exact match of 0.816 and hamming loss of 0.037, indicating good performance of the model. The contribution of each network was also analyzed. Furthermore, the network model with multiple networks was found to be superior to the one with a single network and classic model, indicating the superiority of the proposed model.
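Two of the multi-label measures reported above, exact match and Hamming loss, have simple standard definitions that can be sketched as follows. Label sets are represented as Python sets of label indices; the function names and the toy data are hypothetical, and the paper's "accuracy" measure is not reproduced because the abstract does not define it.

```python
from typing import Sequence, Set


def exact_match(y_true: Sequence[Set[int]], y_pred: Sequence[Set[int]]) -> float:
    """Fraction of samples whose predicted label set equals the true set."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return hits / len(y_true)


def hamming_loss(y_true: Sequence[Set[int]], y_pred: Sequence[Set[int]],
                 n_labels: int) -> float:
    """Fraction of (sample, label) decisions that are wrong.

    The symmetric difference counts labels that are missed or wrongly added.
    """
    wrong = sum(len(t ^ p) for t, p in zip(y_true, y_pred))
    return wrong / (len(y_true) * n_labels)


# Toy example with 3 drugs and 4 possible target groups.
truth = [{0, 1}, {2}, {1}]
predicted = [{0, 1}, {1, 2}, {0}]
```

Here `exact_match(truth, predicted)` is 1/3, while `hamming_loss(truth, predicted, n_labels=4)` is 3/12 = 0.25: exact match penalizes a sample for any error, whereas Hamming loss gives partial credit for each correct label decision.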
Affiliation(s)
- Jingang Che
  - College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China
- Lei Chen
  - College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China
- Zi-Han Guo
  - College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China
- Shuaiqun Wang
  - College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China
- Aorigele
  - Faculty of Engineering, University of Toyama, Toyama, Japan
|
37
|
Usman M, Khan S, Lee JA. AFP-LSE: Antifreeze Proteins Prediction Using Latent Space Encoding of Composition of k-Spaced Amino Acid Pairs. Sci Rep 2020; 10:7197. [PMID: 32345989 PMCID: PMC7188683 DOI: 10.1038/s41598-020-63259-2] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.0] [Received: 01/08/2020] [Accepted: 03/26/2020] [Indexed: 02/06/2023] Open
Abstract
Species living in extremely cold environments resist freezing conditions through antifreeze proteins (AFPs). Apart from being essential proteins for various organisms living in sub-zero temperatures, AFPs have numerous applications in different industries. They bear very little resemblance to each other and cannot be easily identified using simple search algorithms such as BLAST and PSI-BLAST. The diverse AFPs found in fishes (Types I, II, III, and IV and antifreeze glycoproteins (AFGPs)) are sub-types that show low sequence and structural similarity, making their accurate prediction challenging. Although several machine-learning methods have been proposed for the classification of AFPs, prediction methods with greater reliability are required. In this paper, we propose a novel machine-learning-based approach for the prediction of AFP sequences using latent space learning through a deep auto-encoder method. For latent space pruning, we use the output of the auto-encoder with a deep neural network classifier to learn the non-linear mapping between the protein sequence descriptor and the class label. A comprehensive ablation study is performed, and the proposed method is evaluated in terms of widely used performance measures. In particular, the proposed method demonstrated a Matthews correlation coefficient of 0.52, an F-score of 0.49, and a Youden's index of 0.81 on an independent test dataset, thereby outperforming the existing methods for AFP prediction.
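The three measures quoted above are all functions of the binary confusion matrix. The sketch below uses their standard definitions; the function name and the example counts are hypothetical and are not taken from the paper.

```python
import math
from typing import Dict


def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Standard binary measures; undefined ratios are reported as 0.0."""
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f_score = (2 * precision * sensitivity / (precision + sensitivity)
               if precision + sensitivity else 0.0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f_score": f_score,
        "mcc": mcc,
        # Youden's index J = sensitivity + specificity - 1
        "youden": sensitivity + specificity - 1,
    }


# Hypothetical counts for illustration only.
example = binary_metrics(tp=8, fp=2, tn=88, fn=2)
```

Because Youden's index ignores precision, it can stay high on a test set with few positives even when the F-score is modest, which is one plausible reading of why the three reported values differ so much.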
Affiliation(s)
- Muhammad Usman
  - Department of Computer Engineering, Chosun University, Gwangju, 61452, Republic of Korea
- Shujaat Khan
  - Department of Bio and Brain Engineering, Korea Advanced Institute of Science and Technology (KAIST), Daejeon, 34141, Republic of Korea
- Jeong-A Lee
  - Department of Computer Engineering, Chosun University, Gwangju, 61452, Republic of Korea.
|
38
|
Sun S, Ding H, Wang D, Han S. Identifying Antifreeze Proteins Based on Key Evolutionary Information. Front Bioeng Biotechnol 2020; 8:244. [PMID: 32274383 PMCID: PMC7113384 DOI: 10.3389/fbioe.2020.00244] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 01/14/2020] [Accepted: 03/09/2020] [Indexed: 01/08/2023] Open
Abstract
Antifreeze proteins are important antifreeze materials that have been widely used in industry, including in cryopreservation, de-icing, and food storage applications. However, the quantity of some commercially produced antifreeze proteins is insufficient for large-scale industrial applications. Further, many antifreeze proteins have properties such as cytotoxicity that severely hinder their applications. Understanding the mechanisms underlying protein-ice interactions and identifying novel antifreeze proteins are, therefore, urgently needed. In this study, to uncover the mechanisms underlying protein-ice interactions and provide an efficient and accurate tool for identifying antifreeze proteins, we assessed various evolutionary features based on position-specific scoring matrices (PSSMs) and evaluated their importance for discriminating between antifreeze and non-antifreeze proteins. We then parsimoniously selected seven key features with the highest importance. We found that the selected features showed opposite tendencies (regarding the conservation of certain amino acids) between antifreeze and non-antifreeze proteins. Five of the seven features had relatively high contributions to the discrimination of antifreeze and non-antifreeze proteins, as revealed by a principal component analysis, i.e., the conservation of the replacement of Cys, Trp, and Gly in antifreeze proteins by Ala, Met, and Ala, respectively, in the related proteins, and the conservation of the replacement of Arg in non-antifreeze proteins by Ser and Arg in the related proteins. Based on the seven parsimoniously selected key features, we established a classifier using a support vector machine, which outperformed the state-of-the-art tools. These results suggest that understanding evolutionary information is crucial to designing accurate automated methods for discriminating antifreeze and non-antifreeze proteins. Our classifier, therefore, is an efficient tool for annotating new proteins with antifreeze functions based on sequence information and can facilitate their application in industry.
Collapse
Affiliation(s)
- Shanwen Sun
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
| | - Hui Ding
- Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
| | - Donghua Wang
- Department of General Surgery, Heilongjiang Province Land Reclamation Headquarters General Hospital, Harbin, China
| | - Shuguang Han
- Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
| |
|
39
|
Xiang H, Yang X, Ke L, Hu Y. The properties, biotechnologies, and applications of antifreeze proteins. Int J Biol Macromol 2020; 153:661-675. [PMID: 32156540 DOI: 10.1016/j.ijbiomac.2020.03.040] [Citation(s) in RCA: 42] [Impact Index Per Article: 8.4] [Received: 01/13/2020] [Revised: 03/04/2020] [Accepted: 03/06/2020] [Indexed: 01/30/2023]
Abstract
Through natural selection, organisms have evolved different solutions to cope with extremely cold weather. The emergence of antifreeze protein genes is one of the most significant of these solutions. Antifreeze proteins give organisms an important functional ability to survive in cold environments and are widely found in cold-tolerant species. In this review, we summarize the origin of antifreeze proteins, describe the diversity of their species-specific properties and functions, and highlight the related biotechnology on the basis of both laboratory tests and bioinformatics analysis. The most recent advances in the applications of antifreeze proteins are also discussed. We expect that this systematic review will give readers a comprehensive knowledge of antifreeze proteins.
Affiliation(s)
- Hong Xiang
- Shenzhen Institute of Synthetic Biology, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, People's Republic of China.; CAS Key Laboratory of Quantitative Engineering Biology, Shenzhen Institutes of Advanced Technology
| | - Xiaohu Yang
- Shenzhen Institute of Synthetic Biology, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, People's Republic of China.; CAS Key Laboratory of Quantitative Engineering Biology, Shenzhen Institutes of Advanced Technology
| | - Lei Ke
- Shenzhen Institute of Synthetic Biology, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, People's Republic of China.; CAS Key Laboratory of Quantitative Engineering Biology, Shenzhen Institutes of Advanced Technology
| | - Yong Hu
- Shenzhen Institute of Synthetic Biology, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, People's Republic of China.; CAS Key Laboratory of Quantitative Engineering Biology, Shenzhen Institutes of Advanced Technology.
| |
|
40
|
Pugalenthi G, Nithya V, Chou KC, Archunan G. Nglyc: A Random Forest Method for Prediction of N-Glycosylation Sites in Eukaryotic Protein Sequence. Protein Pept Lett 2020; 27:178-186. [DOI: 10.2174/0929866526666191002111404] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Received: 04/22/2019] [Revised: 07/26/2019] [Accepted: 07/29/2019] [Indexed: 01/29/2023]
Abstract
Background:
N-Glycosylation is one of the most important post-translational mechanisms in eukaryotes. N-glycosylation predominantly occurs in the N-X-[S/T] sequon, where X is any amino acid other than proline. However, not all N-X-[S/T] sequons in proteins are glycosylated. Therefore, accurate prediction of N-glycosylation sites is essential to understand the N-glycosylation mechanism.
Objective:
In this article, our motivation is to develop a computational method to predict N-glycosylation sites in eukaryotic protein sequences.
Methods:
In this article, we report a random forest method, Nglyc, to predict N-glycosylation sites from protein sequence, using 315 sequence features. The method was trained using a dataset of 600 N-glycosylation sites and 600 non-glycosylation sites and tested on a dataset containing 295 N-glycosylation sites and 253 non-glycosylation sites. Nglyc prediction was compared with the NetNGlyc, EnsembleGly and GPP methods. Further, the performance of Nglyc was evaluated using human and mouse N-glycosylation sites.
Results:
The Nglyc method achieved an overall training accuracy of 0.8033 with all 315 features. Performance comparison with the NetNGlyc, EnsembleGly and GPP methods shows that Nglyc performs better than the other methods, with high sensitivity and specificity.
Conclusion:
Our method achieved an overall accuracy of 0.8248 with 0.8305 sensitivity and 0.8182 specificity. The comparison study shows that our method performs better than the other methods. The applicability and success of our method were further evaluated using human and mouse N-glycosylation sites. The Nglyc method is freely available at https://github.com/bioinformaticsML/ Ngly.
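The N-X-[S/T] sequon rule stated in this abstract can be illustrated with a short sketch. This is not the Nglyc method (which is a random forest over 315 features); it only enumerates candidate sequons, which a predictor would then have to classify. The function name `find_sequons` and the example sequences are hypothetical.

```python
import re

# Candidate N-glycosylation sequon: Asn, then any residue except Pro, then Ser or Thr.
# The lookahead keeps the match one residue wide so overlapping sequons are all found.
SEQUON = re.compile(r"N(?=[^P][ST])")


def find_sequons(sequence):
    """Return 1-based positions of Asn residues that start an N-X-[S/T] sequon."""
    return [match.start() + 1 for match in SEQUON.finditer(sequence.upper())]


if __name__ == "__main__":
    # Hypothetical peptide: N-G-T and N-V-S qualify, N-P-S does not (X is proline).
    print(find_sequons("MNGTANPSNVS"))
```

As the abstract notes, not every sequon found this way is actually glycosylated, so these positions are only candidates.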
Affiliation(s)
- Ganesan Pugalenthi
- Pheromone Technology Laboratory, Department of Animal Science, Bharathidasan University, Tiruchirappalli- 620024, India
| | - Varadharaju Nithya
- Department of Animal Health Management, Alagappa University, Karaikudi-630003, India
| | - Kuo-Chen Chou
- Gordon Life Science Institute, San Diego, CA 92130, United States
| | - Govindaraju Archunan
- Pheromone Technology Laboratory, Department of Animal Science, Bharathidasan University, Tiruchirappalli- 620024, India
| |
|
41
|
Zhao X, Chen L, Guo ZH, Liu T. Predicting Drug Side Effects with Compact Integration of Heterogeneous Networks. Curr Bioinform 2019. [DOI: 10.2174/1574893614666190220114644] [Citation(s) in RCA: 72] [Impact Index Per Article: 12.0] [Indexed: 12/11/2022]
Abstract
Background:
The side effects of drugs are not only harmful to humans but also the major reason for withdrawing approved drugs, bringing greater risks for pharmaceutical companies. However, detecting the side effects of a given drug via traditional experiments is time-consuming and expensive. In recent years, several computational methods have been proposed to predict the side effects of drugs. However, most of these methods cannot effectively integrate the heterogeneous properties of drugs.
Methods:
In this study, we adopted a network embedding method, Mashup, to extract essential and informative drug features from several heterogeneous drug networks, each representing different properties of drugs. A network was also built for side effects, from which side effect features were extracted. These features can capture essential information about drugs and side effects at the network level. Drug and side effect features were combined to represent each drug-side effect pair, which was treated as a sample in this study. These samples were then fed into a random forest (RF) algorithm to construct the prediction model, called the RF network model.
Results:
The RF network model was evaluated by several tests. The average Matthews correlation coefficients on the balanced and unbalanced datasets were 0.640 and 0.641, respectively.
Conclusion:
The RF network model was superior to the models incorporating other machine learning algorithms and to one previous model. Finally, we also investigated the influence of two feature dimension parameters on the RF network model and found that our model was not very sensitive to these parameters.
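For readers unfamiliar with the Matthews correlation coefficient (MCC) reported in the Results above, the following is a minimal sketch of how it is computed from a binary confusion matrix, using the standard definition MCC = (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). The function name and the counts in the example are hypothetical and are not taken from the study.

```python
import math


def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews correlation coefficient for a binary confusion matrix.

    Returns 0.0 when any marginal total is zero, a common convention
    for the otherwise undefined 0/0 case.
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return numerator / denominator


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    print(round(matthews_corrcoef(tp=90, tn=80, fp=20, fn=10), 3))
```

MCC ranges from -1 to 1 and, unlike plain accuracy, stays informative when the two classes differ greatly in size, which is presumably why it is reported for both the balanced and unbalanced datasets.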
Affiliation(s)
- Xian Zhao
- College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China
| | - Lei Chen
- College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China
| | - Zi-Han Guo
- College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China
| | - Tao Liu
- College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China
| |
|
42
|
Beltrán Lissabet JF, Herrera Belén L, Farias JG. TTAgP 1.0: A computational tool for the specific prediction of tumor T cell antigens. Comput Biol Chem 2019; 83:107103. [DOI: 10.1016/j.compbiolchem.2019.107103] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Received: 02/18/2019] [Revised: 06/20/2019] [Accepted: 08/10/2019] [Indexed: 01/27/2023]
|
43
|
Wang F, Guan ZX, Dao FY, Ding H. A Brief Review of the Computational Identification of Antifreeze Protein. Curr Org Chem 2019. [DOI: 10.2174/1385272823666190718145613] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Indexed: 12/12/2022]
Abstract
Many cold-adapted organisms produce antifreeze proteins (AFPs) to counter the freezing of cell fluids by controlling the growth of ice crystals. AFPs have been found in various species, including vertebrates, invertebrates, plants, bacteria, and fungi. AFPs from fish, insects and plants display high diversity. Thus, the identification of AFPs is a challenging task in computational proteomics. With the accumulation of known AFPs and the development of machine learning methods, it has become possible to construct high-throughput tools to identify AFPs in a timely manner. In this review, we briefly survey the application of machine learning methods to antifreeze protein identification from different aspects, including published benchmark datasets, sequence descriptors, classification algorithms and published methods. We hope that this review will inspire new ideas and directions for research on identifying antifreeze proteins.
Affiliation(s)
- Fang Wang
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Zheng-Xing Guan
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Fu-Ying Dao
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Hui Ding
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
| |
|
44
|
Chou KC. Advances in Predicting Subcellular Localization of Multi-label Proteins and its Implication for Developing Multi-target Drugs. Curr Med Chem 2019; 26:4918-4943. [PMID: 31060481 DOI: 10.2174/0929867326666190507082559] [Citation(s) in RCA: 78] [Impact Index Per Article: 13.0] [Received: 09/28/2018] [Revised: 01/29/2019] [Accepted: 01/31/2019] [Indexed: 12/16/2022]
Abstract
The smallest unit of life is a cell, which contains numerous protein molecules. Most of the functions critical to the cell’s survival are performed by these proteins located in its different organelles, usually called “subcellular locations”. Information on the subcellular localization of a protein can provide useful clues about its function. To reveal the intricate pathways at the cellular level, knowledge of the subcellular localization of proteins in a cell is a prerequisite. Therefore, one of the fundamental goals in molecular cell biology and proteomics is to determine the subcellular locations of proteins in an entire cell. It is also indispensable for prioritizing and selecting the right targets for drug development. Unfortunately, it is both time-consuming and costly to determine the subcellular locations of proteins purely based on experiments. With the avalanche of protein sequences generated in the post-genomic age, it is highly desirable to develop computational methods for rapidly and effectively identifying the subcellular locations of uncharacterized proteins based on their sequence information alone. Indeed, considerable progress has been achieved in this regard. This review focuses on those methods that can deal with multi-label proteins, which may simultaneously exist in two or more subcellular location sites. Protein molecules with this characteristic are vitally important for finding multi-target drugs, a current hot trend in drug development. This review also focuses on those methods that have user-friendly web-servers established, so that the majority of experimental scientists can use them to get the desired results without the need to go through the detailed mathematics involved.
Affiliation(s)
- Kuo-Chen Chou
- Gordon Life Science Institute, Boston, MA 02478, United States
| |
|
46
|
Xiao X, Cheng X, Chen G, Mao Q, Chou KC. pLoc_bal-mVirus: Predict Subcellular Localization of Multi-Label Virus Proteins by Chou's General PseAAC and IHTS Treatment to Balance Training Dataset. Med Chem 2019; 15:496-509. [DOI: 10.2174/1573406415666181217114710] [Citation(s) in RCA: 44] [Impact Index Per Article: 7.3] [Received: 09/20/2018] [Revised: 10/23/2018] [Accepted: 12/12/2018] [Indexed: 12/17/2022]
Abstract
Background/Objective:
Knowledge of protein subcellular localization is vitally important for both basic research and drug development. Facing the avalanche of protein sequences emerging in the post-genomic age, it is urgent to develop computational tools for timely and effectively identifying their subcellular localization based on the sequence information alone. Recently, a predictor called “pLoc-mVirus” was developed for identifying the subcellular localization of virus proteins. Its performance is overwhelmingly better than that of the other predictors for the same purpose, particularly in dealing with multi-label systems in which some proteins, known as “multiplex proteins”, may simultaneously occur in, or move between, two or more subcellular location sites. Although it is indeed a very powerful predictor, more efforts are definitely needed to further improve it. This is because pLoc-mVirus was trained on an extremely skewed dataset in which some subsets were over 10 times the size of the others. Accordingly, it cannot avoid the bias caused by such an uneven training dataset.
Methods:
Using Chou's general PseAAC (Pseudo Amino Acid Composition) approach and the IHTS (Inserting Hypothetical Training Samples) treatment to balance the training dataset, we have developed a new predictor called “pLoc_bal-mVirus” for predicting the subcellular localization of multi-label virus proteins.
Results:
Cross-validation tests on exactly the same experiment-confirmed dataset have indicated that the proposed new predictor is remarkably superior to pLoc-mVirus, the existing state-of-the-art predictor for the same purpose.
Conclusion:
Its user-friendly web-server is available at http://www.jci-bioinfo.cn/pLoc_bal-mVirus/, by which the majority of experimental scientists can easily get their desired results without the need to go through the detailed mathematics. Accordingly, pLoc_bal-mVirus will become a very useful tool for designing multi-target drugs and for in-depth understanding of the biological processes in a cell.
Affiliation(s)
- Xuan Xiao
- Gordon Life Science Institute, Boston, MA 02478, United States
| | - Xiang Cheng
- Gordon Life Science Institute, Boston, MA 02478, United States
| | - Genqiang Chen
- College of Chemistry, Chemical Engineering and Biotechnology, Donghua University, Shanghai 201620, China
| | - Qi Mao
- College of Information Science and Technology, Donghua University, Shanghai, China
| | - Kuo-Chen Chou
- Gordon Life Science Institute, Boston, MA 02478, United States
| |
|
47
|
Chou KC, Cheng X, Xiao X. pLoc_bal-mEuk: Predict Subcellular Localization of Eukaryotic Proteins by General PseAAC and Quasi-balancing Training Dataset. Med Chem 2019; 15:472-485. [DOI: 10.2174/1573406415666181218102517] [Citation(s) in RCA: 40] [Impact Index Per Article: 6.7] [Received: 08/28/2018] [Revised: 10/23/2018] [Accepted: 12/12/2018] [Indexed: 12/24/2022]
Abstract
Background/Objective:
Information on protein subcellular localization is crucially important for both basic research and drug development. With the explosive growth of protein sequences discovered in the post-genomic age, there is high demand for powerful bioinformatics tools that can timely and effectively identify their subcellular localization based on sequence information alone. Recently, a predictor called “pLoc-mEuk” was developed for identifying the subcellular localization of eukaryotic proteins. Its performance is overwhelmingly better than that of the other predictors for the same purpose, particularly in dealing with multi-label systems where many proteins, called “multiplex proteins”, may simultaneously occur in two or more subcellular locations. Although it is indeed a very powerful predictor, more efforts are definitely needed to further improve it. This is because pLoc-mEuk was trained on an extremely skewed dataset where some subsets were about 200 times the size of the others. Accordingly, it cannot avoid the bias caused by such an uneven training dataset.
Methods:
To alleviate such bias, we have developed a new predictor called pLoc_bal-mEuk by quasi-balancing the training dataset. Cross-validation tests on exactly the same experiment-confirmed dataset have indicated that the proposed new predictor is remarkably superior to pLoc-mEuk, the existing state-of-the-art predictor in identifying the subcellular localization of eukaryotic proteins. It has not escaped our notice that the quasi-balancing treatment can also be used to deal with many other biological systems.
Results:
To maximize the convenience for most experimental scientists, a user-friendly web-server for the new predictor has been established at http://www.jci-bioinfo.cn/pLoc_bal-mEuk/.
Conclusion:
It is anticipated that the pLoc_bal-mEuk predictor holds very high potential to become a useful high-throughput tool for identifying the subcellular localization of eukaryotic proteins, particularly for finding multi-target drugs, which is currently a very hot trend in drug development.
Affiliation(s)
- Kuo-Chen Chou
- Gordon Life Science Institute, Boston, MA 02478, United States
| | - Xiang Cheng
- Gordon Life Science Institute, Boston, MA 02478, United States
| | - Xuan Xiao
- Gordon Life Science Institute, Boston, MA 02478, United States
| |
|
48
|
Lai HY, Zhang ZY, Su ZD, Su W, Ding H, Chen W, Lin H. iProEP: A Computational Predictor for Predicting Promoter. Mol Ther Nucleic Acids 2019; 17:337-346. [PMID: 31299595 PMCID: PMC6616480 DOI: 10.1016/j.omtn.2019.05.028] [Citation(s) in RCA: 110] [Impact Index Per Article: 18.3] [Received: 02/23/2019] [Revised: 05/18/2019] [Accepted: 05/19/2019] [Indexed: 11/29/2022]
Abstract
A promoter is a fundamental DNA element located around the transcription start site (TSS) that can regulate gene transcription. Promoter recognition is of great significance in determining transcription units, studying gene structure, analyzing gene regulation mechanisms, and annotating gene functional information. Many models have already been proposed to predict promoters. However, the performance of these methods still needs to be improved. In this work, we combined pseudo k-tuple nucleotide composition (PseKNC) with a position-correlation scoring function (PCSF) to formulate promoter sequences of Homo sapiens (H. sapiens), Drosophila melanogaster (D. melanogaster), Caenorhabditis elegans (C. elegans), Bacillus subtilis (B. subtilis), and Escherichia coli (E. coli). The Minimum Redundancy Maximum Relevance (mRMR) algorithm and an incremental feature selection strategy were then adopted to find optimal feature subsets. A support vector machine (SVM) was used to distinguish between promoters and non-promoters. In the 10-fold cross-validation test, accuracies of 93.3%, 93.9%, 95.7%, 95.2%, and 93.1% were obtained for H. sapiens, D. melanogaster, C. elegans, B. subtilis, and E. coli, with areas under the receiver operating characteristic curves (AUCs) of 0.974, 0.975, 0.981, 0.988, and 0.976, respectively. Comparative results demonstrated that our method outperforms existing methods for identifying promoters. An online web server was established that can be freely accessed (http://lin-group.cn/server/iProEP/).
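To give a feel for the kind of sequence encoding this abstract refers to, here is a minimal sketch of plain k-tuple nucleotide composition, i.e. the normalized frequency of every k-mer in a DNA sequence. This is only the simplest building block: full PseKNC adds sequence-order correlation terms derived from physicochemical properties, and those are not shown here. The function name and the example sequence are hypothetical, not taken from iProEP.

```python
from itertools import product


def ktuple_composition(sequence, k):
    """Return the normalized frequency of every DNA k-mer in `sequence`.

    Windows containing characters other than A, C, G, T are skipped.
    The result always has 4**k entries, so it can serve as a fixed-length
    feature vector for a classifier such as an SVM.
    """
    counts = {"".join(p): 0 for p in product("ACGT", repeat=k)}
    sequence = sequence.upper()
    total = 0
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        if kmer in counts:
            counts[kmer] += 1
            total += 1
    if total == 0:
        return {kmer: 0.0 for kmer in counts}
    return {kmer: count / total for kmer, count in counts.items()}


if __name__ == "__main__":
    features = ktuple_composition("TATAATGCGC", k=2)  # hypothetical sequence
    print(len(features), features["TA"])
```

A fixed-length vector like this is what makes feature ranking (for example with mRMR) and SVM training straightforward, since every sequence maps to the same set of dimensions regardless of its length.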
Affiliation(s)
- Hong-Yan Lai
- Key Laboratory for NeuroInformation of Ministry of Education, School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Zhao-Yue Zhang
- Key Laboratory for NeuroInformation of Ministry of Education, School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Zhen-Dong Su
- Key Laboratory for NeuroInformation of Ministry of Education, School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Wei Su
- Key Laboratory for NeuroInformation of Ministry of Education, School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Hui Ding
- Key Laboratory for NeuroInformation of Ministry of Education, School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Wei Chen
- Key Laboratory for NeuroInformation of Ministry of Education, School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China; Innovative Institute of Chinese Medicine and Pharmacy, Chengdu University of Traditional Chinese Medicine, Chengdu 611730, China; Center for Genomics and Computational Biology, School of Life Sciences, North China University of Science and Technology, Tangshan 063000, China.
| | - Hao Lin
- Key Laboratory for NeuroInformation of Ministry of Education, School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China.
| |
|
49
|
Sustainability Evaluation of Process Planning for Single CNC Machine Tool under the Consideration of Energy-Efficient Control Strategies Using Random Forests. Sustainability 2019. [DOI: 10.3390/su11113060] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Indexed: 11/16/2022]
Abstract
As an important part of industrialized society, manufacturing consumes a large amount of raw materials and energy, which motivates decision-makers to tackle this problem in different ways. Process planning is an important optimization method for achieving this objective, and the evaluation of energy consumption, carbon emissions, or sustainability is the basis for the optimization stage. Although such evaluation research has drawn a great deal of attention, most of it neglects the influence of state control of machine tools on the energy consumption of machining processes. To address this issue, a sustainability evaluation method of process planning for a single computer numerical control (CNC) machine tool considering energy-efficient control strategies has been developed. First, four energy-efficient control strategies of CNC machine tools are constructed to reduce their energy consumption. Second, a bi-level energy-efficient decision-making mechanism using random forests is established to select appropriate control strategies for different occasions. Then, three indicators are adopted to evaluate the sustainability of process planning under the consideration of energy-efficient control strategies, i.e., energy consumption, relative delay time, and machining costs. Finally, a pedestal part machined by a 3-axis vertical milling machine tool is used to verify the proposed methods. The results show that the reduction in energy consumption considering energy-efficient control strategies reaches 25%.
|
50
|
Niu B, Liang C, Lu Y, Zhao M, Chen Q, Zhang Y, Zheng L, Chou KC. Glioma stages prediction based on machine learning algorithm combined with protein-protein interaction networks. Genomics 2019; 112:837-847. [PMID: 31150762 DOI: 10.1016/j.ygeno.2019.05.024] [Citation(s) in RCA: 22] [Impact Index Per Article: 3.7] [Received: 04/03/2019] [Accepted: 05/25/2019] [Indexed: 12/18/2022]
Abstract
Background:
Glioma is the most lethal nervous system cancer. Recent studies have made great efforts to study the occurrence and development of glioma, but the molecular mechanisms are still unclear. This study was designed to reveal the molecular mechanisms of glioma based on protein-protein interaction networks combined with machine learning methods. Key differentially expressed genes (DEGs) were screened and selected by using protein-protein interaction (PPI) networks.
Results:
In total, 19 genes were selected between grade I and grade II, 21 genes between grade II and grade III, and 20 genes between grade III and grade IV. Then, five machine learning methods were employed to predict glioma stages based on the selected key genes. After comparison, a Complement Naive Bayes classifier was employed to build the prediction model for grade II-III, with an accuracy of 72.8%. Random forest was employed to build the prediction models for grade I-II and grade III-IV, with accuracies of 97.1% and 83.2%, respectively. Finally, the selected genes were analyzed by PPI networks, Gene Ontology (GO) terms and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways, and the results improve our understanding of the biological functions of the selected DEGs involved in glioma growth. We expect that the key genes identified can offer guidance on the occurrence of gliomas or, at the very least, that they are useful for tumor researchers.
Conclusion:
Machine learning combined with PPI networks and GO and KEGG analyses of selected DEGs improves our understanding of the biological functions involved in glioma growth.
Affiliation(s)
- Bing Niu
- School of Life Sciences, Shanghai University, Shanghai 200444, China; Gordon Life Science Institute, Boston, MA 02478, USA.
| | - Chaofeng Liang
- Department of Neurosurgery, The Third Affiliated Hospital of Sun Yat-sen University, Guangzhou, China
| | - Yi Lu
- School of Life Sciences, Shanghai University, Shanghai 200444, China
| | - Manman Zhao
- School of Life Sciences, Shanghai University, Shanghai 200444, China
| | - Qin Chen
- School of Life Sciences, Shanghai University, Shanghai 200444, China.
| | - Yuhui Zhang
- Renji Hospital, Medical School, Shanghai Jiaotong University, 160 Pujian Rd, New Pudong District, Shanghai 200127, China; Changhai Hospital, Second Military Medical University, Shanghai 200433, China.
| | - Linfeng Zheng
- Department of Radiology, Shanghai General Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai 200080, China; Department of Radiology, Shanghai First People's Hospital, Baoshan Branch, Shanghai 200940, China.
| | - Kuo-Chen Chou
- Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China; Gordon Life Science Institute, Boston, MA 02478, USA.
| |
|