1
Susanty M, Mursalim MKN, Hertadi R, Purwarianti A, LE Rajab T. Leveraging protein language model embeddings and logistic regression for efficient and accurate in-silico acidophilic proteins classification. Comput Biol Chem 2024; 112:108163. [PMID: 39098138 DOI: 10.1016/j.compbiolchem.2024.108163] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/11/2024] [Revised: 07/02/2024] [Accepted: 07/24/2024] [Indexed: 08/06/2024]
Abstract
The increasing demand for eco-friendly technologies in biotechnology necessitates effective and sustainable catalysts. Acidophilic proteins, functioning optimally in highly acidic environments, hold immense promise for various applications, including food production, biofuels, and bioremediation. However, limited knowledge about these proteins hinders their exploration. This study addresses this gap by employing in silico methods utilizing computational tools and machine learning. We propose a novel approach to predict acidophilic proteins using protein language models (PLMs), accelerating discovery without extensive lab work. Our investigation highlights the potential of PLMs in understanding and harnessing acidophilic proteins for scientific and industrial advancements. We introduce the ACE model, which combines a simple Logistic Regression model with embeddings derived from protein sequences processed by the ProtT5 PLM. This model achieves high performance on an independent test set, with accuracy (0.91), F1-score (0.93), and Matthews correlation coefficient (0.76). To our knowledge, this is the first application of pre-trained PLM embeddings for acidophilic protein classification. The ACE model serves as a powerful tool for exploring protein acidophilicity, paving the way for future advancements in protein design and engineering.
Affiliation(s)
- Meredita Susanty
  - Institut Teknologi Bandung School of Electrical Engineering and Informatics, Jl. Ganesa 10, Bandung, Jawa Barat, Indonesia; Universitas Pertamina, School of Computer Science, Jl Teuku Nyak Arief, Jakarta Selatan, DKI Jakarta, Indonesia
- Muhammad Khaerul Naim Mursalim
  - Institut Teknologi Bandung School of Electrical Engineering and Informatics, Jl. Ganesa 10, Bandung, Jawa Barat, Indonesia; Universitas Universal, Kompleks Maha Vihara Duta Maitreya Bukit Beruntung, Sei Panas Batam, Kepulauan Riau 29456, Indonesia
- Rukman Hertadi
  - Institut Teknologi Bandung Faculty of Math and Natural Sciences, Jl. Ganesa 10, Bandung, Jawa Barat, Indonesia
- Ayu Purwarianti
  - Institut Teknologi Bandung School of Electrical Engineering and Informatics, Jl. Ganesa 10, Bandung, Jawa Barat, Indonesia; Center for Artificial Intelligence (U-CoE AI-VLB), Institut Teknologi Bandung, Bandung, Indonesia
- Tati LE Rajab
  - Institut Teknologi Bandung School of Electrical Engineering and Informatics, Jl. Ganesa 10, Bandung, Jawa Barat, Indonesia
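The three scores quoted in the abstract of entry 1 (accuracy, F1-score, and Matthews correlation coefficient) can all be derived from a binary confusion matrix. The following is a minimal sketch of those standard formulas, not the authors' evaluation code; the function name `binary_metrics` and the confusion-matrix counts are hypothetical and chosen only for illustration.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Return accuracy, F1 and MCC for a binary confusion matrix.

    tp, fp, tn, fn are counts of true positives, false positives,
    true negatives and false negatives. Undefined ratios fall back to 0.0.
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)

    # MCC uses all four cells, so it stays informative on imbalanced data.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    return {"accuracy": accuracy, "f1": f1, "mcc": mcc}


# Hypothetical counts, not taken from any of the listed papers.
example = binary_metrics(tp=60, fp=6, tn=30, fn=4)
print(example)
```

Because the Matthews correlation coefficient penalizes errors in both classes, it is typically lower than accuracy or F1 on class-imbalanced test sets, which is consistent with the pattern of scores reported in the abstract.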
2
Arthi R, Parameswari E, Dhevagi P, Janaki P, Parimaladevi R. Microbial alchemists: unveiling the hidden potentials of halophilic organisms for soil restoration. Environmental Science and Pollution Research International 2024:10.1007/s11356-024-33949-9. [PMID: 38877191 DOI: 10.1007/s11356-024-33949-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/20/2024] [Accepted: 06/05/2024] [Indexed: 06/16/2024]
Abstract
Salinity, resulting from various contaminants, is a major concern for global crop cultivation. Soil salinity results in increased osmotic stress, oxidative stress, specific ion toxicity, nutrient deficiency in plants, groundwater contamination, and negative impacts on biogeochemical cycles. Leaching, the prevailing remediation method, is expensive, energy-intensive, demands more fresh water, and also causes nutrient loss, which leads to infertile cropland and eutrophication of water bodies. Moreover, in soils co-contaminated with persistent organic pollutants, heavy metals, and textile dyes, leaching techniques may not be effective. This promotes the adoption of microbial remediation as an effective and eco-friendly method. Common microbes such as Pseudomonas, Trichoderma, and Bacillus often struggle to survive in high-saline conditions due to osmotic stress, ion imbalance, and protein denaturation. Halophiles, capable of withstanding high-saline conditions, exhibit a remarkable ability to utilize a broad spectrum of organic pollutants as carbon sources and restore the polluted environment. Furthermore, halophiles can enhance plant growth under stress conditions and produce vital bio-enzymes. Halophilic microorganisms can contribute to increasing soil microbial diversity, degrading pollutants, stabilizing soil structure, participating in nutrient dynamics and biogeochemical cycles, and enhancing soil fertility and crop growth. This review provides an in-depth analysis of pollutant degradation, salt-tolerating mechanisms, and plant-soil-microbe interaction and offers a holistic perspective on their potential for soil restoration.
Affiliation(s)
- Ravichandran Arthi
  - Department of Environmental Science, Tamil Nadu Agricultural University, Coimbatore, India
- Periyasamy Dhevagi
  - Department of Environmental Science, Tamil Nadu Agricultural University, Coimbatore, India
- Ponnusamy Janaki
  - Nammazhvar Organic Farming Research Centre, Tamil Nadu Agricultural University, Coimbatore, India
- Rathinasamy Parimaladevi
  - Department of Bioenergy, Agrl. Engineering College & Research Institute, Tamil Nadu Agricultural University, Coimbatore, India
3
Susanty M, Naim Mursalim MK, Hertadi R, Purwarianti A, Rajab TLE. Classifying alkaliphilic proteins using embeddings from protein language model. Comput Biol Med 2024; 173:108385. [PMID: 38547659 DOI: 10.1016/j.compbiomed.2024.108385] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/13/2023] [Revised: 03/22/2024] [Accepted: 03/24/2024] [Indexed: 04/17/2024]
Abstract
Alkaliphilic proteins have great potential as biocatalysts in biotechnology, especially for enzyme engineering. Extensive research has focused on exploring the enzymatic potential of alkaliphiles and characterizing alkaliphilic proteins. However, the current method for identifying these proteins requires wet-lab experiments, which are time-consuming, labor-intensive, and expensive. Therefore, the development of a computational method for alkaliphilic protein identification would be invaluable for protein engineering and design. In this study, we present a novel approach that uses embeddings from a protein language model called ESM-2(3B) in a deep learning framework to classify alkaliphilic and non-alkaliphilic proteins. To our knowledge, this is the first attempt to employ embeddings from a pre-trained protein language model to classify alkaliphilic proteins. A reliable dataset comprising 1,002 alkaliphilic and 1,866 non-alkaliphilic proteins was constructed for training and testing the proposed model. The proposed model, dubbed ALPACA, achieves performance scores of 0.88, 0.84, and 0.75 for accuracy, F1-score, and Matthews correlation coefficient, respectively, on an independent dataset. ALPACA is likely to serve as a valuable resource for exploring protein alkalinity and its role in protein design and engineering.
Affiliation(s)
- Meredita Susanty
  - Institut Teknologi Bandung School of Electrical Engineering and Informatics, Jl. Ganesa 10, Bandung, Jawa Barat, Indonesia; Universitas Pertamina, School of Computer Science, Jl Teuku Nyak Arief, Jakarta Selatan, DKI Jakarta, Indonesia
- Muhammad Khaerul Naim Mursalim
  - Institut Teknologi Bandung School of Electrical Engineering and Informatics, Jl. Ganesa 10, Bandung, Jawa Barat, Indonesia; Universitas Universal, Kompleks Maha Vihara Duta Maitreya Bukit Beruntung, Sei Panas Batam, 29456, Kepulauan Riau, Indonesia
- Rukman Hertadi
  - Institut Teknologi Bandung Faculty of Math and Natural Sciences, Jl. Ganesa 10, Bandung, Jawa Barat, Indonesia
- Ayu Purwarianti
  - Institut Teknologi Bandung School of Electrical Engineering and Informatics, Jl. Ganesa 10, Bandung, Jawa Barat, Indonesia; Center for Artificial Intelligence (U-CoE AI-VLB), Institut Teknologi Bandung, Bandung, Indonesia
- Tati LE Rajab
  - Institut Teknologi Bandung School of Electrical Engineering and Informatics, Jl. Ganesa 10, Bandung, Jawa Barat, Indonesia
4
Dou Y, Meng W. Comparative analysis of weka-based classification algorithms on medical diagnosis datasets. Technol Health Care 2023; 31:397-408. [PMID: 37066939 DOI: 10.3233/thc-236034] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Indexed: 04/18/2023]
Abstract
BACKGROUND: With the advent of 5G and the era of Big Data, the rapid development of medical information technology around the world, the massive application of electronic medical records and cases, and the digitization of medical equipment and instruments, a large amount of data has accumulated in the database system of hospitals, which includes clinical diagnosis data and hospital management data.
OBJECTIVE: This study aimed to examine the classification effects of different machine learning algorithms on medical datasets so as to better explore the value of machine learning methods in aiding medical diagnosis.
METHODS: The classification datasets of four different medical fields in the University of California Irvine machine learning database were used as the research object. Also, six categories of classification models based on the Bayesian theorem idea, integrated learning idea, and rule-based and tree-based idea were constructed using the Weka platform.
RESULTS: The between-group experiments showed that the Random Forest algorithm achieved the best results on the Indian liver disease patient dataset (ILPD), delivery cardiotocography (CADG), and lymphatic tractography (LYMP) datasets, followed by Bagging and partition and regression tree. In the within-group algorithm comparison experiments, the Bagging algorithm achieved better results than other algorithms based on the integration idea for 11 metrics on all datasets, mainly focusing on 2 binary datasets. Logit Boost had only 7 metrics with significant performance, and the best algorithm was Rotation Forest, with 28 metrics achieving optimal values. Among the algorithms based on tree ideas, the logistic model tree algorithm achieved optimal results on all metrics on the mammographic dataset (MAGR). The classification performance of BFTree, J48, and Random Tree was poor on each dataset. The best algorithm was Random Forest on the ILPD, CADG, and LYMP datasets with 27 metrics reaching the optimum.
CONCLUSION: Machine learning algorithms have good application value in disease prediction and can provide a reference basis for disease diagnosis.
Affiliation(s)
- Yifeng Dou
  - Network Information Center, Tianjin Baodi Hospital, Tianjin, China
  - Baodi Clinical College, Tianjin Medical University, Tianjin, China
- Wentao Meng
  - Network Information Center, Tianjin Baodi Hospital, Tianjin, China
  - Baodi Clinical College, Tianjin Medical University, Tianjin, China
5
Garabaghi FH, Benzer R, Benzer S, Günal Ç. Effect of polynomial, radial basis, and Pearson VII function kernels in support vector machine algorithm for classification of crayfish. Ecol Inform 2022. [DOI: 10.1016/j.ecoinf.2022.101911] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/13/2022]
6
Integrating Qualitative Comparative Analysis and Support Vector Machine Methods to Reduce Passengers’ Resistance to Biometric E-Gates for Sustainable Airport Operations. Sustainability 2019. [DOI: 10.3390/su11195349] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Indexed: 11/16/2022]
Abstract
To maintain sustainable airport operations, biometric e-gate security systems have started receiving significant attention from managers of airports around the world. How to reduce flight passengers’ perceived resistance to the biometric e-gate security system has therefore become more important than ever. In this sense, the purpose of this study is to analyze the factors that contribute to passengers’ resistance to adopting biometric e-gate technology within the airport security setting. Our focus lies on exploring the effects that perceived risks and benefits, as well as user characteristics and propagation mechanisms, had on causing such resistance. With survey data from 339 airport users, a support vector machine (SVM) model was implemented to provide a tool for classifying resistance causes correctly, and csQCA (crisp set Qualitative Comparative Analysis) was implemented in order to understand the complex underlying causes. The results showed that the presence of perceived risks and the absence of perceived benefits were the main contributing factors, with propagation mechanisms also showing a significant effect on weak and strong resistance. This study is distinct in that it has attempted to explore innovation adoption through the lens of resistance and in doing so has uncovered important complex causation conditions that need to be considered before service quality can be enhanced within airports. This study’s implications should therefore help steer airport managers in the right direction towards maintaining service quality while implementing sustainable new technologies within their current airport security ecosystem.
7
Ruiz-Blanco YB, Agüero-Chapin G, García-Hernández E, Álvarez O, Antunes A, Green J. Exploring general-purpose protein features for distinguishing enzymes and non-enzymes within the twilight zone. BMC Bioinformatics 2017; 18:349. [PMID: 28732462 PMCID: PMC5521120 DOI: 10.1186/s12859-017-1758-x] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.1] [Received: 02/07/2017] [Accepted: 07/13/2017] [Indexed: 11/10/2022] Open
Affiliation(s)
- Yasser B Ruiz-Blanco
  - Facultad de Química y Farmacia, Universidad Central "Marta Abreu" de Las Villas, 54830, Santa Clara, Cuba; Theoretical Chemistry, Max-Planck-Institut für Kohlenforschung, 45470, Mülheim an der Ruhr, Germany
- Guillermin Agüero-Chapin
  - CIMAR/CIIMAR, Centro Interdisciplinar de Investigação Marinha e Ambiental, Universidade do Porto, Terminal de Cruzeiros do Porto de Leixões, Av. General Norton de Matos, s/n, 4450-208, Porto, Portugal; Centro de Bioactivos Químicos (CBQ), Universidad Central "Marta Abreu" de Las Villas (UCLV), 54830, Santa Clara, Cuba; Departamento de Biologia, Faculdade de Ciências, Universidade do Porto, Rua do Campo Alegre, 4169-007, Porto, Portugal
- Enrique García-Hernández
  - Instituto de Química, Universidad Nacional Autónoma de México (UNAM), 04360, D.F, México, Mexico
- Orlando Álvarez
  - Centro de Bioactivos Químicos (CBQ), Universidad Central "Marta Abreu" de Las Villas (UCLV), 54830, Santa Clara, Cuba
- Agostinho Antunes
  - CIMAR/CIIMAR, Centro Interdisciplinar de Investigação Marinha e Ambiental, Universidade do Porto, Terminal de Cruzeiros do Porto de Leixões, Av. General Norton de Matos, s/n, 4450-208, Porto, Portugal; Departamento de Biologia, Faculdade de Ciências, Universidade do Porto, Rua do Campo Alegre, 4169-007, Porto, Portugal
- James Green
  - Department of Systems and Computer Engineering, Carleton University, Ottawa, Canada
8
Kaur B, Joshi G, Vig R. Indian sign language recognition using Krawtchouk moment-based local features. The Imaging Science Journal 2017. [DOI: 10.1080/13682199.2017.1311524] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Indexed: 10/19/2022]
9
Lower Order Krawtchouk Moment-Based Feature-Set for Hand Gesture Recognition. Advances in Human-Computer Interaction 2016. [DOI: 10.1155/2016/6727806] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.3] [Indexed: 11/17/2022] Open
Abstract
The capability of lower order Krawtchouk moment-based shape features has been analyzed. The behaviour of 1D and 2D Krawtchouk polynomials at lower orders is observed by varying Region of Interest (ROI). The paper measures the effectiveness of shape recognition capability of 2D Krawtchouk features at lower orders on the basis of Jochen-Triesch’s database and hand gesture database of 10 Indian Sign Language (ISL) alphabets. Comparison of original and reduced feature-set is also done. Experimental results demonstrate that the reduced feature dimensionality gives competent accuracy as compared to the original feature-set for all the proposed classifiers. Thus, the Krawtchouk moment-based features prove to be effective in terms of shape recognition capability at lower orders.
10
Insights into the sequence parameters for halophilic adaptation. Amino Acids 2015; 48:751-762. [PMID: 26520112 DOI: 10.1007/s00726-015-2123-x] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.2] [Received: 06/19/2015] [Accepted: 10/20/2015] [Indexed: 01/04/2023]
Abstract
The sequence parameters for halophilic adaptation are still not fully understood. To understand the molecular basis of protein hypersaline adaptation, a detailed analysis was carried out to investigate the likely association of protein sequence attributes with halophilic adaptation. A two-stage strategy was implemented. In the first stage, a supervised machine learning classifier was built, giving an overall accuracy of 86% on stratified tenfold cross-validation and 90% on a blind testing set, which are better than previously reported results. The second stage consists of statistical analysis of sequence features and possible extraction of halophilic molecular signatures. The results of this study showed that halophilic proteins are characterized by lower average charge, lower K content, and lower S content. A statistically significant preference/avoidance list of sequence parameters is also reported, giving insights into the molecular basis of halophilic adaptation. D, Q, E, H, P, T, V are significantly preferred while N, C, I, K, M, F, S are significantly avoided. Among amino acid physicochemical groups, small, polar, charged, acidic and hydrophilic groups are preferred over other groups. The halophilic proteins also showed a preference for higher average flexibility and higher average polarity, and an avoidance of higher average positive charge, average bulkiness and average hydrophobicity. Some interesting trends observed in dipeptide counts are also reported. Further, a systematic statistical comparison was undertaken to gain insights into the sequence feature distribution in different residue structural states. The current analysis may clarify the mechanism of halophilic adaptation, which can be further used for the rational design of halophilic proteins.
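Sequence attributes of the kind discussed in the abstract of entry 10 (per-residue content such as K or S content, and average charge) can be computed directly from a protein sequence. The sketch below is an illustration under simplifying assumptions, not the method used in the paper: the helper names `composition` and `average_charge` are hypothetical, the charge model counts only K and R as +1 and D and E as -1 (histidine and pH effects are ignored), and the toy sequence is made up.

```python
from collections import Counter

# The 20 standard amino acids, one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(sequence):
    """Return the fraction of each standard amino acid in the sequence."""
    counts = Counter(r for r in sequence.upper() if r in AMINO_ACIDS)
    n = sum(counts.values())
    if n == 0:
        raise ValueError("sequence contains no standard amino acids")
    return {aa: counts.get(aa, 0) / n for aa in AMINO_ACIDS}


def average_charge(sequence):
    """Crude per-residue net charge: K, R count as +1; D, E count as -1.

    Assumption for illustration only: histidine, termini and pH
    dependence are ignored.
    """
    residues = [r for r in sequence.upper() if r in AMINO_ACIDS]
    if not residues:
        raise ValueError("sequence contains no standard amino acids")
    positive = sum(r in "KR" for r in residues)
    negative = sum(r in "DE" for r in residues)
    return (positive - negative) / len(residues)


# Made-up toy sequence, rich in acidic residues.
toy = "MDEEDKSTVVDE"
comp = composition(toy)
print(comp["D"], comp["K"], average_charge(toy))
```

Feature vectors built this way (20 composition values plus a few aggregate properties) are a common input for the kind of supervised classifier the abstract describes.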
11
RETRACTED: Identifying halophilic proteins based on random forests with preprocessing of the pseudo-amino acid composition. J Theor Biol 2014; 361:175-81. [DOI: 10.1016/j.jtbi.2014.07.017] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Received: 03/21/2014] [Revised: 07/14/2014] [Accepted: 07/15/2014] [Indexed: 01/07/2023]