1
Gillani M, Pollastri G. Protein subcellular localization prediction tools. Comput Struct Biotechnol J 2024; 23:1796-1807. [PMID: 38707539 PMCID: PMC11066471 DOI: 10.1016/j.csbj.2024.04.032] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/13/2024] [Revised: 04/11/2024] [Accepted: 04/11/2024] [Indexed: 05/07/2024] Open
Abstract
Protein subcellular localization prediction is of great significance in bioinformatics and biological research. Because most proteins do not have experimentally determined localization information, computational prediction methods and tools have been an active research area for more than two decades. Knowledge of the subcellular location of a protein provides valuable information about its functions, the functioning of the cell, and its possible interactions with other proteins. Fast, reliable, and accurate predictors provide platforms to harness the abundance of sequence data to predict subcellular locations. During the last decade, there has been a considerable amount of research effort aimed at developing subcellular localization predictors. This paper reviews recent subcellular localization prediction tools in the Eukaryotic, Prokaryotic, and Virus-based categories, followed by a detailed analysis. Each predictor is discussed based on its main features, strengths, weaknesses, algorithms used, prediction techniques, and analysis. This review is supported by prediction tool taxonomies that highlight their relevant areas and examples for straightforward categorization and ease of understanding. These taxonomies help users find suitable tools according to their needs. Furthermore, recent research gaps and challenges are discussed to cover areas that need the utmost attention. This survey provides an in-depth analysis of the most recent prediction tools and can be considered a quick guide for researchers to identify and explore recent advances in the literature.
Affiliation(s)
- Maryam Gillani
  - School of Computer Science, University College Dublin (UCD), Dublin, D04 V1W8, Ireland
- Gianluca Pollastri
  - School of Computer Science, University College Dublin (UCD), Dublin, D04 V1W8, Ireland
2
Han K, Liu X, Sun G, Wang Z, Shi C, Liu W, Huang M, Liu S, Guo Q. Enhancing subcellular protein localization mapping analysis using Sc2promap utilizing attention mechanisms. Biochim Biophys Acta Gen Subj 2024; 1868:130601. [PMID: 38522679 DOI: 10.1016/j.bbagen.2024.130601] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/07/2023] [Revised: 02/17/2024] [Accepted: 03/15/2024] [Indexed: 03/26/2024]
Abstract
BACKGROUND Aberrant protein localization is a prominent feature in many human diseases and can have detrimental effects on the function of specific tissues and organs. High-throughput technologies, which continue to advance with iterations of automated equipment and the development of bioinformatics, enable the acquisition of large-scale data that are more pattern-rich, allowing for the use of a wider range of methods to extract useful patterns and knowledge from them. METHODS The proposed Sc2promap (Spatial and Channel for SubCellular Protein Localization Mapping) model is designed to extract meaningful features from a vast repository of single-channel grayscale protein images for protein localization analysis and clustering. Sc2promap incorporates a prediction head component enriched with supplementary protein annotations and integrates a spatial-channel attention mechanism within the encoder, enabling the generation of high-resolution protein localization maps that encapsulate the fundamental characteristics of cells, including elemental cellular localizations such as nuclear and non-nuclear domains. RESULTS Qualitative and quantitative comparisons were conducted across internal and external clustering evaluation metrics, as well as various facets of the clustering results. The study also explored different components of the model. The research outcomes conclusively indicate that, in comparison to previous methods, Sc2promap exhibits superior performance. CONCLUSIONS The combination of the attention mechanism and prediction head components has led the model to excel in protein localization clustering and analysis tasks. GENERAL SIGNIFICANCE The model effectively enhances the capability to extract features and knowledge from protein fluorescence images.
Affiliation(s)
- Kaitai Han
  - Academy of Artificial Intelligence, Beijing Institute of Petrochemical Technology, Beijing 102617, China
- Xi Liu
  - Academy of Artificial Intelligence, Beijing Institute of Petrochemical Technology, Beijing 102617, China
- Guocheng Sun
  - Academy of Artificial Intelligence, Beijing Institute of Petrochemical Technology, Beijing 102617, China
- Zijun Wang
  - Academy of Artificial Intelligence, Beijing Institute of Petrochemical Technology, Beijing 102617, China
- Chaojing Shi
  - Academy of Artificial Intelligence, Beijing Institute of Petrochemical Technology, Beijing 102617, China
- Wu Liu
  - Academy of Artificial Intelligence, Beijing Institute of Petrochemical Technology, Beijing 102617, China
- Mengyuan Huang
  - Academy of Artificial Intelligence, Beijing Institute of Petrochemical Technology, Beijing 102617, China
- Shitou Liu
  - Academy of Artificial Intelligence, Beijing Institute of Petrochemical Technology, Beijing 102617, China
- Qianjin Guo
  - Academy of Artificial Intelligence, Beijing Institute of Petrochemical Technology, Beijing 102617, China
3
Moon J, Hu G, Hayashi T. Application of Machine Learning in the Quantitative Analysis of the Surface Characteristics of Highly Abundant Cytoplasmic Proteins: Toward AI-Based Biomimetics. Biomimetics (Basel) 2024; 9:162. [PMID: 38534847 DOI: 10.3390/biomimetics9030162] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/14/2023] [Revised: 02/12/2024] [Accepted: 02/29/2024] [Indexed: 03/28/2024] Open
Abstract
Proteins in the crowded environment of human cells have often been studied regarding nonspecific interactions, misfolding, and aggregation, which may cause cellular malfunction and disease. Specifically, proteins with high abundance are more susceptible to these issues due to the law of mass action. Therefore, the surfaces of highly abundant cytoplasmic (HAC) proteins directly exposed to the environment can exhibit specific physicochemical, structural, and geometrical characteristics that reduce nonspecific interactions and adapt to the environment. However, the quantitative relationships between the overall surface descriptors still need clarification. Here, we used machine learning to identify HAC proteins using hydrophobicity, charge, roughness, secondary structures, and B-factor from the protein surfaces and quantified the contribution of each descriptor. First, several supervised learning algorithms were compared to solve binary classification problems for the surfaces of HAC and extracellular proteins. Then, logistic regression was used for the feature importance analysis of descriptors considering model performance (80.2% accuracy and 87.6% AUC) and interpretability. The HAC proteins showed positive correlations with negatively and positively charged areas but negative correlations with hydrophobicity, the B-factor, the proportion of beta structures, roughness, and the proportion of disordered regions. Finally, the details of each descriptor could be explained concerning adaptative surface strategies of HAC proteins to regulate nonspecific interactions, protein folding, flexibility, stability, and adsorption. This study presented a novel approach using various surface descriptors to identify HAC proteins and provided quantitative design rules for the surfaces well-suited to human cellular crowded environments.
Affiliation(s)
- Jooa Moon
  - Department of Materials Science and Engineering, School of Materials and Chemical Technology, Tokyo Institute of Technology, Yokohama 226-8502, Japan
- Guanghao Hu
  - Department of Materials Science and Engineering, School of Materials and Chemical Technology, Tokyo Institute of Technology, Yokohama 226-8502, Japan
- Tomohiro Hayashi
  - Department of Materials Science and Engineering, School of Materials and Chemical Technology, Tokyo Institute of Technology, Yokohama 226-8502, Japan
  - The Institute for Solid State Physics, The University of Tokyo, Kashiwa 277-0882, Japan
4
Wang C, Wang Y, Ding P, Li S, Yu X, Yu B. ML-FGAT: Identification of multi-label protein subcellular localization by interpretable graph attention networks and feature-generative adversarial networks. Comput Biol Med 2024; 170:107944. [PMID: 38215617 DOI: 10.1016/j.compbiomed.2024.107944] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/08/2023] [Revised: 12/08/2023] [Accepted: 01/01/2024] [Indexed: 01/14/2024]
Abstract
The prediction of multi-label protein subcellular localization (SCL) is a pivotal area in bioinformatics research. Recent advancements in protein structure research have facilitated the application of graph neural networks. This paper introduces a novel approach termed ML-FGAT. The approach begins by extracting node information of proteins from sequence data, physical-chemical properties, evolutionary insights, and structural details. Subsequently, various evolutionary techniques are integrated to consolidate multi-view information. A linear discriminant analysis framework, grounded on entropy weight, is then employed to reduce the dimensionality of the merged features. To enhance the robustness of the model, the training dataset is augmented using feature-generative adversarial networks. For the primary prediction step, graph attention networks are employed to determine multi-label protein SCL, leveraging both node and neighboring information. The interpretability is enhanced by analyzing the attention weight parameters. The training is based on the Gram-positive bacteria dataset, while validation employs newly constructed datasets: human, virus, Gram-negative bacteria, plant, and SARS-CoV-2. Following a leave-one-out cross-validation procedure, ML-FGAT demonstrates noteworthy superiority in this domain.
Affiliation(s)
- Congjing Wang
  - College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, 266061, China; School of Data Science, Qingdao University of Science and Technology, Qingdao, 266061, China
- Yifei Wang
  - College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, 266061, China; School of Data Science, Qingdao University of Science and Technology, Qingdao, 266061, China
- Pengju Ding
  - College of Information Science and Technology, Qingdao University of Science and Technology, Qingdao, 266061, China
- Shan Li
  - School of Mathematics and Statistics, Central South University, Changsha, 410083, China
- Xu Yu
  - Qingdao Institute of Software, College of Computer Science and Technology, China University of Petroleum, Qingdao, 266580, China
- Bin Yu
  - School of Data Science, Qingdao University of Science and Technology, Qingdao, 266061, China; School of Data Science, University of Science and Technology of China, Hefei, 230027, China
5
MSResG: Using GAE and Residual GCN to Predict Drug-Drug Interactions Based on Multi-source Drug Features. Interdiscip Sci 2023; 15:171-188. [PMID: 36646843 DOI: 10.1007/s12539-023-00550-6] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 09/15/2022] [Revised: 01/05/2023] [Accepted: 01/07/2023] [Indexed: 01/18/2023]
Abstract
A drug-drug interaction occurs when taking two drugs together produces a reaction that may threaten the patient's health or enhance efficacy in a way that is helpful for medical work. It is therefore necessary to study and predict such interactions. Traditional experimental methods can be used for drug-drug interaction prediction, but they are time-consuming and costly, so we prefer more accurate and convenient computational methods to predict unknown drug-drug interactions. In this paper, we propose a deep learning framework called MSResG that considers multi-source features of drugs and combines them with a graph auto-encoder for prediction. First, the model obtains four feature representations of drugs from the database, namely chemical substructure, target, pathway, and enzyme, and then calculates the Jaccard similarity of the drugs. To balance the different drug features, we integrate the similarities by taking their mean. The comprehensive similarity network is then combined with the drug interaction network, and the result is encoded and decoded using a graph auto-encoder based on a residual graph convolutional network. Encoding learns the latent feature vectors of drugs, which contain similarity information and interaction information. Decoding reconstructs the network to predict unknown drug-drug interactions. The experimental results show that our model has advanced performance and is superior to other existing advanced methods. A case study also shows that MSResG has practical significance.
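The similarity step described in this abstract (Jaccard similarity per feature type, then integration by the mean) can be illustrated with a minimal sketch. It is not the authors' implementation: the drug names, feature values, and the function names `jaccard` and `integrated_similarity` are hypothetical, and each feature is assumed to be representable as a set.

```python
from itertools import combinations

# Hypothetical toy data: each drug has four feature sets, as named in the abstract.
DRUGS = {
    "drugA": {
        "substructure": {1, 2, 3},
        "target": {"T1", "T2"},
        "pathway": {"P1"},
        "enzyme": {"E1", "E2"},
    },
    "drugB": {
        "substructure": {2, 3, 4},
        "target": {"T1"},
        "pathway": {"P1"},
        "enzyme": {"E3"},
    },
}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |a & b| / |a | b|, defined as 0.0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def integrated_similarity(drugs: dict) -> dict:
    """Mean of the per-feature Jaccard similarities for every drug pair."""
    result = {}
    for d1, d2 in combinations(sorted(drugs), 2):
        # Only compare feature types that both drugs have (assumption of this sketch).
        feature_types = sorted(set(drugs[d1]) & set(drugs[d2]))
        scores = [jaccard(drugs[d1][f], drugs[d2][f]) for f in feature_types]
        result[(d1, d2)] = sum(scores) / len(scores) if scores else 0.0
    return result


sim = integrated_similarity(DRUGS)
print(sim)
```

For the toy pair, the per-feature similarities are 0.5, 0.5, 1.0, and 0.0, so the integrated value is their mean, 0.5. An unweighted mean treats all four feature types as equally informative, which matches the abstract's stated goal of balancing the features.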