1
Alanazi W, Meng D, Pollastri G. Advancements in one-dimensional protein structure prediction using machine learning and deep learning. Comput Struct Biotechnol J 2025; 27:1416-1430. [PMID: 40242292 PMCID: PMC12002955 DOI: 10.1016/j.csbj.2025.04.005] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/13/2025] [Revised: 04/01/2025] [Accepted: 04/02/2025] [Indexed: 04/18/2025] Open
Abstract
The accurate prediction of protein structures remains a cornerstone challenge in structural bioinformatics, essential for understanding the intricate relationship between protein sequence, structure, and function. Recent advancements in Machine Learning (ML) and Deep Learning (DL) have revolutionized this field, offering innovative approaches to tackle one-dimensional (1D) protein structure annotations, including secondary structure, solvent accessibility, and intrinsic disorder. This review highlights the evolution of predictive methodologies, from early machine learning models to sophisticated deep learning frameworks that integrate sequence embeddings and pretrained language models. Key advancements, such as AlphaFold's transformative impact on structure prediction and the rise of protein language models (PLMs), have enabled unprecedented accuracy in capturing sequence-structure relationships. Furthermore, we explore the role of specialized datasets, benchmarking competitions, and multimodal integration in shaping state-of-the-art prediction models. By addressing challenges in data quality, scalability, interpretability, and task-specific optimization, this review underscores the transformative impact of ML, DL, and PLMs on 1D protein prediction while providing insights into emerging trends and future directions in this rapidly evolving field.
Affiliation(s)
- Wafa Alanazi
  - School of Computer Science, University College Dublin, Belfield, Dublin D04 C1P1, Ireland
  - Department of Computer Science, College of Science, Northern Border University, Arar, Saudi Arabia
- Di Meng
  - School of Computer Science, University College Dublin, Belfield, Dublin D04 C1P1, Ireland
- Gianluca Pollastri
  - School of Computer Science, University College Dublin, Belfield, Dublin D04 C1P1, Ireland
2
Yuan GH, Li J, Yang Z, Chen YQ, Yuan Z, Chen T, Ouyang W, Dong N, Yang L. Deep generative model for protein subcellular localization prediction. Brief Bioinform 2025; 26:bbaf152. [PMID: 40211979 PMCID: PMC11986326 DOI: 10.1093/bib/bbaf152] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/14/2024] [Revised: 03/18/2025] [Accepted: 03/19/2025] [Indexed: 04/14/2025] Open
Abstract
A protein's sequence not only determines its structure but also provides important clues to its subcellular localization. Although a series of artificial intelligence models have been reported to predict protein subcellular localization, most of them provide only textual outputs. Here, we present deepGPS, a deep generative model for protein subcellular localization prediction. After training with protein primary sequences and fluorescence images, deepGPS shows the ability to predict cytoplasmic and nuclear localizations by reporting both textual labels and generated images as outputs. In addition, cell-type-specific deepGPS models can be developed by using distinct image datasets from different cell lines for comparative analyses. Moreover, deepGPS shows potential to be further extended to other specific organelles, such as vesicles and the endoplasmic reticulum, even with limited volumes of training data. Finally, the openGPS website (https://bits.fudan.edu.cn/opengps) is constructed to provide a publicly accessible and user-friendly platform for studying protein subcellular localization and function.
Affiliation(s)
- Guo-Hua Yuan
  - Center for Molecular Medicine, Children’s Hospital of Fudan University and Shanghai Key Laboratory of Medical Epigenetics, International Laboratory of Medical Epigenetics and Metabolism, Ministry of Science and Technology, Institutes of Biomedical Sciences, Fudan University, 131 Dongan Road, Xuhui District, Shanghai 200032, China
- Jinzhe Li
  - Shanghai Artificial Intelligence Laboratory, 129 Longwen Road, Xuhui District, Shanghai 200232, China
  - School of Information Science and Technology, Fudan University, 2005 Songhu Road, Yangpu District, Shanghai 200433, China
- Zejun Yang
  - Shanghai Artificial Intelligence Laboratory, 129 Longwen Road, Xuhui District, Shanghai 200232, China
- Yao-Qi Chen
  - Center for Molecular Medicine, Children’s Hospital of Fudan University and Shanghai Key Laboratory of Medical Epigenetics, International Laboratory of Medical Epigenetics and Metabolism, Ministry of Science and Technology, Institutes of Biomedical Sciences, Fudan University, 131 Dongan Road, Xuhui District, Shanghai 200032, China
- Zhonghang Yuan
  - Shanghai Artificial Intelligence Laboratory, 129 Longwen Road, Xuhui District, Shanghai 200232, China
- Tao Chen
  - School of Information Science and Technology, Fudan University, 2005 Songhu Road, Yangpu District, Shanghai 200433, China
- Wanli Ouyang
  - Shanghai Artificial Intelligence Laboratory, 129 Longwen Road, Xuhui District, Shanghai 200232, China
- Nanqing Dong
  - Shanghai Artificial Intelligence Laboratory, 129 Longwen Road, Xuhui District, Shanghai 200232, China
  - Shanghai Innovation Institute, 699 Huafa Road, Xuhui District, Shanghai 200231, China
- Li Yang
  - Center for Molecular Medicine, Children’s Hospital of Fudan University and Shanghai Key Laboratory of Medical Epigenetics, International Laboratory of Medical Epigenetics and Metabolism, Ministry of Science and Technology, Institutes of Biomedical Sciences, Fudan University, 131 Dongan Road, Xuhui District, Shanghai 200032, China
3
Hu G, Moon J, Hayashi T. Protein Classes Predicted by Molecular Surface Chemical Features: Machine Learning-Assisted Classification of Cytosol and Secreted Proteins. J Phys Chem B 2024; 128:8423-8436. [PMID: 39185763 PMCID: PMC11382266 DOI: 10.1021/acs.jpcb.4c02461] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/27/2024]
Abstract
The chemical structures of protein surfaces govern intermolecular interactions, and protein functions include specific molecular recognition, transport, and self-assembly. Therefore, the relationship between surface chemical structure and protein function provides insight into the mechanisms underlying protein functions and into the development of new biomaterials. In this study, we analyze protein surface features, including surface amino acid populations and secondary structure ratios, instead of entire sequences as input for the classifier, intending to provide deeper insights into the determination of protein classes (cytosol or secreted). We employed a random forest-based classifier for the prediction of protein locations. Our training and testing data sets, consisting of secreted and cytosol proteins, were constructed using filtered information from UniProt and 3D structures from AlphaFold. The classifier achieved a testing accuracy of 93.9% and yielded a feature importance ranking and quantitative boundary values for the top three features. We discuss the significance of these features quantitatively and the hidden rules that determine the protein classes (cytosol or secreted).
Affiliation(s)
- Guanghao Hu
  - Department of Materials Science and Engineering, School of Materials Science and Chemical Technology, Tokyo Institute of Technology, 4259 Nagatsuta-cho, Midori-ku, Yokohama-shi, Kanagawa-ken 226-8502, Japan
- Jooa Moon
  - Department of Materials Science and Engineering, School of Materials Science and Chemical Technology, Tokyo Institute of Technology, 4259 Nagatsuta-cho, Midori-ku, Yokohama-shi, Kanagawa-ken 226-8502, Japan
- Tomohiro Hayashi
  - Department of Materials Science and Engineering, School of Materials Science and Chemical Technology, Tokyo Institute of Technology, 4259 Nagatsuta-cho, Midori-ku, Yokohama-shi, Kanagawa-ken 226-8502, Japan
  - The Institute for Solid State Physics, The University of Tokyo, 5-1-5, Kashiwanoha, Kashiwa, Chiba 277-0882, Japan
4
Han K, Liu X, Sun G, Wang Z, Shi C, Liu W, Huang M, Liu S, Guo Q. Enhancing subcellular protein localization mapping analysis using Sc2promap utilizing attention mechanisms. Biochim Biophys Acta Gen Subj 2024; 1868:130601. [PMID: 38522679 DOI: 10.1016/j.bbagen.2024.130601] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/07/2023] [Revised: 02/17/2024] [Accepted: 03/15/2024] [Indexed: 03/26/2024]
Abstract
BACKGROUND Aberrant protein localization is a prominent feature in many human diseases and can have detrimental effects on the function of specific tissues and organs. High-throughput technologies, which continue to advance with iterations of automated equipment and the development of bioinformatics, enable the acquisition of large-scale data that are more pattern-rich, allowing for the use of a wider range of methods to extract useful patterns and knowledge from them. METHODS The proposed Sc2promap (Spatial and Channel for SubCellular Protein Localization Mapping) model is designed to extract meaningful features from a large repository of single-channel grayscale protein images for protein localization analysis and clustering. Sc2promap incorporates a prediction head enriched with supplementary protein annotations and integrates a spatial-channel attention mechanism within the encoder, enabling the generation of high-resolution protein localization maps that capture the fundamental characteristics of cells, including elemental cellular localizations such as nuclear and non-nuclear domains. RESULTS Qualitative and quantitative comparisons were conducted across internal and external clustering evaluation metrics, as well as various facets of the clustering results. The study also explored different components of the model. The research outcomes indicate that, in comparison to previous methods, Sc2promap exhibits superior performance. CONCLUSIONS The combination of the attention mechanism and prediction head components has led the model to excel in protein localization clustering and analysis tasks. GENERAL SIGNIFICANCE The model effectively enhances the capability to extract features and knowledge from protein fluorescence images.
Affiliation(s)
- Kaitai Han
  - Academy of Artificial Intelligence, Beijing Institute of Petrochemical Technology, Beijing 102617, China
- Xi Liu
  - Academy of Artificial Intelligence, Beijing Institute of Petrochemical Technology, Beijing 102617, China
- Guocheng Sun
  - Academy of Artificial Intelligence, Beijing Institute of Petrochemical Technology, Beijing 102617, China
- Zijun Wang
  - Academy of Artificial Intelligence, Beijing Institute of Petrochemical Technology, Beijing 102617, China
- Chaojing Shi
  - Academy of Artificial Intelligence, Beijing Institute of Petrochemical Technology, Beijing 102617, China
- Wu Liu
  - Academy of Artificial Intelligence, Beijing Institute of Petrochemical Technology, Beijing 102617, China
- Mengyuan Huang
  - Academy of Artificial Intelligence, Beijing Institute of Petrochemical Technology, Beijing 102617, China
- Shitou Liu
  - Academy of Artificial Intelligence, Beijing Institute of Petrochemical Technology, Beijing 102617, China
- Qianjin Guo
  - Academy of Artificial Intelligence, Beijing Institute of Petrochemical Technology, Beijing 102617, China.
5
Arora I, Kummer A, Zhou H, Gadjeva M, Ma E, Chuang GY, Ong E. mtx-COBRA: Subcellular localization prediction for bacterial proteins. Comput Biol Med 2024; 171:108114. [PMID: 38401450 DOI: 10.1016/j.compbiomed.2024.108114] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/09/2023] [Revised: 01/23/2024] [Accepted: 02/04/2024] [Indexed: 02/26/2024]
Abstract
BACKGROUND Bacteria can have beneficial effects on our health and environment; however, many are responsible for serious infectious diseases, warranting the need for vaccines against such pathogens. Bioinformatic and experimental technologies are crucial for the development of vaccines. The vaccine design pipeline requires identification of bacteria-specific antigens that can be recognized and can induce a response by the immune system upon infection. Immune system recognition is influenced by the location of a protein. Methods have been developed to determine the subcellular localization (SCL) of proteins in prokaryotes and eukaryotes. Bioinformatic tools such as PSORTb can be employed to determine SCL of proteins, which would be tedious to perform experimentally. Unfortunately, PSORTb often predicts many proteins as having an "Unknown" SCL, reducing the number of antigens to evaluate as potential vaccine targets. METHOD We present a new pipeline called subCellular lOcalization prediction for BacteRiAl Proteins (mtx-COBRA). mtx-COBRA uses Meta's protein language model, Evolutionary Scale Modeling, combined with an Extreme Gradient Boosting machine learning model to identify SCL of bacterial proteins based on amino acid sequence. This pipeline is trained on a curated dataset that combines data from UniProt and the publicly available ePSORTdb dataset. RESULTS Using benchmarking analyses, nested 5-fold cross-validation, and leave-one-pathogen-out methods, followed by testing on the held-out dataset, we show that our pipeline predicts the SCL of bacterial proteins more accurately than PSORTb. CONCLUSIONS mtx-COBRA provides an accessible pipeline that can more efficiently classify bacterial proteins with currently "Unknown" SCLs than existing bioinformatic and experimental methods.
Affiliation(s)
- Isha Arora
  - Moderna, Inc., 200 Technology Square, Cambridge, MA 02139, USA
- Arkadij Kummer
  - Moderna, Inc., 200 Technology Square, Cambridge, MA 02139, USA
- Hao Zhou
  - Moderna, Inc., 200 Technology Square, Cambridge, MA 02139, USA
- Mihaela Gadjeva
  - Moderna, Inc., 200 Technology Square, Cambridge, MA 02139, USA
- Eric Ma
  - Moderna, Inc., 200 Technology Square, Cambridge, MA 02139, USA
- Gwo-Yu Chuang
  - Moderna, Inc., 200 Technology Square, Cambridge, MA 02139, USA
- Edison Ong
  - Moderna, Inc., 200 Technology Square, Cambridge, MA 02139, USA.
6
Nielsen H, Teufel F, Brunak S, von Heijne G. SignalP: The Evolution of a Web Server. Methods Mol Biol 2024; 2836:331-367. [PMID: 38995548 DOI: 10.1007/978-1-0716-4007-4_17] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/13/2024]
Abstract
SignalP (https://services.healthtech.dtu.dk/services/SignalP-6.0/) is a very popular prediction method for signal peptides, the intrinsic signals that make proteins secretory. The SignalP web server has existed since 1995 and is now in its sixth major version. In this historical account, we (three authors who have taken part in the entire journey plus the first author of the latest version) describe the differences between the versions and discuss the various decisions taken along the way.
Affiliation(s)
- Henrik Nielsen
  - Section for Bioinformatics, Department of Health Technology, Technical University of Denmark, Kongens Lyngby, Denmark.
- Felix Teufel
  - Bioinformatics Centre, Department of Biology, University of Copenhagen, Copenhagen, Denmark
  - Digital Science & Innovation, Novo Nordisk A/S, Malov, Denmark
- Søren Brunak
  - Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark
- Gunnar von Heijne
  - Department of Biochemistry and Biophysics, Stockholm University, Stockholm, Sweden
  - Science for Life Laboratory, Stockholm University, Solna, Sweden
7
Nielsen H. Protein Sorting Prediction. Methods Mol Biol 2024; 2715:27-63. [PMID: 37930519 DOI: 10.1007/978-1-0716-3445-5_2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/07/2023]
Abstract
Many computational methods are available for predicting protein sorting in bacteria. When comparing them, it is important to know that they can be grouped into three fundamentally different approaches: signal-based, global property-based, and homology-based prediction. In this chapter, the strengths and drawbacks of each of these approaches are described through many examples of methods that predict secretion, integration into membranes, or subcellular locations in general. The aim of this chapter is to provide a user-level introduction to the field with a minimum of computational theory.
Affiliation(s)
- Henrik Nielsen
  - Department of Health Technology, Technical University of Denmark, Lyngby, Denmark.
8
Zhao Y, Yang Z, Hong Y, Yang Y, Wang L, Zhang Y, Lin H, Wang J. Protein Function Prediction With Functional and Topological Knowledge of Gene Ontology. IEEE Trans Nanobioscience 2023; 22:755-762. [PMID: 37204950 DOI: 10.1109/tnb.2023.3278033] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/21/2023]
Abstract
Gene Ontology (GO) is a widely used bioinformatics resource for describing biological processes, molecular functions, and cellular components of proteins. It covers more than 5000 terms hierarchically organized into a directed acyclic graph and known functional annotations. Automatically annotating protein functions by using GO-based computational models has been an area of active research for a long time. However, due to the limited functional annotation information and complex topological structures of GO, existing models cannot effectively capture the knowledge representation of GO. To solve this issue, we present a method that fuses the functional and topological knowledge of GO to guide protein function prediction. This method employs a multi-view GCN model to extract a variety of GO representations from functional information, topological structure, and their combinations. To dynamically learn the significance weights of these representations, it adopts an attention mechanism to learn the final knowledge representation of GO. Furthermore, it uses a pre-trained language model (i.e., ESM-1b) to efficiently learn biological features for each protein sequence. Finally, it obtains all predicted scores by calculating the dot product of sequence features and GO representation. Our method outperforms other state-of-the-art methods, as demonstrated by the experimental results on datasets from three different species, namely Yeast, Human and Arabidopsis. Our proposed method's code can be accessed at: https://github.com/Candyperfect/Master.
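The scoring step summarized in this abstract (attention-weighted fusion of several GO representations, followed by a dot product with each protein's sequence features) can be sketched in a few lines. This is a minimal pure-Python illustration, not the authors' implementation: the function names, the softmax form of the attention weights, the sigmoid applied to each score, and the toy vectors are all assumptions. In the paper, the sequence features come from ESM-1b and the GO representations from a multi-view GCN.

```python
import math


def softmax(logits):
    """Turn raw attention logits into weights that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_views(views, attention_logits):
    """Weighted sum of per-view GO-term embeddings.

    views: list of matrices, each shaped [num_terms][dim] (hypothetical views,
           e.g. functional, topological, combined).
    attention_logits: one scalar per view, assumed to be learned elsewhere.
    """
    weights = softmax(attention_logits)
    num_terms, dim = len(views[0]), len(views[0][0])
    fused = [[0.0] * dim for _ in range(num_terms)]
    for weight, view in zip(weights, views):
        for t in range(num_terms):
            for d in range(dim):
                fused[t][d] += weight * view[t][d]
    return fused


def score_terms(protein_vec, go_matrix):
    """One score per GO term: sigmoid of the dot product with the protein vector."""
    scores = []
    for term_vec in go_matrix:
        dot = sum(p * g for p, g in zip(protein_vec, term_vec))
        scores.append(1.0 / (1.0 + math.exp(-dot)))
    return scores


# Toy data only: two views, two GO terms, two-dimensional embeddings.
views = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
fused = fuse_views(views, [0.0, 0.0])      # equal logits give equal weights
scores = score_terms([2.0, -2.0], fused)   # one score per GO term
```

Scoring with a dot product keeps the number of GO terms independent of the sequence encoder: adding a term only adds a row to the GO matrix.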
9
Faiz M, Khan SJ, Azim F, Ejaz N. Disclosing the locale of transmembrane proteins within cellular alcove by machine learning approach: systematic review and meta analysis. J Biomol Struct Dyn 2023; 42:11133-11148. [PMID: 37768108 DOI: 10.1080/07391102.2023.2260490] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/21/2023] [Accepted: 09/13/2023] [Indexed: 09/29/2023]
Abstract
Protein subcellular localization is a promising research question in Proteomics and associated fields, including Biological Sciences, Biomedical Engineering, Computational Biology, Bioinformatics, Artificial Intelligence, and Biophysics. Computational techniques are preferred for exploring this attribute across a massive number of proteins. This combination has yielded a diverse set of protein location identifiers. These protein subcellular localization identifiers differ in the database used, the organisms covered, the machine learning technique, and the accuracy. Despite the availability of these identifiers, most of the work has addressed the subcellular localization of proteins in general, and less work has been done specifically on the locations of transmembrane proteins. This systematic review accounts for computational techniques implemented for transmembrane protein localization. Moreover, a literature search on PubMed, Science Direct, and IEEE Databases disclosed no systematic review or meta-analysis on the cell's transmembrane protein locale. A systematic review was conducted under the PRISMA guidelines using Science Direct, PubMed, and IEEE Databases. Journal publications from 2000 to 2023 were taken into consideration and screened. This review has focused only on computational studies rather than experimental techniques. A total of 1004 studies were reviewed and categorized as relevant or non-relevant according to inclusion and exclusion criteria. All the screening was done through Endnote after importing citations. This systematic review characterizes the gap in targeting the locale of the transmembrane protein and will aid researchers in exploring new horizons. Communicated by Ramaswamy H. Sarma.
Affiliation(s)
- Mehwish Faiz
  - Department of Biomedical Engineering, Ziauddin University (FESTM), Karachi, Pakistan
  - Department of Electrical Engineering, Ziauddin University, (FESTM), Karachi, Pakistan
- Saad Jawaid Khan
  - Department of Biomedical Engineering, Ziauddin University (FESTM), Karachi, Pakistan
- Fahad Azim
  - Department of Electrical Engineering, Ziauddin University, (FESTM), Karachi, Pakistan
- Nazia Ejaz
  - Balochistan University of Engineering and Technology, Khuzdar, Pakistan
10
Calvo Córdoba A, García Cena CE, Montoliu C. Automatic Video-Oculography System for Detection of Minimal Hepatic Encephalopathy Using Machine Learning Tools. Sensors (Basel, Switzerland) 2023; 23:8073. [PMID: 37836903 PMCID: PMC10575013 DOI: 10.3390/s23198073] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/28/2023] [Revised: 09/18/2023] [Accepted: 09/21/2023] [Indexed: 10/15/2023]
Abstract
This article presents an automatic gaze-tracker system to assist in the detection of minimal hepatic encephalopathy by analyzing eye movements with machine learning tools. To record eye movements, we used video-oculography technology and developed automatic feature-extraction software as well as a machine learning algorithm to assist clinicians in the diagnosis. In order to validate the procedure, we selected a sample (n=47) of cirrhotic patients. Approximately half of them were diagnosed with minimal hepatic encephalopathy (MHE), a common neurological impairment in patients with liver disease. Using the current gold standard, the Psychometric Hepatic Encephalopathy Score (PHES) battery, patients were classified into two groups: cirrhotic patients with MHE and those without MHE. Eye movement tests were carried out on all participants. Using classical statistical concepts, we analyzed the significance of 150 eye movement features, and the most relevant (p-values ≤ 0.05) were selected for training machine learning algorithms. To summarize, while the PHES battery is a time-consuming exploration (between 25 and 40 min per patient), requiring expert training and not amenable to longitudinal analysis, the automatic video-oculography is a simple test that takes between 7 and 10 min per patient and has a sensitivity and a specificity of 93%.
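For readers less familiar with the two figures quoted at the end of this abstract, sensitivity and specificity can be computed from binary reference labels (here, PHES-based MHE status) and binary classifier predictions as sketched below. The function name and the toy labels are illustrative assumptions, not data from the study.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels, where 1 = MHE and 0 = no MHE."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # MHE correctly detected
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # MHE missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # non-MHE correctly cleared
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # non-MHE flagged as MHE
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Toy labels only (hypothetical): 4 patients with MHE, 4 without.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
sens, spec = sensitivity_specificity(y_true, y_pred)
```

Reporting both values matters for a screening test: sensitivity is the share of MHE patients the test catches, while specificity is the share of non-MHE patients it correctly clears.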
Affiliation(s)
- Alberto Calvo Córdoba
  - Escuela Técnica Superior de Ingenieros Industriales, Center for Automation and Robotics, UPM-CSIC, Universidad Politécnica de Madrid, José Gutiérrez Abascal St., 2, 28006 Madrid, Spain
- Cecilia E. García Cena
  - Escuela Técnica Superior de Ingeniería y Diseño Industrial, Center for Automation and Robotics, UPM-CSIC, Universidad Politécnica de Madrid, Ronda de Valencia, 3, 28012 Madrid, Spain
- Carmina Montoliu
  - Instituto de Investigación Sanitaria-INCLIVA, 46010 Valencia, Spain
  - Servicio de Medicina Digestiva, Hospital Clínico de Valencia, 46010 Valencia, Spain
11
Li J, Zou Q, Yuan L. A review from biological mapping to computation-based subcellular localization. Molecular Therapy. Nucleic Acids 2023; 32:507-521. [PMID: 37215152 PMCID: PMC10192651 DOI: 10.1016/j.omtn.2023.04.015] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Indexed: 05/24/2023]
Abstract
Subcellular localization is crucial to the study of viruses and diseases. Specifically, research on protein subcellular localization can help identify clues about the relationship between viruses and host cells that can aid in the design of targeted drugs. Research on RNA subcellular localization is significant for human diseases (such as Alzheimer's disease and colon cancer). To date, only reviews addressing subcellular localization of proteins have been published, which are now outdated, and reviews of RNA subcellular localization are not comprehensive. Therefore, we collated the most up-to-date literature on protein and RNA subcellular localization to help researchers understand changes in the field. Extensive and complete methods for constructing subcellular localization models have also been summarized, which can help readers understand the changes in the application of biotechnology and computer science in subcellular localization research and explore how to use biological data to construct improved subcellular localization models. This paper is the first review to cover both protein subcellular localization and RNA subcellular localization. We urge researchers from biology and computational biology to jointly pay attention to transformation patterns, interrelationships, differences, and causality of protein subcellular localization and RNA subcellular localization.
Affiliation(s)
- Jing Li
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, 1 Chengdian Road, Quzhou, Zhejiang 324000, China
  - School of Biomedical Sciences, University of Hong Kong, Hong Kong, China
- Quan Zou
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, 1 Chengdian Road, Quzhou, Zhejiang 324000, China
- Lei Yuan
  - Department of Hepatobiliary Surgery, Quzhou People's Hospital, 100 Minjiang Main Road, Quzhou, Zhejiang 324000, China
12
Jin HL, Duan S, Zhang P, Yang Z, Zeng Y, Chen Z, Hong L, Li M, Luo L, Chang Z, Hu J, Wang HB. Dual roles for CND1 in maintenance of nuclear and chloroplast genome stability in plants. Cell Rep 2023; 42:112268. [PMID: 36933214 DOI: 10.1016/j.celrep.2023.112268] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 12/23/2021] [Revised: 12/19/2022] [Accepted: 02/28/2023] [Indexed: 03/19/2023] Open
Abstract
The coordination of chloroplast and nuclear genome status is critical for plant cell function. Here, we report that Arabidopsis CHLOROPLAST AND NUCLEUS DUAL-LOCALIZED PROTEIN 1 (CND1) maintains genome stability in the chloroplast and the nucleus. CND1 localizes to both compartments, and complete loss of CND1 results in embryo lethality. Partial loss of CND1 disturbs nuclear cell-cycle progression and photosynthetic activity. CND1 binds to nuclear pre-replication complexes and DNA replication origins and regulates nuclear genome stability. In chloroplasts, CND1 interacts with and facilitates binding of the regulator of chloroplast genome stability WHY1 to chloroplast DNA. The defects in nuclear cell-cycle progression and photosynthesis of cnd1 mutants are respectively rescued by compartment-restricted CND1 localization. Light promotes the association of CND1 with HSP90 and its import into chloroplasts. This study provides a paradigm of the convergence of genome status across organelles to coordinately regulate cell cycle to control plant growth and development.
Affiliation(s)
- Hong-Lei Jin
  - Institute of Medical Plant Physiology and Ecology, School of Pharmaceutical Sciences, Guangzhou University of Chinese Medicine, Guangzhou 510006, People's Republic of China; Guangzhou Key Laboratory of Chinese Medicine Research on Prevention and Treatment of Osteoporosis, The Third Affiliated Hospital of Guangzhou University of Chinese Medicine, No. 263, Longxi Avenue, Guangzhou, China.
- Sujuan Duan
  - Institute of Medical Plant Physiology and Ecology, School of Pharmaceutical Sciences, Guangzhou University of Chinese Medicine, Guangzhou 510006, People's Republic of China
- Pengxiang Zhang
  - Institute of Medical Plant Physiology and Ecology, School of Pharmaceutical Sciences, Guangzhou University of Chinese Medicine, Guangzhou 510006, People's Republic of China
- Ziyue Yang
  - Institute of Medical Plant Physiology and Ecology, School of Pharmaceutical Sciences, Guangzhou University of Chinese Medicine, Guangzhou 510006, People's Republic of China
- Yunping Zeng
  - Institute of Medical Plant Physiology and Ecology, School of Pharmaceutical Sciences, Guangzhou University of Chinese Medicine, Guangzhou 510006, People's Republic of China
- Ziqi Chen
  - Institute of Medical Plant Physiology and Ecology, School of Pharmaceutical Sciences, Guangzhou University of Chinese Medicine, Guangzhou 510006, People's Republic of China
- Liu Hong
  - Institute of Medical Plant Physiology and Ecology, School of Pharmaceutical Sciences, Guangzhou University of Chinese Medicine, Guangzhou 510006, People's Republic of China
- Mengshu Li
  - School of Life Sciences, Sun Yat-sen University, Guangzhou 510275, People's Republic of China
- Lujun Luo
  - School of Life Sciences, Sun Yat-sen University, Guangzhou 510275, People's Republic of China
- Zhenyi Chang
  - Institute of Medical Plant Physiology and Ecology, School of Pharmaceutical Sciences, Guangzhou University of Chinese Medicine, Guangzhou 510006, People's Republic of China
- Jiliang Hu
  - Institute of Medical Plant Physiology and Ecology, School of Pharmaceutical Sciences, Guangzhou University of Chinese Medicine, Guangzhou 510006, People's Republic of China
- Hong-Bin Wang
  - Institute of Medical Plant Physiology and Ecology, School of Pharmaceutical Sciences, Guangzhou University of Chinese Medicine, Guangzhou 510006, People's Republic of China; Key Laboratory of Chinese Medicinal Resource from Lingnan (Guangzhou University of Chinese Medicine), Ministry of Education, Guangzhou, China; State Key Laboratory of Dampness Syndrome of Chinese Medicine, Guangzhou University of Chinese Medicine, Guangzhou, China.
13
Kaur H, Singh V, Kalia M, Mohan B, Taneja N. Identification and functional annotation of hypothetical proteins of uropathogenic Escherichia coli strain CFT073 towards designing antimicrobial drug targets. J Biomol Struct Dyn 2022; 40:14084-14095. [PMID: 34751095 DOI: 10.1080/07391102.2021.2000499] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 12/29/2022]
Abstract
Urinary tract infections are a serious health concern worldwide, especially in developing countries. Escherichia coli strain CFT073 is a highly virulent pathogenic bacterial strain. The CFT073 proteome contains 4897 proteins, of which 992 have been classified as hypothetical proteins. Identification and characterization of hypothetical proteins can aid in the selection of targets for drug design. In this study, we analyzed the hypothetical proteins from the UPEC strain CFT073 using various computational tools. By NCBI-CDD, 376 protein sequences showed conserved domains. Based on the functional motifs in their primary sequences, we classified these 376 hypothetical proteins into 7 functional categories. Further, the KEGG database was used to find the roles of these hypothetical proteins in several pathways. Protein interaction network analysis of hypothetical proteins identified 53 proteins as highly interacting metabolic proteins. Virulence factor analysis of the proteins identified 8 proteins as virulent. We conducted a non-homology search for the identified proteins of UPEC in the available human proteome. We observed that 35 proteins are non-homologous to humans and hence could be selected as drug design targets. Qualitative characterization of the selected 35 non-homologous hypothetical proteins, including essentiality analysis and evaluation of druggability by similarity search against the DrugBank database, was performed. Out of these 35 proteins, three-dimensional structures of six proteins (NP_752562.1, NP_756345.1, NP_754893.1, NP_756600.2, NP_755264.1 and NP_752994.1) could be successfully modelled. These new annotations can help to better understand disease mechanisms at the molecular level, as well as provide new targets for drug development against the UPEC strain CFT073. Communicated by Ramaswamy H. Sarma.
Affiliation(s)
- Harpreet Kaur
- Department of Medical Microbiology, Postgraduate Institute of Medical Education and Research, Chandigarh, India
| | - Vikram Singh
- Center of Computational Biology and Bioinformatics, Central University of Himachal Pradesh, Dharamshala, India
| | - Manmohit Kalia
- Department of Biology, State University of New York, Binghamton, NY, USA
| | - Balvinder Mohan
- Department of Medical Microbiology, Postgraduate Institute of Medical Education and Research, Chandigarh, India
| | - Neelam Taneja
- Department of Medical Microbiology, Postgraduate Institute of Medical Education and Research, Chandigarh, India
|
14
|
Subcellular Localization Prediction of Human Proteins Using Multifeature Selection Methods. BIOMED RESEARCH INTERNATIONAL 2022; 2022:3288527. [PMID: 36132086 PMCID: PMC9484878 DOI: 10.1155/2022/3288527] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 03/25/2022] [Accepted: 08/30/2022] [Indexed: 11/25/2022]
Abstract
Subcellular localization attempts to assign proteins to one of the cell compartments that performs specific biological functions. Finding the link between proteins, biological functions, and subcellular localization is an effective way to investigate the general organization of living cells in a systematic manner. However, determining the subcellular localization of proteins by traditional experimental approaches is difficult. Here, protein–protein interaction networks, functional enrichment on gene ontology and pathway, and a set of proteins having confirmed subcellular localization were applied to build prediction models for human protein subcellular localizations. To build an effective predictive model, we employed a variety of robust machine learning algorithms, including Boruta feature selection, minimum redundancy maximum relevance, Monte Carlo feature selection, and LightGBM. Then, the incremental feature selection method with random forest and support vector machine was used to discover the essential features. Furthermore, 38 key features were determined by integrating the results of different feature selection methods, which may provide critical insights into the subcellular location of proteins. The biological functions of these features in relation to subcellular localization were discussed according to recent publications. In summary, our computational framework can help advance the understanding of subcellular localization prediction techniques and provide a new perspective to investigate the patterns of protein subcellular localization and their biological importance.
|
15
|
Masnoddin M, Ling CMWV, Yusof NA. Functional Analysis of Conserved Hypothetical Proteins from the Antarctic Bacterium, Pedobacter cryoconitis Strain BG5 Reveals Protein Cold Adaptation and Thermal Tolerance Strategies. Microorganisms 2022; 10:1654. [PMID: 36014072 PMCID: PMC9415557 DOI: 10.3390/microorganisms10081654] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/27/2022] [Revised: 08/04/2022] [Accepted: 08/12/2022] [Indexed: 11/16/2022] Open
Abstract
Pedobacter cryoconitis BG5 is an obligate psychrophilic bacterium that was first isolated on King George Island, Antarctica. Over the last 50 years, the West Antarctic, including King George Island, has been one of the most rapidly warming places on Earth, hence making it an excellent area to measure the resilience of living species in warmed areas exposed to the constantly changing environment due to climate change. This bacterium encodes a genome of approximately 5694 protein-coding genes. However, 35% of the gene models for this species are found to be hypothetical proteins (HP). In this study, three conserved HP genes of P. cryoconitis, designated pcbg5hp1, pcbg5hp2 and pcbg5hp12, were cloned and the proteins were expressed, purified and their functions and structures were evaluated. Real-time quantitative PCR analysis revealed that these genes were expressed constitutively, suggesting a potentially important role where the expression of these genes under an almost constant demand might have some regulatory functions in thermal stress tolerance. Functional analysis showed that these proteins maintained their activities at low and moderate temperatures. Meanwhile, a low citrate synthase aggregation at 43 °C in the presence of PCBG5HP1 suggested the characteristics of chaperone activity. Furthermore, our comparative structural analysis demonstrated that the HPs exhibited cold-adapted traits, most notably increased flexibility in their 3D structures compared to their counterparts. Concurrently, the presence of a disulphide bridge and aromatic clusters was attributed to PCBG5HP1’s unusual protein stability and chaperone activity. Thus, this suggested that the HPs examined in this study acquired strategies to maintain a balance between molecular stability and structural flexibility. Conclusively, this study has established the structure–function relationships of the HPs produced by P. cryoconitis and provided crucial experimental evidence indicating their importance in thermal stress response.
Affiliation(s)
- Makdi Masnoddin
- Biotechnology Research Institute, Universiti Malaysia Sabah, Jalan UMS, Kota Kinabalu 88400, Sabah, Malaysia
- Preparatory Centre for Science and Technology, Universiti Malaysia Sabah, Jalan UMS, Kota Kinabalu 88400, Sabah, Malaysia
| | | | - Nur Athirah Yusof
- Biotechnology Research Institute, Universiti Malaysia Sabah, Jalan UMS, Kota Kinabalu 88400, Sabah, Malaysia
|
16
|
Zou H, Yang F, Yin Z. Integrating multiple sequence features for identifying anticancer peptides. Comput Biol Chem 2022; 99:107711. [DOI: 10.1016/j.compbiolchem.2022.107711] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/25/2021] [Revised: 05/16/2022] [Accepted: 05/29/2022] [Indexed: 11/03/2022]
|
17
|
Zhang YH, Li ZD, Zeng T, Chen L, Huang T, Cai YD. Screening gene signatures for clinical response subtypes of lung transplantation. Mol Genet Genomics 2022; 297:1301-1313. [PMID: 35780439 DOI: 10.1007/s00438-022-01918-x] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 09/21/2021] [Accepted: 06/12/2022] [Indexed: 11/30/2022]
Abstract
The lung is the most important organ in the human respiratory system, and its normal function is essential for human beings. Under certain pathological conditions, normal lung function can no longer be maintained, and lung transplantation is generally applied to ease patients' breathing and prolong their lives. However, several risk factors exist during and after lung transplantation, including bleeding, infection, and transplant rejection. In particular, transplant rejection is difficult to predict or prevent, leading to the most dangerous complications and severe status in patients undergoing lung transplantation. Given that the most common monitoring and validation methods for lung transplantation rejection may take quite a long time and have low reproducibility, new technologies and methods are required to improve the efficacy and accuracy of rejection monitoring after lung transplantation. Recently, a previous study established the gene expression profiles of patients who underwent lung transplantation. However, it did not provide a tool to predict lung transplantation responses. Here, a further in-depth investigation was conducted on these profiling data. A computational framework incorporating several machine learning algorithms, such as feature selection methods and classification algorithms, was built to establish an effective prediction model that classifies patients into different clinical subgroups, corresponding to different rejection responses after lung transplantation. Furthermore, the framework also screened essential genes with functional enrichments and created quantitative rules for distinguishing patients with different rejection responses to lung transplantation. The outcome of this contribution could provide guidelines for clinical treatment of each rejection subtype and help reveal the complicated rejection mechanisms of lung transplantation.
Affiliation(s)
- Yu-Hang Zhang
- School of Life Sciences, Shanghai University, Shanghai, 200444, China
- Channing Division of Network Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
| | - Zhan Dong Li
- College of Food Engineering, Jilin Engineering Normal University, Changchun, 130052, China
| | - Tao Zeng
- Bio-Med Big Data Center, CAS Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, 200031, China
| | - Lei Chen
- College of Information Engineering, Shanghai Maritime University, Shanghai, 201306, China
| | - Tao Huang
- Bio-Med Big Data Center, CAS Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, 200031, China.
- CAS Key Laboratory of Tissue Microenvironment and Tumor, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, China.
| | - Yu-Dong Cai
- School of Life Sciences, Shanghai University, Shanghai, 200444, China.
|
18
|
Support matrix machine with pinball loss for classification. Neural Comput Appl 2022. [DOI: 10.1007/s00521-022-07460-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/18/2022]
|
19
|
Wang ZQ, Meng FZ, Yin LF, Yin WX, Lv L, Yang XL, Chang XQ, Zhang S, Luo CX. Transcriptomic Analysis of Resistant and Wild-Type Isolates Revealed Fludioxonil as a Candidate for Controlling the Emerging Isoprothiolane Resistant Populations of Magnaporthe oryzae. Front Microbiol 2022; 13:874497. [PMID: 35464942 PMCID: PMC9024399 DOI: 10.3389/fmicb.2022.874497] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/12/2022] [Accepted: 03/23/2022] [Indexed: 11/22/2022] Open
Abstract
The point mutation R343W in MoIRR, a putative Zn2Cys6 transcription factor, introduces isoprothiolane (IPT) resistance in Magnaporthe oryzae. However, the function of MoIRR has not been characterized. In this study, the function of MoIRR was investigated by subcellular localization observation, transcriptional autoactivation test, and transcriptomic analysis. As expected, GFP-tagged MoIRR was localized to the nucleus, and its C-terminus could autonomously activate the expression of reporter genes HIS3 and α-galactosidase in the absence of any prey proteins in Y2HGold, suggesting that MoIRR was a typical transcription factor. Transcriptomic analysis was then performed for resistant mutant 1a_mut (R343W), knockout transformant ΔMoIRR-1, and their parental wild-type isolate H08-1a. Upregulated genes in both 1a_mut and ΔMoIRR-1 were involved in fungicide resistance-related KEGG pathways, including the glycerophospholipid metabolism and Hog1 MAPK pathways. All MoIRR deficiency-related IPT-resistant strains exhibited increased susceptibility to fludioxonil (FLU) that was due to the upregulation of Hog1 MAPK pathway genes. The results indicated a correlation between FLU susceptibility and MoIRR deficiency-related IPT resistance in M. oryzae. Thus, using a mixture of IPT and FLU could be a strategy to manage the IPT-resistant populations of M. oryzae in rice fields.
Affiliation(s)
- Zuo-Qian Wang
- Institute of Plant Protection and Soil Science, Hubei Academy of Agricultural Sciences, Wuhan, China
- Key Laboratory of Integrated Pest Management on Crops in Central China, Ministry of Agriculture, Wuhan, China
| | - Fan-Zhu Meng
- Department of Plant Pathology, College of Plant Science and Technology, Huazhong Agricultural University, Wuhan, China
- The Key Lab of Crop Disease Monitoring and Safety Control in Hubei Province, Huazhong Agricultural University, Wuhan, China
| | - Liang-Fen Yin
- Department of Plant Pathology, College of Plant Science and Technology, Huazhong Agricultural University, Wuhan, China
- The Key Lab of Crop Disease Monitoring and Safety Control in Hubei Province, Huazhong Agricultural University, Wuhan, China
| | - Wei-Xiao Yin
- Department of Plant Pathology, College of Plant Science and Technology, Huazhong Agricultural University, Wuhan, China
- The Key Lab of Crop Disease Monitoring and Safety Control in Hubei Province, Huazhong Agricultural University, Wuhan, China
| | - Liang Lv
- Institute of Plant Protection and Soil Science, Hubei Academy of Agricultural Sciences, Wuhan, China
- Key Laboratory of Integrated Pest Management on Crops in Central China, Ministry of Agriculture, Wuhan, China
| | - Xiao-Lin Yang
- Institute of Plant Protection and Soil Science, Hubei Academy of Agricultural Sciences, Wuhan, China
- Key Laboratory of Integrated Pest Management on Crops in Central China, Ministry of Agriculture, Wuhan, China
| | - Xiang-Qian Chang
- Institute of Plant Protection and Soil Science, Hubei Academy of Agricultural Sciences, Wuhan, China
- Key Laboratory of Integrated Pest Management on Crops in Central China, Ministry of Agriculture, Wuhan, China
| | - Shu Zhang
- Institute of Plant Protection and Soil Science, Hubei Academy of Agricultural Sciences, Wuhan, China
- Key Laboratory of Integrated Pest Management on Crops in Central China, Ministry of Agriculture, Wuhan, China
- *Correspondence: Shu Zhang
| | - Chao-Xi Luo
- Department of Plant Pathology, College of Plant Science and Technology, Huazhong Agricultural University, Wuhan, China
- The Key Lab of Crop Disease Monitoring and Safety Control in Hubei Province, Huazhong Agricultural University, Wuhan, China
- *Correspondence: Chao-Xi Luo
|
20
|
Alkhadrawi AM, Wang Y, Li C. In-silico screening of potential target transporters for glycyrrhetinic acid (GA) via deep learning prediction of drug-target interactions. Biochem Eng J 2022. [DOI: 10.1016/j.bej.2022.108375] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/02/2022]
|
21
|
Hooper CM, Castleden IR, Tanz SK, Grasso SV, Millar AH. Subcellular Proteomics as a Unified Approach of Experimental Localizations and Computed Prediction Data for Arabidopsis and Crop Plants. ADVANCES IN EXPERIMENTAL MEDICINE AND BIOLOGY 2022; 1346:67-89. [PMID: 35113396 DOI: 10.1007/978-3-030-80352-0_4] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 12/18/2022]
Abstract
In eukaryotic organisms, subcellular protein location is critical in defining protein function and understanding sub-functionalization of gene families. Some proteins have defined locations, whereas others have low specificity targeting and complex accumulation patterns. There is no single approach that can be considered entirely adequate for defining the in vivo location of all proteins. By combining evidence from different approaches, the strengths and weaknesses of different technologies can be estimated, and a location consensus can be built. The Subcellular Location of Proteins in Arabidopsis database ( http://suba.live/ ) combines experimental data sets that have been reported in the literature and is analyzing these data to provide useful tools for biologists to interpret their own data. Foremost among these tools is a consensus classifier (SUBAcon) that computes a proposed location for all proteins based on balancing the experimental evidence and predictions. Further tools analyze sets of proteins to define the abundance of cellular structures. Extending these types of resources to plant crop species has been complex due to polyploidy, gene family expansion and contraction, and the movement of pathways and processes within cells across the plant kingdom. The Crop Proteins of Annotated Location database ( http://crop-pal.org/ ) has developed a range of subcellular location resources including a species-specific voting consensus for 12 plant crop species that offers collated evidence and filters for current crop proteomes akin to SUBA. Comprehensive cross-species comparison of these data shows that the sub-cellular proteomes (subcellulomes) depend only to some degree on phylogenetic relationship and are more conserved in major biosynthesis than in metabolic pathways. Together SUBA and cropPAL created reference subcellulomes for plants as well as species-specific subcellulomes for cross-species data mining. 
These data collections are increasingly used by the research community to provide a subcellular protein location layer, inform models of compartmented cell function and protein-protein interaction networks, guide future molecular crop breeding strategies, or simply answer a specific question: where is my protein of interest inside the cell?
Affiliation(s)
- Cornelia M Hooper
- The Centre of Excellence in Plant Energy Biology, The University of Western Australia, Crawley, WA, Australia
| | - Ian R Castleden
- The Centre of Excellence in Plant Energy Biology, The University of Western Australia, Crawley, WA, Australia
| | - Sandra K Tanz
- The Centre of Excellence in Plant Energy Biology, The University of Western Australia, Crawley, WA, Australia
| | - Sally V Grasso
- The Centre of Excellence in Plant Energy Biology, The University of Western Australia, Crawley, WA, Australia
| | - A Harvey Millar
- The Centre of Excellence in Plant Energy Biology, The University of Western Australia, Crawley, WA, Australia.
|
22
|
Zhang Z, Gong Y, Gao B, Li H, Gao W, Zhao Y, Dong B. SNAREs-SAP: SNARE Proteins Identification With PSSM Profiles. Front Genet 2022; 12:809001. [PMID: 34987554 PMCID: PMC8721734 DOI: 10.3389/fgene.2021.809001] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 11/04/2021] [Accepted: 11/15/2021] [Indexed: 12/20/2022] Open
Abstract
Soluble N-ethylmaleimide sensitive factor activating protein receptor (SNARE) proteins are a large family of transmembrane proteins located in organelles and vesicles. The important roles of SNARE proteins include initiating the vesicle fusion process and activating and fusing proteins as they undergo exocytosis activity, and SNARE proteins are also vital for the transport regulation of membrane proteins and non-regulatory vesicles. Therefore, there is great significance in establishing a method to efficiently identify SNARE proteins. However, the identification accuracy of existing methods such as SNARE CNN is not satisfactory. In our study, we developed a method based on a support vector machine (SVM) that can effectively recognize SNARE proteins. We used the position-specific scoring matrix (PSSM) method to extract features of SNARE protein sequences, used the support vector machine recursive elimination correlation bias reduction (SVM-RFE-CBR) algorithm to rank the importance of features, and then screened out the optimal subset of feature data based on the sorted results. We input the feature data into the model when building the model, used 10-fold cross-validation for training, and tested model performance by using an independent dataset. In independent tests, our method identified SNARE proteins with a sensitivity of 68%, specificity of 94%, accuracy of 92%, area under the curve (AUC) of 84%, and Matthews correlation coefficient (MCC) of 0.48. The results of the experiment show that the common evaluation indicators of our method are excellent, indicating that our method performs better than other existing classification methods in identifying SNARE proteins.
Affiliation(s)
- Zixiao Zhang
- College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
| | - Yue Gong
- College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
| | - Bo Gao
- Department of Radiology, The Second Affiliated Hospital, Harbin Medical University, Harbin, China
| | - Hongfei Li
- College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
| | - Wentao Gao
- College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
| | - Yuming Zhao
- College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
| | - Benzhi Dong
- College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
|
23
|
Gull S, Minhas F. AMP 0: Species-Specific Prediction of Anti-microbial Peptides Using Zero and Few Shot Learning. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022; 19:275-283. [PMID: 32750857 DOI: 10.1109/tcbb.2020.2999399] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Indexed: 06/11/2023]
Abstract
Evolution of drug-resistant microbial species is one of the major challenges to global health. Development of new antimicrobial treatments such as antimicrobial peptides needs to be accelerated to combat this threat. However, the discovery of novel antimicrobial peptides is hampered by low-throughput biochemical assays. Computational techniques can be used for rapid screening of promising antimicrobial peptide candidates prior to testing in the wet lab. The vast majority of existing antimicrobial peptide predictors are non-targeted in nature, i.e., they can predict whether a given peptide sequence is antimicrobial, but they are unable to predict whether the sequence can target a particular microbial species. In this work, we have used zero and few shot machine learning to develop a targeted antimicrobial peptide activity predictor called AMP0. The proposed predictor takes the sequence of a peptide and any N/C-termini modifications together with the genomic sequence of a microbial species to generate targeted predictions. Cross-validation results show that the proposed scheme is particularly effective for targeted antimicrobial prediction in comparison to existing approaches and can be used for screening potential antimicrobial peptides in a targeted manner with only a small number of training examples for novel species. AMP0 webserver is available at http://ampzero.pythonanywhere.com.
|
24
|
Performance Evaluation of Machine Learning Algorithms for Stock Price and Stock Index Movement Prediction Using Trend Deterministic Data Prediction. INTERNATIONAL JOURNAL OF APPLIED METAHEURISTIC COMPUTING 2022. [DOI: 10.4018/ijamc.292511] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/09/2022]
Abstract
This experimental study addresses the problem of predicting the direction of stocks and the movement of stock price indices for three major stocks and stock indices. The proposed approach for processing input data involves the computation of ten technical indicators using stock trading data. The dataset used for the evaluation of all the prediction models consists of 11 years of historical data from January 2007 to December 2017. The study comprises four prediction models: Long Short-Term Memory, XGBoost, Support Vector Machine, and Random Forest. Accuracy scores and F1 scores for each of the prediction models have been evaluated using this input approach. Experimental results reveal that a continuous data approach using ten technical indicators gives the best performance in the case of the Random Forest classifier model, with the highest accuracy of 84.89% (average 83.74%) and highest F1 score of 89.33% (average 83.74%). The experiments also give us an insight into why a Naïve Bayes classification model is not a suitable prediction model for the above task.
|
25
|
Lee C, Youn HJ, Lee SH, Kim J, Son D, Cha JY. Orchardgrass ACTIVATOR OF HSP90 ATPASE possesses autonomous chaperone properties and activates Hsp90 transcription to enhance thermotolerance. Biochem Biophys Res Commun 2022; 586:171-176. [PMID: 34856417 DOI: 10.1016/j.bbrc.2021.11.080] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/09/2021] [Accepted: 11/22/2021] [Indexed: 11/02/2022]
Abstract
High temperature stress is an environmental factor that negatively affects the growth and development of crops. Hsp90 (90 kDa heat shock protein) is a major molecular chaperone in eukaryotic cells, contributing to the maintenance of cell homeostasis through interaction with co-chaperones. Aha1 (activator of Hsp90 ATPase) is well known as a co-chaperone that activates ATPase activity of Hsp90 in mammals. However, biochemical and physiological evidence relating to Aha has not yet been identified in plants. In this study, we investigated the heat-tolerance function of orchardgrass (Dactylis glomerata L.) Aha (DgAha). Recombinant DgAha interacted with cytosolic DgHsp90s and efficiently protected substrates from thermal denaturation. Furthermore, heterologous expression of DgAha in yeast (Saccharomyces cerevisiae) cells and Arabidopsis (Arabidopsis thaliana) plants conferred thermotolerance in vivo. Enhanced expression of DgAha in Arabidopsis stimulates the transcription of Hsp90 under heat stress. Our data demonstrate that plant Aha plays a positive role in heat stress tolerance via chaperone properties and/or activation of Hsp90 to protect substrate proteins in plants from thermal injury.
Affiliation(s)
- Changhoon Lee
- Department of Plant Medicine, Institute of Agriculture and Life Science, Gyeongsang National University, Jinju, 52828, Republic of Korea
| | - Ho Jin Youn
- Department of Plant Medicine, Institute of Agriculture and Life Science, Gyeongsang National University, Jinju, 52828, Republic of Korea
| | - Sang-Hoon Lee
- Grassland & Forages Division, National Institute of Animal Science, Rural Development Administration, Cheonan, 31000, Republic of Korea
| | - Jinwoo Kim
- Department of Plant Medicine, Institute of Agriculture and Life Science, Gyeongsang National University, Jinju, 52828, Republic of Korea
| | - Daeyoung Son
- Department of Plant Medicine, Institute of Agriculture and Life Science, Gyeongsang National University, Jinju, 52828, Republic of Korea.
| | - Joon-Yung Cha
- Research Institute of Life Sciences, Gyeongsang National University, Jinju, 52828, Republic of Korea.
|
26
|
Valizadeh M, Sohrabi M, Ameri Braki Z, Rashidi R, Pezeshkpur M. Investigation of spectrophotometric simultaneous absorption of Salmeterol and Fluticasone in Seroflo spray by continuous wavelet transform and radial basis function neural network methods. SPECTROCHIMICA ACTA. PART A, MOLECULAR AND BIOMOLECULAR SPECTROSCOPY 2021; 263:120192. [PMID: 34314967 DOI: 10.1016/j.saa.2021.120192] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/07/2021] [Revised: 06/06/2021] [Accepted: 07/13/2021] [Indexed: 06/13/2023]
Abstract
In this research, the simultaneous absorption of Salmeterol (SAL) and Fluticasone (FLU) in Seroflo spray was investigated spectrophotometrically using continuous wavelet transform (CWT) and radial basis function neural network (RBF-NN) methods. The root mean square error (RMSE) of the RBF-NN model was 3.17 × 10⁻¹³ and 1.41 × 10⁻¹³ for SAL and FLU, respectively. Limit of detection (LOD) and limit of quantification (LOQ) corresponding to the CWT method were 0.004, 0.280 μg/mL, and 0.431, 0.479 μg/mL for SAL and FLU, respectively. Finally, the results obtained from all methods were compared with high-performance liquid chromatography (HPLC) as a reference method. According to one-way analysis of variance at a 95% confidence level, there is no significant difference between the proposed techniques and HPLC. Therefore, the chemometric methods are sufficiently accurate relative to the reference method for the analysis of these drugs. The suggested methods are simple, fast, and cheap, and require no pre-preparation steps. These methods can be used in quality control laboratories in the pharmaceutical industry.
Affiliation(s)
- Maryam Valizadeh
- Department of Chemistry, North Tehran Branch, Islamic Azad University, Tehran, Iran.
| | - Melika Sohrabi
- Faculty of Veterinary Medicine, Science and Research Branch, Islamic Azad University, Tehran, Iran
| | - Zahra Ameri Braki
- Department of Chemistry, North Tehran Branch, Islamic Azad University, Tehran, Iran
| | - Rashed Rashidi
- Faculty of Civil, Water and Environmental engineering, Shahid Beheshti University of Iran, Tehran, Iran
| | - Maryam Pezeshkpur
- Department of Chemistry, North Tehran Branch, Islamic Azad University, Tehran, Iran
|
27
|
Ullah M, Han K, Hadi F, Xu J, Song J, Yu DJ. PScL-HDeep: image-based prediction of protein subcellular location in human tissue using ensemble learning of handcrafted and deep learned features with two-layer feature selection. Brief Bioinform 2021; 22:bbab278. [PMID: 34337652 PMCID: PMC8574991 DOI: 10.1093/bib/bbab278] [Citation(s) in RCA: 28] [Impact Index Per Article: 7.0] [Received: 04/11/2021] [Revised: 06/30/2021] [Accepted: 07/01/2021] [Indexed: 01/17/2023] Open
Abstract
Protein subcellular localization plays a crucial role in characterizing the function of proteins and understanding various cellular processes. Therefore, accurate identification of protein subcellular location is an important yet challenging task. Numerous computational methods have been proposed to predict the subcellular location of proteins. However, most existing methods are limited in overall accuracy, time consumption, and generalization power. To address these problems, in this study, we developed a novel computational approach based on Human Protein Atlas (HPA) data, referred to as PScL-HDeep, for accurate and efficient image-based prediction of protein subcellular location in human tissues. We extracted handcrafted features and deep-learned features (the latter by employing a pretrained deep learning model) from different viewpoints of the image. The step-wise discriminant analysis (SDA) algorithm was applied to generate the optimal feature set from each original raw feature set. To obtain a still more informative feature subset, the support vector machine-based recursive feature elimination with correlation bias reduction (SVM-RFE + CBR) feature selection algorithm was applied to the integrated feature set. Finally, the classification models, namely a support vector machine with radial basis function kernel (SVM-RBF) and a support vector machine with linear kernel (SVM-LNR), were learned on the final selected feature set. To evaluate the performance of the proposed method, a new gold standard benchmark training dataset was constructed from the HPA databank. PScL-HDeep achieved the maximum performance in a 10-fold cross-validation test on this dataset and showed better efficacy than existing predictors. Furthermore, we illustrated the generalization ability of the proposed method by conducting a stringent independent validation test.
Affiliation(s)
- Matee Ullah
- Nanjing University of Science and Technology, China
| | - Ke Han
- School of Computer Science and Engineering, Nanjing University of Science and Technology, China
| | - Fazal Hadi
- Pakistan Institute of Engineering and Applied Sciences, Islamabad, Pakistan
| | - Jian Xu
- School of Computer Science and Engineering, Nanjing University of Science and Technology, China
| | - Jiangning Song
- Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, Australia
| | - Dong-Jun Yu
- School of Computer Science and Engineering, Nanjing University of Science and Technology, China
| |
|
28
|
Bin Hafeez A, Jiang X, Bergen PJ, Zhu Y. Antimicrobial Peptides: An Update on Classifications and Databases. Int J Mol Sci 2021; 22:11691. [PMID: 34769122 PMCID: PMC8583803 DOI: 10.3390/ijms222111691] [Citation(s) in RCA: 166] [Impact Index Per Article: 41.5] [Received: 09/22/2021] [Revised: 10/24/2021] [Accepted: 10/25/2021] [Indexed: 02/06/2023]
Abstract
Antimicrobial peptides (AMPs) are distributed across all kingdoms of life and are an indispensable component of host defenses. They consist of predominantly short cationic peptides with a wide variety of structures and targets. Given the ever-emerging resistance of various pathogens to existing antimicrobial therapies, AMPs have recently attracted extensive interest as potential therapeutic agents. As the discovery of new AMPs has increased, many databases specializing in AMPs have been developed to collect both fundamental and pharmacological information. In this review, we summarize the sources, structures, modes of action, and classifications of AMPs. Additionally, we examine current AMP databases, compare valuable computational tools used to predict antimicrobial activity and mechanisms of action, and highlight new machine learning approaches that can be employed to improve AMP activity to combat global antimicrobial resistance.
Affiliation(s)
- Ahmer Bin Hafeez
- Centre of Biotechnology and Microbiology, University of Peshawar, Peshawar 25120, Pakistan
| | - Xukai Jiang
- Infection and Immunity Program, Department of Microbiology, Biomedicine Discovery Institute, Monash University, Clayton, VIC 3800, Australia
- National Glycoengineering Research Center, Shandong University, Qingdao 266237, China
| | - Phillip J. Bergen
- Infection and Immunity Program, Department of Microbiology, Biomedicine Discovery Institute, Monash University, Clayton, VIC 3800, Australia
| | - Yan Zhu
- Infection and Immunity Program, Department of Microbiology, Biomedicine Discovery Institute, Monash University, Clayton, VIC 3800, Australia
| |
|
29
|
Jiang Y, Wang D, Wang W, Xu D. Computational methods for protein localization prediction. Comput Struct Biotechnol J 2021; 19:5834-5844. [PMID: 34765098 PMCID: PMC8564054 DOI: 10.1016/j.csbj.2021.10.023] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.8] [Received: 03/31/2021] [Revised: 10/12/2021] [Accepted: 10/13/2021] [Indexed: 12/16/2022]
Abstract
The accurate annotation of protein localization is crucial in understanding protein function in tandem with a broad range of applications such as pathological analysis and drug design. Since most proteins do not have experimentally-determined localization information, the computational prediction of protein localization has been an active research area for more than two decades. In particular, recent machine-learning advancements have fueled the development of new methods in protein localization prediction. In this review paper, we first categorize the main features and algorithms used for protein localization prediction. Then, we summarize a list of protein localization prediction tools in terms of their coverage, characteristics, and accessibility to help users find suitable tools based on their needs. Next, we evaluate some of these tools on a benchmark dataset. Finally, we provide an outlook on the future exploration of protein localization methods.
Affiliation(s)
- Yuexu Jiang
- Department of Electrical Engineering and Computer Science, Bond Life Sciences Center, University of Missouri, Columbia, MO, USA
| | - Duolin Wang
- Department of Electrical Engineering and Computer Science, Bond Life Sciences Center, University of Missouri, Columbia, MO, USA
| | - Weiwei Wang
- Department of Electrical Engineering and Computer Science, Bond Life Sciences Center, University of Missouri, Columbia, MO, USA
| | - Dong Xu
- Department of Electrical Engineering and Computer Science, Bond Life Sciences Center, University of Missouri, Columbia, MO, USA
| |
|
30
|
Aoki T, Takadama K, Sato H. Adaptive Synapse Arrangement in Cortical Learning Algorithm. Journal of Advanced Computational Intelligence and Intelligent Informatics 2021. [DOI: 10.20965/jaciii.2021.p0450] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/09/2022]
Abstract
The cortical learning algorithm (CLA) is a time-series data prediction method that is designed based on the human neocortex. The CLA has multiple columns that are associated with the input data bits by synapses. The input data is then converted into an internal column representation based on the synapse relation. Because the synapse relation between the columns and input data bits is fixed during the entire prediction process in the conventional CLA, it cannot adapt to input data biases. Consequently, columns not used for internal representations arise, resulting in a low prediction accuracy in the conventional CLA. To improve the prediction accuracy of the CLA, we propose a CLA that self-adaptively arranges the column synapses according to the input data tendencies and verify its effectiveness with several artificial time-series data and real-world electricity load prediction data from New York City. Experimental results show that the proposed CLA achieves higher prediction accuracy than the conventional CLA and LSTMs with different network optimization algorithms by arranging column synapses according to the input data tendency.
|
31
|
Ahmad F, Ikram S, Ahmad J, Ullah W, Hassan F, Khattak SU, Irshad Ur Rehman. GASPIDs Versus Non-GASPIDs - Differentiation Based on Machine Learning Approach. Curr Bioinform 2021. [DOI: 10.2174/1574893615999200425225729] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Indexed: 11/22/2022]
Abstract
Background:
Peptidases are a group of enzymes which catalyze the cleavage of peptide bonds. Around 2-3% of the whole genome codes for proteases, and about one-third of all known proteases are serine proteases, which are divided into 13 clans and 40 families. They are involved in diverse physiological roles such as digestion, coagulation of blood, fibrinolysis, processing of proteins and prohormones, signaling pathways, and complement fixation, and they have a vital role in the immune defense system. Based on their functions, they can broadly be divided into two classes: GASPIDs (Granule Associated Serine Peptidases involved in Immune Defense System) and Non-GASPIDs. GASPIDs, in particular, are involved in immune-associated functions, i.e., initiating apoptosis to kill virally infected and cancerous cells, cytokine modulation for the generation of inflammatory responses, and direct killing of pathogens through phagosomes.
Methods:
In this study, sequence-based characterization of these two types of serine proteases is performed. We first identified sequences by analyzing multiple online databases as well as by analyzing whole genomes of different orthologous and non-orthologous species. Sequences were identified by devising a distinct criterion to differentiate GASPIDs from Non-GASPIDs. The translated version of these sequences was then subjected to feature extraction. Using these distinctive features, we differentiated GASPIDs from Non-GASPIDs by applying multiple supervised machine learning models.
Results and Conclusion:
Our results show that, among the three classifiers used in this study, the SVM classifier coupled with the tripeptide feature method showed the best accuracy in classifying sequences as GASPIDs or Non-GASPIDs.
Affiliation(s)
- Fawad Ahmad
- Centre of Biotechnology & Microbiology, University of Peshawar, Peshawar, Pakistan
| | - Saima Ikram
- Centre of Biotechnology & Microbiology, University of Peshawar, Peshawar, Pakistan
| | - Jamshaid Ahmad
- Centre of Biotechnology & Microbiology, University of Peshawar, Peshawar, Pakistan
| | - Waseem Ullah
- College of Software Convergence, Sejong University, Seoul, South Korea
| | - Fahad Hassan
- Centre of Biotechnology & Microbiology, University of Peshawar, Peshawar, Pakistan
| | - Saeed Ullah Khattak
- Centre of Biotechnology & Microbiology, University of Peshawar, Peshawar, Pakistan
| | - Irshad Ur Rehman
- Centre of Biotechnology & Microbiology, University of Peshawar, Peshawar, Pakistan
| |
|
32
|
Kaur H, Kalia M, Singh V, Modgil V, Mohan B, Taneja N. In silico identification and characterization of promising drug targets in highly virulent uropathogenic Escherichia coli strain CFT073 by protein-protein interaction network analysis. Informatics in Medicine Unlocked 2021. [DOI: 10.1016/j.imu.2021.100704] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 12/29/2022]
|
33
|
Semwal R, Varadwaj PK. HumDLoc: Human Protein Subcellular Localization Prediction Using Deep Neural Network. Curr Genomics 2020; 21:546-557. [PMID: 33214771 PMCID: PMC7604748 DOI: 10.2174/1389202921999200528160534] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 01/10/2020] [Revised: 03/27/2020] [Accepted: 03/30/2020] [Indexed: 11/24/2022]
Abstract
Aims:
To develop a tool that can annotate subcellular localization of human proteins.
Background:
With the progression of high throughput human proteomics projects, an enormous amount of protein sequence data has been discovered in the recent past. All these raw sequence data require precise mapping and annotation for their respective biological role and functional attributes. The functional characteristics of protein molecules are highly dependent on the subcellular localization/compartment. Therefore, a fully automated and reliable protein subcellular localization prediction system would be very useful for current proteomic research.
Objective:
To develop a machine learning-based predictive model that can annotate the subcellular localization of human proteins with high accuracy and precision.
Methods:
In this study, we used the PSI-CD-HIT homology criterion and utilized the sequence-based features of protein sequences to develop a powerful subcellular localization predictive model. The dataset used to train the HumDLoc model was extracted from a reliable data source, Uniprot knowledge base, which helps the model to generalize on the unseen dataset.
Results:
The proposed model, HumDLoc, was compared with two of the most widely used techniques: CELLO and DeepLoc, and other machine learning-based tools. The result demonstrated promising predictive performance of HumDLoc model based on various machine learning parameters such as accuracy (≥97.00%), precision (≥0.86), recall (≥0.89), MCC score (≥0.86), ROC curve (0.98 square unit), and precision-recall curve (0.93 square unit).
Conclusion:
In conclusion, HumDLoc was able to outperform several alternative tools for correctly predicting subcellular localization of human proteins. The HumDLoc has been hosted as a web-based tool at https://bioserver.iiita.ac.in/HumDLoc/.
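The evaluation measures named in this abstract (accuracy, precision, recall, MCC) have standard definitions in terms of confusion-matrix counts. The sketch below shows those definitions for the binary case only; the function name `binary_metrics` and the example counts are illustrative assumptions, and the abstract does not state how HumDLoc aggregates these measures across multiple localization classes.

```python
from math import sqrt


def binary_metrics(tp, fp, tn, fn):
    """Standard binary classification metrics from confusion-matrix counts.

    Hypothetical helper for illustration; not taken from the HumDLoc paper.
    Each ratio falls back to 0.0 when its denominator is zero.
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # Matthews correlation coefficient: ranges from -1 to 1.
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "mcc": mcc,
    }


if __name__ == "__main__":
    # Made-up counts, purely to exercise the function.
    print(binary_metrics(tp=90, fp=10, tn=85, fn=15))
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, which is why it is often reported alongside accuracy when classes are imbalanced.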
Affiliation(s)
- Rahul Semwal
- Department of Information Technology (Bioinformatics), Indian Institute of Information Technology-Allahabad, Jhalwa, Prayagraj, India; Department of Bioinformatics and Applied Science, Indian Institute of Information Technology-Allahabad, Jhalwa, Prayagraj, India
| | - Pritish Kumar Varadwaj
- Department of Information Technology (Bioinformatics), Indian Institute of Information Technology-Allahabad, Jhalwa, Prayagraj, India; Department of Bioinformatics and Applied Science, Indian Institute of Information Technology-Allahabad, Jhalwa, Prayagraj, India
| |
|
34
|
Antonakoudis A, Barbosa R, Kotidis P, Kontoravdi C. The era of big data: Genome-scale modelling meets machine learning. Comput Struct Biotechnol J 2020; 18:3287-3300. [PMID: 33240470 PMCID: PMC7663219 DOI: 10.1016/j.csbj.2020.10.011] [Citation(s) in RCA: 48] [Impact Index Per Article: 9.6] [Received: 08/01/2020] [Revised: 10/07/2020] [Accepted: 10/08/2020] [Indexed: 12/15/2022]
Abstract
With omics data being generated at an unprecedented rate, genome-scale modelling has become pivotal in its organisation and analysis. However, machine learning methods have been gaining ground in cases where knowledge is insufficient to represent the mechanisms underlying such data or as a means for data curation prior to attempting mechanistic modelling. We discuss the latest advances in genome-scale modelling and the development of optimisation algorithms for network and error reduction, intracellular constraining and applications to strain design. We further review applications of supervised and unsupervised machine learning methods to omics datasets from microbial and mammalian cell systems and present efforts to harness the potential of both modelling approaches through hybrid modelling.
Affiliation(s)
- Cleo Kontoravdi
- Department of Chemical Engineering, Imperial College London, London SW7 2AZ, United Kingdom
| |
|
35
|
Bian H, Guo M, Wang J. Recognition of Mitochondrial Proteins in Plasmodium Based on the Tripeptide Composition. Front Cell Dev Biol 2020; 8:578901. [PMID: 33043014 PMCID: PMC7525148 DOI: 10.3389/fcell.2020.578901] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 07/01/2020] [Accepted: 08/13/2020] [Indexed: 01/31/2023]
Abstract
Mitochondria play essential roles in eukaryotic cells, especially in Plasmodium cells. They have several unusual evolutionary and functional features that are vital for disease diagnosis and drug design. Thus, predicting mitochondrial proteins of Plasmodium has become a worthwhile task. However, existing computational methods can only predict mitochondrial proteins of Plasmodium falciparum (P. falciparum for short), and these methods have low accuracy. It is highly desirable to design a classifier with high accuracy for predicting mitochondrial proteins for all Plasmodium species, not only P. falciparum. We proposed a novel method, named PM-OTC, for predicting mitochondrial proteins in Plasmodium. PM-OTC uses the Support Vector Machine (SVM) as the classifier and the selected tripeptide composition as the features. We adopted the 5-fold cross-validation method to train and test PM-OTC. Results demonstrate that PM-OTC achieves an accuracy of 94.91% and that its performance is superior to that of other methods.
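Tripeptide composition, the feature type named in this abstract, is conventionally the normalized frequency of each of the 8000 possible three-residue words in a sequence. The following is a minimal sketch under that conventional definition; the names `tripeptide_composition`, `AMINO_ACIDS`, and `TRIPEPTIDE_INDEX` are illustrative, and the feature-selection step that yields PM-OTC's "selected" tripeptides is not described in the abstract and is not reproduced here.

```python
from itertools import product

# The 20 standard amino acids; all 20**3 = 8000 ordered triples.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
TRIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=3)]
TRIPEPTIDE_INDEX = {t: i for i, t in enumerate(TRIPEPTIDES)}


def tripeptide_composition(sequence):
    """Return an 8000-dimensional vector of tripeptide frequencies.

    Overlapping windows of length 3 are counted; windows containing a
    non-standard residue (e.g. 'X') are skipped. Frequencies are
    normalized by the number of counted windows.
    """
    seq = sequence.upper()
    counts = [0] * len(TRIPEPTIDES)
    total = 0
    for i in range(len(seq) - 2):
        idx = TRIPEPTIDE_INDEX.get(seq[i:i + 3])
        if idx is not None:
            counts[idx] += 1
            total += 1
    if total == 0:
        return [0.0] * len(TRIPEPTIDES)
    return [c / total for c in counts]


if __name__ == "__main__":
    vec = tripeptide_composition("ACDACD")
    # Windows: ACD, CDA, DAC, ACD
    print(len(vec), vec[TRIPEPTIDE_INDEX["ACD"]])
```

A vector of this kind would then be passed to a classifier such as an SVM; because most of the 8000 entries are zero for any single protein, some form of feature selection is commonly applied first.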
Affiliation(s)
- Haodong Bian
- School of Computer Science, Inner Mongolia University, Hohhot, China
| | - Maozu Guo
- School of Electrical and Information Engineering, Beijing University of Civil Engineering and Architecture, Beijing, China; Beijing Key Laboratory of Intelligent Processing for Building Big Data, Beijing, China
| | - Juan Wang
- School of Computer Science, Inner Mongolia University, Hohhot, China; State Key Laboratory of Reproductive Regulation & Breeding of Grassland Livestock, Hohhot, China
| |
|
36
|
Zhang YH, Pan X, Zeng T, Chen L, Huang T, Cai YD. Identifying the RNA signatures of coronary artery disease from combined lncRNA and mRNA expression profiles. Genomics 2020; 112:4945-4958. [PMID: 32919019 DOI: 10.1016/j.ygeno.2020.09.016] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 05/03/2020] [Revised: 07/28/2020] [Accepted: 09/05/2020] [Indexed: 12/23/2022]
Abstract
Coronary artery disease (CAD) is the most common cardiovascular disease. CAD research has greatly progressed during the past decade. mRNA profiling is a traditional and popular approach for investigating various diseases, including CAD. Compared with mRNA, lncRNA has better stability and thus may serve as a better disease indicator in blood. Investigating potential CAD-related lncRNAs and mRNAs will greatly contribute to the diagnosis and treatment of CAD. In this study, a computational analysis was conducted on patients with CAD by using a comprehensive transcription dataset with combined mRNA and lncRNA expression data. Several machine learning algorithms, including feature selection methods and classification algorithms, were applied to screen for the most CAD-related RNA molecules. Decision rules were also reported to provide a quantitative description of the effect of these RNA molecules on CAD progression. These new findings (CAD-related RNA molecules and rules) can help in understanding mRNA and lncRNA expression levels in CAD.
Affiliation(s)
- Yu-Hang Zhang
- School of Life Sciences, Shanghai University, Shanghai 200444, China; Shanghai Institute of Nutrition and Health, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai 200031, China; Channing Division of Network Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA.
| | - Xiaoyong Pan
- Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, Key Laboratory of System Control and Information Processing, Ministry of Education of China, 200240 Shanghai, China.
| | - Tao Zeng
- Shanghai Research Center for Brain Science and Brain-Inspired Intelligence, Shanghai 201210, China.
| | - Lei Chen
- College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China.
| | - Tao Huang
- Shanghai Institute of Nutrition and Health, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai 200031, China.
| | - Yu-Dong Cai
- School of Life Sciences, Shanghai University, Shanghai 200444, China.
| |
|
37
|
Adabor ES, Acquaah-Mensah GK, Mazandu GK. MSclassifier: median-supplement model-based classification tool for automated knowledge discovery. F1000Res 2020; 9:1114. [PMID: 33456763 PMCID: PMC7788522 DOI: 10.12688/f1000research.25501.1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 09/03/2020] [Indexed: 11/20/2022]
Abstract
High-throughput technologies have resulted in an exponential growth of publicly available and accessible datasets for biomedical research. Efficient computational models, algorithms and tools are required to exploit the datasets for knowledge discovery to aid medical decisions. Here, we introduce a new tool, MSclassifier, based on median-supplement approaches to machine learning to enable an automated and effective binary classification for optimal decision making. The MSclassifier package estimates medians of features (attributes) to deduce supplementary data, which is subsequently introduced into the training set for balancing and building superior models for classification. To test our approach, it is used to determine HER2 receptor expression status phenotypes in breast cancer and also predict protein subcellular localization (plasma membrane and nucleus). Using independent sample and cross-validation tests, the performance of MSclassifier is evaluated and compared with well established tools that could perform such tasks. In the HER2 receptor expression status phenotype identification tasks, MSclassifier achieved statistically significant higher classification rates than the best performing existing tool (90.30% versus 89.83%, p=8.62e-3). In the subcellular localization prediction tasks, MSclassifier and one other existing tool achieved equally high performances (93.42% versus 93.19%, p=0.06) although they both outperformed tools based on Naive Bayes classifiers. Overall, the application and evaluation of MSclassifier reveal its potential to be applied to varieties of binary classification problems. The MSclassifier package provides an R-portable and user-friendly application to a broad audience, enabling experienced end-users as well as non-programmers to perform an effective classification in biomedical and other fields of study.
Affiliation(s)
- Emmanuel S. Adabor
- School of Technology, Ghana Institute of Management and Public Administration, Accra, Ghana
| | - George K. Acquaah-Mensah
- Pharmaceutical Sciences Department, Massachusetts College of Pharmacy and Health Sciences, Worcester, MA, USA
| | - Gaston K. Mazandu
- African Institute for Mathematical Sciences and Computational Biology Division, Department of Integrative Biomedical Sciences, Institute of Infectious Disease and Molecular Medicine, University of Cape Town, Cape Town, South Africa
| |
|
38
|
Lunin S, Khrenov M, Glushkova O, Parfenyuk S, Novoselova T, Novoselova E. Precursors of thymic peptides as stress sensors. Expert Opin Biol Ther 2020; 20:1461-1475. [PMID: 32700610 DOI: 10.1080/14712598.2020.1800636] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Indexed: 01/20/2023]
Abstract
Introduction:
A large volume of data indicates that the known thymic hormones, thymulin, thymopoietin, thymosin-α, thymosin-β, and thymic humoral factor-γ2, exhibit different spectra of activities. Although large in volume, available data are rather fragmented, resulting in a lack of understanding of the role played by thymic hormones in immune homeostasis.
Areas covered:
Existing data compartmentalize the effects of thymic peptides into two categories: influence on immune cells and interconnection with neuroendocrine systems. The current study draws attention to a third aspect of the thymic peptide effect that has not yet been clarified, wherein ubiquitous and highly abundant intranuclear precursors of so-called 'thymic peptides' play a fundamental role in all somatic cells.
Expert opinion:
Our analysis indicated that, under certain stress-related conditions, these precursors are cleaved to form immunologically active peptides that rapidly leave the nucleus and intracellular spaces to send 'distress signals' to the immune system, thereby acting as stress sensors. We propose that these peptides may form a link between somatic cells and the immune and neuroendocrine systems. This model may provide a better understanding of the mechanisms underlying immune homeostasis, thereby leading to the development of new therapeutic regimes utilizing the characteristics of thymic peptides.
Affiliation(s)
- Sergey Lunin
- Laboratory of Reception Mechanisms, Institute of Cell Biophysics of the Russian Academy of Sciences, PSCBR RAS, Pushchino, Russia
| | - Maxim Khrenov
- Laboratory of Reception Mechanisms, Institute of Cell Biophysics of the Russian Academy of Sciences, PSCBR RAS, Pushchino, Russia
| | - Olga Glushkova
- Laboratory of Reception Mechanisms, Institute of Cell Biophysics of the Russian Academy of Sciences, PSCBR RAS, Pushchino, Russia
| | - Svetlana Parfenyuk
- Laboratory of Reception Mechanisms, Institute of Cell Biophysics of the Russian Academy of Sciences, PSCBR RAS, Pushchino, Russia
| | - Tatyana Novoselova
- Laboratory of Reception Mechanisms, Institute of Cell Biophysics of the Russian Academy of Sciences, PSCBR RAS, Pushchino, Russia
| | - E Novoselova
- Laboratory of Reception Mechanisms, Institute of Cell Biophysics of the Russian Academy of Sciences, PSCBR RAS, Pushchino, Russia
| |
|
39
|
Bouziane H, Chouarfia A. Use of Chou's 5-steps rule to predict the subcellular localization of gram-negative and gram-positive bacterial proteins by multi-label learning based on gene ontology annotation and profile alignment. J Integr Bioinform 2020; 18:51-79. [PMID: 32598314 PMCID: PMC8035964 DOI: 10.1515/jib-2019-0091] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 11/08/2019] [Accepted: 04/08/2020] [Indexed: 12/31/2022]
Abstract
To date, many proteins generated by large-scale genome sequencing projects are still uncharacterized and subject to intensive investigation by both experimental and computational means. Knowledge of protein subcellular localization (SCL) is of key importance for elucidating protein function. However, SCL prediction remains a challenging task, especially for multi-site proteins, which shuttle between cell compartments to perform their biological functions, and for proteins that lack significant homology to proteins of known subcellular location. Owing to their low cost and reasonable accuracy, machine learning-based methods have gained much attention in this context, helped by the availability of a plethora of biological databases and annotated proteins for analysis and benchmarking. Various predictive models have been proposed to tackle the SCL problem using different protein sequence features pertaining to subcellular localization; however, the overwhelming majority focus on single localization and cover very few cellular locations. Prediction has mainly relied on sorting signals, amino acid composition, and homology. To improve prediction quality, the focus is currently on knowledge extracted from annotation databases, such as protein-protein interactions and Gene Ontology (GO) functional domain annotation, which has recently become widely adopted and essential information for learning systems. In the present study, we treated SCL prediction as a multi-label learning problem and labeled both single-site and multi-site unannotated bacterial protein sequences by mining protein homology relationships using both GO terms of protein homologs and PSI-BLAST profiles. Experiments using 5-fold cross-validation on the benchmark datasets showed a significant improvement in the results obtained by the proposed consensus multi-label prediction model, which discriminates six compartments for Gram-negative and five compartments for Gram-positive bacterial proteins.
Affiliation(s)
- Hafida Bouziane
- Département d’Informatique, Université des Sciences et de la Technologie d’Oran Mohamed Boudiaf, USTO-MB BP 1505, El M’Naouer, 31000, Oran, Algeria
| | - Abdallah Chouarfia
- Département d’Informatique, Université des Sciences et de la Technologie d’Oran Mohamed Boudiaf, USTO-MB BP 1505, El M’Naouer, 31000, Oran, Algeria
| |
|
40
|
Predictions of Apoptosis Proteins by Integrating Different Features Based on Improving Pseudo-Position-Specific Scoring Matrix. Biomed Research International 2020; 2020:4071508. [PMID: 32420339 PMCID: PMC7201498 DOI: 10.1155/2020/4071508] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 09/05/2019] [Accepted: 12/19/2019] [Indexed: 11/25/2022]
Abstract
Apoptosis proteins are strongly related to many diseases and play an indispensable role in maintaining the dynamic balance between cell death and division in vivo. Obtaining localization information on apoptosis proteins is necessary for understanding their function. To date, few researchers have addressed the class imbalance in apoptosis protein data before classification, even though such imbalance tends to cause misclassification. Therefore, in this work, we introduce a method to resolve this problem and to enhance prediction accuracy. Firstly, features of the protein sequence are extracted from the position-specific scoring matrix by combining the Improving Pseudo-Position-Specific Scoring Matrix (IM-Psepssm) with the Bidirectional Correlation Coefficient (Bid-CC) algorithm. Secondly, different feature fusion and resampling strategies are used to reduce the impact of imbalance in apoptosis protein datasets. Finally, the resulting feature vectors are used to train a Support Vector Machine (SVM) classification model, and prediction accuracy is evaluated by jackknife cross-validation tests. The experimental results indicate that, for the same feature vector, resampling remarkably improves many key indicators over the approach without resampling when predicting the localization of apoptosis proteins in the ZD98, ZW225, and CL317 datasets. Additionally, we present new user-friendly local software for readers to apply; the codes and software can be freely accessed at https://github.com/ruanxiaoli/Im-Psepssm.
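The jackknife test mentioned in this abstract is, in this literature, usually leave-one-out cross-validation: each sample is held out in turn, the model is trained on the rest, and the held-out sample is predicted. The sketch below illustrates only that evaluation loop. The function names are hypothetical, and a 1-nearest-neighbour rule stands in for the SVM as an assumption made to keep the example dependency-free; it is not the classifier used in the paper.

```python
def jackknife_accuracy(samples, labels, fit_predict):
    """Leave-one-out (jackknife) accuracy.

    fit_predict(train_x, train_y, query) must return a predicted label
    for `query` using only the training data it is given.
    """
    correct = 0
    for i in range(len(samples)):
        # Hold out sample i; train on everything else.
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if fit_predict(train_x, train_y, samples[i]) == labels[i]:
            correct += 1
    return correct / len(samples)


def nearest_neighbour(train_x, train_y, query):
    """Stand-in classifier: label of the closest training point
    (squared Euclidean distance). Not the SVM from the paper."""
    best = min(
        range(len(train_x)),
        key=lambda j: sum((a - b) ** 2 for a, b in zip(train_x[j], query)),
    )
    return train_y[best]


if __name__ == "__main__":
    # Toy data, made up for illustration.
    xs = [[0.0], [0.1], [1.0], [1.1]]
    ys = ["a", "a", "b", "b"]
    print(jackknife_accuracy(xs, ys, nearest_neighbour))
```

When resampling is used to balance classes, it would normally be applied inside the loop to the training portion only, so that the held-out sample never influences the synthetic data; the abstract does not say how the authors handled this.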
|
41
|
Zhao Y, Wang J, Chen J, Zhang X, Guo M, Yu G. A Literature Review of Gene Function Prediction by Modeling Gene Ontology. Front Genet 2020; 11:400. [PMID: 32391061 PMCID: PMC7193026 DOI: 10.3389/fgene.2020.00400] [Citation(s) in RCA: 48] [Impact Index Per Article: 9.6] [Received: 01/02/2020] [Accepted: 03/30/2020] [Indexed: 12/14/2022] Open
Abstract
Annotating the functional properties of gene products, i.e., RNAs and proteins, is a fundamental task in biology. The Gene Ontology database (GO) was developed to systematically describe the functional properties of gene products across species, and to facilitate the computational prediction of gene function. As GO is routinely updated, it serves as the gold standard and main knowledge source in functional genomics. Many gene function prediction methods making use of GO have been proposed. But no literature review has summarized these methods and the possibilities for future efforts from the perspective of GO. To bridge this gap, we review the existing methods with an emphasis on recent solutions. First, we introduce the conventions of GO and the widely adopted evaluation metrics for gene function prediction. Next, we summarize current methods of gene function prediction that apply GO in different ways, such as using hierarchical or flat inter-relationships between GO terms, compressing massive GO terms and quantifying semantic similarities. Although many efforts have improved performance by harnessing GO, we conclude that there remain many largely overlooked but important topics for future research.
Affiliation(s)
- Yingwen Zhao
  - College of Computer and Information Science, Southwest University, Chongqing, China
- Jun Wang
  - College of Computer and Information Science, Southwest University, Chongqing, China
- Jian Chen
  - State Key Laboratory of Agrobiotechnology and National Maize Improvement Center, China Agricultural University, Beijing, China
- Xiangliang Zhang
  - CBRC, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
- Maozu Guo
  - School of Electrical and Information Engineering, Beijing University of Civil Engineering and Architecture, Beijing, China
- Guoxian Yu
  - College of Computer and Information Science, Southwest University, Chongqing, China
  - CBRC, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
|
42
|
The Order-Disorder Continuum: Linking Predictions of Protein Structure and Disorder through Molecular Simulation. Sci Rep 2020; 10:2068. [PMID: 32034199 PMCID: PMC7005769 DOI: 10.1038/s41598-020-58868-w] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.0] [Received: 07/24/2019] [Accepted: 10/16/2019] [Indexed: 12/11/2022] Open
Abstract
Intrinsically disordered proteins (IDPs) and intrinsically disordered regions within proteins (IDRs) serve an increasingly expansive list of biological functions, including regulation of transcription and translation, protein phosphorylation, cellular signal transduction, as well as mechanical roles. The strong link between protein function and disorder motivates a deeper fundamental characterization of IDPs and IDRs for discovering new functions and relevant mechanisms. We review recent advances in experimental techniques that have improved identification of disordered regions in proteins. Yet, experimentally curated disorder information still does not currently scale to the level of experimentally determined structural information in folded protein databases, and disorder predictors rely on several different binary definitions of disorder. To link secondary structure prediction algorithms developed for folded proteins and protein disorder predictors, we conduct molecular dynamics simulations on representative proteins from the Protein Data Bank, comparing secondary structure and disorder predictions with simulation results. We find that structure predictor performance from neural networks can be leveraged for the identification of highly dynamic regions within molecules, linked to disorder. Low accuracy structure predictions suggest a lack of static structure for regions that disorder predictors fail to identify. While disorder databases continue to expand, secondary structure predictors and molecular simulations can improve disorder predictor performance, which aids discovery of novel functions of IDPs and IDRs. These observations provide a platform for the development of new, integrated structural databases and fusion of prediction tools toward protein disorder characterization in health and disease.
|
43
|
Singh LK, Khanna M, Garg H. Multimodal Biometric Based on Fusion of Ridge Features with Minutiae Features and Face Features. INTERNATIONAL JOURNAL OF INFORMATION SYSTEM MODELING AND DESIGN 2020. [DOI: 10.4018/ijismd.2020010103] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Indexed: 11/09/2022]
Abstract
Multimodal biometrics refers to the combined use of two or more biometric modalities for identification by a system. Fingerprint, face, retina, iris, hand geometry, DNA, and palm print are physiological traits, while voice, signature, keystrokes, and gait are behavioural traits used for identification by a system. Single biometric features such as faces, fingerprints, irises, and retinas deteriorate or change with time, environment, user mode, physiological defects, and circumstance; therefore, integrating multiple biometric traits increases the robustness of the system. The proposed multimodal biometric system performs recognition based on the face and fingerprint physiological traits. The proposed system increases efficiency and accuracy and decreases execution time compared with existing systems. The performance of the proposed method is reported in terms of parameters such as False Rejection Rate (FRR), False Acceptance Rate (FAR), and Equal Error Rate (EER), and accuracy is reported at 95.389%.
Affiliation(s)
- Munish Khanna
  - Hindustan College of Science and Technology, Mathura, India
|
44
|
Nithya V. SubmitoLoc: Identification of mitochondrial sub cellular locations of proteins using support vector machine. Bioinformation 2019; 15:863-868. [PMID: 32256006 PMCID: PMC7088428 DOI: 10.6026/97320630015863] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 12/28/2019] [Revised: 12/31/2019] [Accepted: 12/31/2019] [Indexed: 11/23/2022] Open
Abstract
Mitochondria are important sub-cellular organelles in eukaryotes. Defects in the mitochondrial system lead to a variety of diseases. Therefore, detailed knowledge of the mitochondrial proteome is vital to understanding the mitochondrial system and its function. Sequence databases contain a large number of mitochondrial proteins, but most are not annotated. In this study, we developed a support vector machine approach, SubmitoLoc, to predict mitochondrial sub-cellular locations of proteins based on various sequence-derived properties. We evaluated the predictor using 10-fold cross-validation. Our method achieved 88.56% accuracy using all features. Average sensitivity and specificity for four-subclass prediction are 85.37% and 87.25%, respectively. The high prediction accuracy suggests that SubmitoLoc will be useful for researchers studying mitochondrial biology and drug discovery.
Affiliation(s)
- Varadharaju Nithya
  - Department of Animal Health Management, Alagappa University, Karaikudi-630003, India
|
45
|
Sun S, Wang C, Ding H, Zou Q. Machine learning and its applications in plant molecular studies. Brief Funct Genomics 2019; 19:40-48. [DOI: 10.1093/bfgp/elz036] [Citation(s) in RCA: 27] [Impact Index Per Article: 4.5] [Received: 06/12/2019] [Revised: 09/06/2019] [Accepted: 09/15/2019] [Indexed: 01/16/2023] Open
Abstract
The advent of high-throughput genomic technologies has resulted in the accumulation of massive amounts of genomic information. However, biologists are challenged with how to effectively analyze these data. Machine learning can provide tools for better and more efficient data analysis. Unfortunately, because many plant biologists are unfamiliar with machine learning, its application in plant molecular studies has been restricted to a few species and a limited set of algorithms. Thus, in this study, we provide the basic steps for developing machine learning frameworks and present a comprehensive overview of machine learning algorithms and various evaluation metrics. Furthermore, we introduce sources of important curated plant genomic data and R packages to enable plant biologists to easily and quickly apply appropriate machine learning algorithms in their research. Finally, we discuss current applications of machine learning algorithms for identifying various genes related to resistance to biotic and abiotic stress. Broad application of machine learning and the accumulation of plant sequencing data will advance plant molecular studies.
Affiliation(s)
- Shanwen Sun
  - University of Bayreuth in Germany. He is now a postdoctoral fellow at the Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China
- Chunyu Wang
  - Harbin Institute of Technology in China. He is an associate professor in the School of Computer Science and Technology, Harbin Institute of Technology
- Hui Ding
  - Inner Mongolia University in China. She is an associate professor in the Center for Informational Biology, University of Electronic Science and Technology of China
- Quan Zou
  - Harbin Institute of Technology in China. He is a professor in the Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China
|
46
|
Yu D, Xu Z, Wang X. Bibliometric analysis of support vector machines research trend: a case study in China. INT J MACH LEARN CYB 2019. [DOI: 10.1007/s13042-019-01028-y] [Citation(s) in RCA: 33] [Impact Index Per Article: 5.5] [Indexed: 10/25/2022]
|
47
|
SDBP-Pred: Prediction of single-stranded and double-stranded DNA-binding proteins by extending consensus sequence and K-segmentation strategies into PSSM. Anal Biochem 2019; 589:113494. [PMID: 31693872 DOI: 10.1016/j.ab.2019.113494] [Citation(s) in RCA: 27] [Impact Index Per Article: 4.5] [Received: 01/18/2019] [Revised: 10/24/2019] [Accepted: 10/31/2019] [Indexed: 11/24/2022]
Abstract
Identification of DNA-binding proteins (DNA-BPs) is a hot issue in protein science due to their key role in various biological processes. These processes depend strongly on the type of DNA-binding protein. DNA-BPs are classified into single-stranded DNA-binding proteins (SSBs) and double-stranded DNA-binding proteins (DSBs). SSBs are mainly involved in DNA recombination, replication, and repair, while DSBs regulate transcription, DNA cleavage, and chromosome packaging. Despite this significance, few methods have been proposed for discriminating SSBs from DSBs. Therefore, more predictors with favorable performance are indispensable. In this work, we present an innovative predictor, called SDBP-Pred, with a novel feature descriptor named consensus sequence-based K-segmentation position-specific scoring matrix (CSKS-PSSM). We encoded the local discriminative features concealed in the PSSM via a K-segmentation strategy and the global potential features by applying the notion of the consensus sequence. The resulting feature vector is then input to a support vector machine (SVM) with linear, polynomial, and radial basis function (RBF) kernels. Our model with SVM-RBF achieved higher accuracies than the most recent method on three tests, namely the jackknife, 10-fold, and independent tests. The prediction results illustrate the superior performance of SDBP-Pred over existing studies in the literature so far.
|
48
|
Identification of a Ribosomal Protein RpsB as a Surface-Exposed Protein and Adhesin of Rickettsia heilongjiangensis. BIOMED RESEARCH INTERNATIONAL 2019; 2019:9297129. [PMID: 31360728 PMCID: PMC6652061 DOI: 10.1155/2019/9297129] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 04/08/2019] [Revised: 06/18/2019] [Accepted: 06/19/2019] [Indexed: 11/26/2022]
Abstract
Rickettsia heilongjiangensis is an obligate intracellular bacterium that is responsible for far-eastern spotted fever. Surface-exposed proteins (SEPs) play important roles in its pathogenesis. Previous work identified the ribosomal protein RpsB as an SEP (by biotin-avidin affinity), a seroreactive antigen, and a diagnostic candidate protein, indicating that it might play an important role in the pathogenesis of rickettsiae. However, in the absence of other evidence, its surface-exposed subcellular location was puzzling because ribosomal proteins are located in the cytoplasm. In the present study, the subcellular location of RpsB was analyzed with bioinformatics tools coupled with immunoelectron microscopy. The adhesion ability of RpsB was evaluated by protein microarray and cellular ELISA. Different bioinformatics tools gave different location prediction results. RpsB was found in the cytoplasm and the inner and outer membranes of R. heilongjiangensis by transmission electron microscopy. Protein microarray and cellular ELISA showed that RpsB binds to the host cell surface and that its adhesion ability was even stronger than that of the known adhesin Adr1. In conclusion, RpsB was visually and directly shown for the first time to be an SEP of rickettsiae and might be an important ligand and adhesin of rickettsiae. Its roles in pathogenesis warrant further study.
|
49
|
Yao Y, Li M, Xu H, Yan S, He P, Dai Q, Qi Z, Liao B. Protein Subcellular Localization Prediction based on PSI-BLAST Profile and Principal Component Analysis. CURR PROTEOMICS 2019. [DOI: 10.2174/1570164616666190126155744] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 11/22/2022]
Abstract
Background:
Prediction of protein subcellular location is a meaningful task that has attracted much attention in recent years. In particular, the number of new protein sequences yielded by high-throughput sequencing technology in the post-genomic era has increased explosively.
Objective:
Protein subcellular localization prediction based solely on sequence data remains a challenging problem in computational biology.
Methods:
In this paper, three sets of evolutionary features are derived from the position-specific scoring matrix, which has shown great potential in other bioinformatics problems. A fusion model is built using the optimal parameter combination. Finally, principal component analysis and a support vector machine classifier are applied to predict protein subcellular localization on the NNPSL dataset and the Cell-PLoc 2.0 dataset.
Results:
Our experimental results show that the proposed method remarkably improves the prediction accuracy, and that features derived from the PSI-BLAST profile alone are appropriate for protein subcellular localization prediction.
Affiliation(s)
- Yuhua Yao
  - School of Mathematics and Statistics, Hainan Normal University, Haikou 571158, China
- Manzhi Li
  - School of Mathematics and Statistics, Hainan Normal University, Haikou 571158, China
- Huimin Xu
  - College of Life Sciences, Zhejiang Sci-Tech University, Hangzhou 310018, China
- Shoujiang Yan
  - College of Life Sciences, Zhejiang Sci-Tech University, Hangzhou 310018, China
- Pingan He
  - College of Life Sciences, Zhejiang Sci-Tech University, Hangzhou 310018, China
- Qi Dai
  - College of Life Sciences, Zhejiang Sci-Tech University, Hangzhou 310018, China
- Zhaohui Qi
  - College of Information Science and Technology, Shijiazhuang Tiedao University, Shijiazhuang 050043, China
- Bo Liao
  - School of Mathematics and Statistics, Hainan Normal University, Haikou 571158, China
|
50
|
Man K, Harring JR, Sinharay S. Use of Data Mining Methods to Detect Test Fraud. JOURNAL OF EDUCATIONAL MEASUREMENT 2019. [DOI: 10.1111/jedm.12208] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.8] [Indexed: 11/30/2022]
|