1
Yan W, Yu F, Tan L, Mengshan L, Xiaojun X, Weihong Z, Sheng S, Jun W, Fu-An W. A hybrid machine learning model with attention mechanism and multidimensional multivariate feature coding for essential gene prediction. BMC Biol 2025; 23:108. [PMID: 40275343 PMCID: PMC12023577 DOI: 10.1186/s12915-025-02209-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/03/2024] [Accepted: 04/07/2025] [Indexed: 04/26/2025] Open
Abstract
BACKGROUND: Essential genes are crucial for the development, inheritance, and survival of species. The exploration of these genes can unravel complex mechanisms and fundamental life processes and identify potential therapeutic targets for various diseases. Therefore, the identification of essential genes is significant. Machine learning has become the mainstream approach for essential gene prediction. However, some key challenges in machine learning need to be addressed, such as the extraction of genetic features, the impact of imbalanced data, and cross-species generalization ability. RESULTS: Here, we proposed a hybrid machine learning model based on graph convolutional neural networks (GCN) and bi-directional long short-term memory (Bi-LSTM) with an attention mechanism and multidimensional multivariate feature coding for essential gene prediction, called EGP Hybrid-ML. In the model, GCN was used to extract feature encoding information from the visualized graphics of gene sequences, and the attention mechanism was combined with Bi-LSTM to assess the importance of each feature in gene sequences and to analyze the influences of different feature encoding methods and data imbalance. Additionally, the cross-species predictive performance of the model was evaluated through cross-validation. The results indicated that the sensitivity of the EGP Hybrid-ML model reached 0.9122. CONCLUSIONS: This model demonstrated superior predictive performance and strong generalization capabilities compared to other models. The EGP Hybrid-ML model proposed in this paper has broad application prospects in bioinformatics, chemical information, and pharmaceutical information. The codes, architectures, parameters, and datasets of the proposed model are available free of charge on GitHub (https://github.com/gnnumsli/EGP-Hybrid-ML).
Affiliation(s)
- Wu Yan
  - Gannan Normal University, Ganzhou, Jiangxi, 341000, China
  - Jiangsu University of Science and Technology, Zhenjiang, Jiangsu, 212018, China
  - Sericultural Research Institute, Chinese Academy of Agricultural Sciences, Zhenjiang, Jiangsu, 212018, China
- Fu Yu
  - Ganzhou Power Supply Branch of State Grid, Jiangxi Electric Power Co., Ltd, Ganzhou, Jiangxi, 341000, China
- Li Tan
  - Gannan Normal University, Ganzhou, Jiangxi, 341000, China
- Li Mengshan
  - Gannan Normal University, Ganzhou, Jiangxi, 341000, China
  - Ganzhou Power Supply Branch of State Grid, Jiangxi Electric Power Co., Ltd, Ganzhou, Jiangxi, 341000, China
- Xie Xiaojun
  - Gannan Normal University, Ganzhou, Jiangxi, 341000, China
- Zhou Weihong
  - Jiangsu University of Science and Technology, Zhenjiang, Jiangsu, 212018, China
  - Sericultural Research Institute, Chinese Academy of Agricultural Sciences, Zhenjiang, Jiangsu, 212018, China
- Sheng Sheng
  - Jiangsu University of Science and Technology, Zhenjiang, Jiangsu, 212018, China
  - Sericultural Research Institute, Chinese Academy of Agricultural Sciences, Zhenjiang, Jiangsu, 212018, China
- Wang Jun
  - Jiangsu University of Science and Technology, Zhenjiang, Jiangsu, 212018, China
  - Sericultural Research Institute, Chinese Academy of Agricultural Sciences, Zhenjiang, Jiangsu, 212018, China
- Wu Fu-An
  - Jiangsu University of Science and Technology, Zhenjiang, Jiangsu, 212018, China
  - Sericultural Research Institute, Chinese Academy of Agricultural Sciences, Zhenjiang, Jiangsu, 212018, China
2
Campos TL, Korhonen PK, Young ND, Wang T, Song J, Marhoefer R, Chang BCH, Selzer PM, Gasser RB. Inference of Essential Genes of the Parasite Haemonchus contortus via Machine Learning. Int J Mol Sci 2024; 25:7015. [PMID: 39000124 PMCID: PMC11240989 DOI: 10.3390/ijms25137015] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/28/2024] [Revised: 06/19/2024] [Accepted: 06/21/2024] [Indexed: 07/16/2024] Open
Abstract
Over the years, comprehensive explorations of the model organisms Caenorhabditis elegans (elegant worm) and Drosophila melanogaster (vinegar fly) have contributed substantially to our understanding of complex biological processes and pathways in multicellular organisms generally. Extensive functional genomic-phenomic, genomic, transcriptomic, and proteomic data sets have enabled the discovery and characterisation of genes that are crucial for life, called 'essential genes'. Recently, we investigated the feasibility of inferring essential genes from such data sets using advanced bioinformatics and showed that a machine learning (ML)-based workflow could be used to extract or engineer features from DNA, RNA, protein, and/or cellular data/information to underpin the reliable prediction of essential genes both within and between C. elegans and D. melanogaster. As these are two distantly related species within the Ecdysozoa, we proposed that this ML approach would be particularly well suited for species that are within the same phylum or evolutionary clade. In the present study, we cross-predicted essential genes within the phylum Nematoda (evolutionary clade V), between C. elegans and the pathogenic parasitic nematode H. contortus, and then ranked and prioritised H. contortus proteins encoded by these genes as intervention (e.g., drug) target candidates. Using strong, validated predictors, we inferred essential genes of H. contortus that are involved predominantly in crucial biological processes/pathways including ribosome biogenesis, translation, RNA binding/processing, and signalling and which are highly transcribed in the germline, somatic gonad precursors, sex myoblasts, vulva cell precursors, various nerve cells, glia, or hypodermis. The findings indicate that this in silico workflow provides a promising avenue to identify and prioritise panels/groups of drug target candidates in parasitic nematodes for experimental validation in vitro and/or in vivo.
Affiliation(s)
- Túlio L Campos
  - Department of Biosciences, Melbourne Veterinary School, Faculty of Science, The University of Melbourne, Parkville, VIC 3010, Australia
  - Bioinformatics Core Facility, Aggeu Magalhães Institute (Fiocruz), Recife 50740-465, PE, Brazil
- Pasi K Korhonen
  - Department of Biosciences, Melbourne Veterinary School, Faculty of Science, The University of Melbourne, Parkville, VIC 3010, Australia
- Neil D Young
  - Department of Biosciences, Melbourne Veterinary School, Faculty of Science, The University of Melbourne, Parkville, VIC 3010, Australia
- Tao Wang
  - Department of Biosciences, Melbourne Veterinary School, Faculty of Science, The University of Melbourne, Parkville, VIC 3010, Australia
- Jiangning Song
  - Department of Data Science and AI, Faculty of IT, Monash University, Melbourne, VIC 3800, Australia
  - Biomedicine Discovery Institute, Department of Biochemistry and Molecular Biology, Monash University, Clayton, VIC 3800, Australia
  - Monash Data Futures Institute, Monash University, Clayton, VIC 3800, Australia
- Richard Marhoefer
  - Boehringer Ingelheim Animal Health, Binger Strasse 173, 55216 Ingelheim am Rhein, Germany
- Bill C H Chang
  - Department of Biosciences, Melbourne Veterinary School, Faculty of Science, The University of Melbourne, Parkville, VIC 3010, Australia
- Paul M Selzer
  - Boehringer Ingelheim Animal Health, Binger Strasse 173, 55216 Ingelheim am Rhein, Germany
- Robin B Gasser
  - Department of Biosciences, Melbourne Veterinary School, Faculty of Science, The University of Melbourne, Parkville, VIC 3010, Australia
3
Chen XR, Cui YZ, Li BZ, Yuan YJ. Genome engineering on size reduction and complexity simplification: A review. J Adv Res 2024; 60:159-171. [PMID: 37442424 PMCID: PMC11156615 DOI: 10.1016/j.jare.2023.07.006] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Received: 04/14/2023] [Revised: 06/25/2023] [Accepted: 07/10/2023] [Indexed: 07/15/2023] Open
Abstract
BACKGROUND: Genome simplification is an important topic in the field of life sciences that has attracted attention from its conception to the present day. It can help uncover the essential components of the genome and, in turn, shed light on the underlying operating principles of complex biological systems. This has made it a central focus of both basic and applied research in the life sciences. With the recent advancements in related technologies and our increasing knowledge of the genome, now is an opportune time to delve into this topic. AIM OF REVIEW: Our review investigates the progress of genome simplification from two perspectives: genome size reduction and complexity simplification. In addition, we provide insights into the future development trends of genome simplification. KEY SCIENTIFIC CONCEPTS OF REVIEW: Reducing genome size requires eliminating non-essential elements as much as possible. This process has been facilitated by advances in genome manipulation and synthesis techniques. However, we still need a better and clearer understanding of living systems to reduce genome complexity. As there is a lack of quantitative and clearly defined standards for this task, we have opted to approach the topic from various perspectives and present our findings accordingly.
Affiliation(s)
- Xiang-Rong Chen
  - Frontiers Science Center for Synthetic Biology and Key Laboratory of Systems Bioengineering (Ministry of Education), School of Chemical Engineering and Technology, Tianjin University, Tianjin, China; Frontiers Research Institute for Synthetic Biology, Tianjin University, Tianjin, China
- You-Zhi Cui
  - Frontiers Science Center for Synthetic Biology and Key Laboratory of Systems Bioengineering (Ministry of Education), School of Chemical Engineering and Technology, Tianjin University, Tianjin, China; Frontiers Research Institute for Synthetic Biology, Tianjin University, Tianjin, China
- Bing-Zhi Li
  - Frontiers Science Center for Synthetic Biology and Key Laboratory of Systems Bioengineering (Ministry of Education), School of Chemical Engineering and Technology, Tianjin University, Tianjin, China; Frontiers Research Institute for Synthetic Biology, Tianjin University, Tianjin, China
- Ying-Jin Yuan
  - Frontiers Science Center for Synthetic Biology and Key Laboratory of Systems Bioengineering (Ministry of Education), School of Chemical Engineering and Technology, Tianjin University, Tianjin, China; Frontiers Research Institute for Synthetic Biology, Tianjin University, Tianjin, China
4
Hu W, Li M, Xiao H, Guan L. Essential genes identification model based on sequence feature map and graph convolutional neural network. BMC Genomics 2024; 25:47. [PMID: 38200437 PMCID: PMC10777564 DOI: 10.1186/s12864-024-09958-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/18/2023] [Accepted: 01/01/2024] [Indexed: 01/12/2024] Open
Abstract
BACKGROUND: Essential genes encode functions that play a vital role in the life activities of organisms, encompassing growth, development, immune system functioning, and cell structure maintenance. Conventional experimental techniques for identifying essential genes are resource-intensive and time-consuming, and the accuracy of current machine learning models needs further enhancement. Therefore, it is crucial to develop a robust computational model to accurately predict essential genes. RESULTS: In this study, we introduce GCNN-SFM, a computational model for identifying essential genes in organisms, based on graph convolutional neural networks (GCNN). GCNN-SFM integrates a graph convolutional layer, a convolutional layer, and a fully connected layer to model and extract features from the sequences of essential genes. Initially, the gene sequence is transformed into a feature map using coding techniques. Subsequently, a multi-layer GCN is employed to perform graph convolution operations, effectively capturing both local and global features of the gene sequence. Further feature extraction is performed, followed by integrating convolution and fully connected layers to generate prediction results for essential genes. Gradient descent is used to iteratively minimize the cross-entropy loss, thereby enhancing the accuracy of the prediction results. Meanwhile, model parameters are tuned to determine the combination that yields the best prediction performance during training. CONCLUSIONS: Experimental evaluation demonstrates that GCNN-SFM surpasses various advanced essential gene prediction models and achieves an average accuracy of 94.53%. This study presents a novel and effective approach for identifying essential genes, which has significant implications for biology and genomics research.
Affiliation(s)
- Wenxing Hu
  - College of Physics and Electronic Information, Gannan Normal University, Ganzhou, Jiangxi, 341000, China
- Mengshan Li
  - College of Physics and Electronic Information, Gannan Normal University, Ganzhou, Jiangxi, 341000, China
- Haiyang Xiao
  - College of Physics and Electronic Information, Gannan Normal University, Ganzhou, Jiangxi, 341000, China
- Lixin Guan
  - College of Physics and Electronic Information, Gannan Normal University, Ganzhou, Jiangxi, 341000, China
5
Wang W, Meng X, Xiang J, Shuai Y, Bedru HD, Li M. CACO: A Core-Attachment Method With Cross-Species Functional Ortholog Information to Detect Human Protein Complexes. IEEE J Biomed Health Inform 2023; 27:4569-4578. [PMID: 37399160 DOI: 10.1109/jbhi.2023.3289490] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 07/05/2023]
Abstract
Protein complexes play an essential role in living cells. Detecting protein complexes is crucial for understanding protein functions and treating complex diseases. Because experimental approaches are time- and resource-intensive, many computational approaches have been proposed to detect protein complexes. However, most of them are based only on protein-protein interaction (PPI) networks and suffer heavily from the noise in those networks. Therefore, we propose a novel core-attachment method, named CACO, to detect human protein complexes by integrating functional information from other species via protein ortholog relations. First, CACO constructs a cross-species ortholog relation matrix and transfers GO terms from other species as a reference to evaluate the confidence of PPIs. Then, a PPI filter strategy is adopted to clean the PPI network, and a weighted clean PPI network is constructed. Finally, a new effective core-attachment algorithm is proposed to detect protein complexes from the weighted PPI network. Compared with thirteen other state-of-the-art methods, CACO outperforms all of them in terms of F-measure and Composite Score, showing that integrating ortholog information and the proposed core-attachment algorithm are effective in detecting protein complexes.
6
Rout RK, Umer S, Khandelwal M, Pati S, Mallik S, Balabantaray BK, Qin H. Identification of discriminant features from stationary pattern of nucleotide bases and their application to essential gene classification. Front Genet 2023; 14:1154120. [PMID: 37152988 PMCID: PMC10156977 DOI: 10.3389/fgene.2023.1154120] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 01/30/2023] [Accepted: 04/04/2023] [Indexed: 05/09/2023] Open
Abstract
Introduction: Essential genes are required for the survival of various species. These genes form a family linked to cellular activities that are critical for species survival. They code for proteins that regulate central metabolism, gene translation, deoxyribonucleic acid replication, and fundamental cellular structure and that facilitate intracellular and extracellular transport. Essential genes preserve crucial genomic information that may hold the key to a detailed knowledge of life and evolution. Because of this relevance, essential gene studies have long been regarded as a vital topic in computational biology. An essential gene is composed of adenine, guanine, cytosine, and thymine and their various combinations. Methods: This paper presents a novel method of extracting information on the stationary patterns of nucleotides such as adenine, guanine, cytosine, and thymine in each gene. For this purpose, co-occurrence matrices are derived that provide the statistical distribution of stationary patterns of nucleotides in the genes, which helps establish the relationship between the nucleotides. To extract discriminant features, energy, entropy, homogeneity, contrast, and dissimilarity are computed from each co-occurrence matrix, and the features from all matrices are concatenated to form a feature vector representing each essential gene. Finally, supervised machine learning algorithms are applied for essential gene classification based on the extracted fixed-dimensional feature vectors. Results: For comparison, some existing state-of-the-art feature representation techniques such as Shannon entropy (SE), Hurst exponent (HE), fractal dimension (FD), and their combinations have been utilized. Discussion: Extensive experiments on classifying the essential genes of five species show the robustness and effectiveness of the proposed methodology.
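The feature-extraction step described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that a "stationary pattern" can be approximated by counting base pairs at a fixed offset, that the bases are indexed in the order A, C, G, T, and that the five features follow the usual texture-analysis (GLCM-style) definitions; the function names `cooccurrence`, `texture_features`, and `feature_vector` are hypothetical.

```python
import math

BASES = "ACGT"  # assumed index order; the abstract does not specify one


def cooccurrence(seq, offset=1):
    """Normalized 4x4 co-occurrence matrix of bases `offset` positions apart."""
    idx = {b: i for i, b in enumerate(BASES)}
    counts = [[0] * 4 for _ in range(4)]
    total = 0
    for a, b in zip(seq, seq[offset:]):
        if a in idx and b in idx:  # skip ambiguous symbols such as N
            counts[idx[a]][idx[b]] += 1
            total += 1
    if total == 0:
        raise ValueError("sequence too short or contains no valid base pairs")
    return [[c / total for c in row] for row in counts]


def texture_features(p):
    """Five GLCM-style statistics of a normalized matrix (assumed definitions)."""
    cells = [(i, j, p[i][j]) for i in range(4) for j in range(4)]
    return {
        "energy": sum(v * v for _, _, v in cells),
        "entropy": -sum(v * math.log2(v) for _, _, v in cells if v > 0),
        "homogeneity": sum(v / (1 + (i - j) ** 2) for i, j, v in cells),
        "contrast": sum(v * (i - j) ** 2 for i, j, v in cells),
        "dissimilarity": sum(v * abs(i - j) for i, j, v in cells),
    }


def feature_vector(seq, offsets=(1, 2, 3)):
    """Concatenate the five features over several offsets into one vector."""
    vec = []
    for d in offsets:
        feats = texture_features(cooccurrence(seq.upper(), d))
        vec.extend(feats[k] for k in sorted(feats))  # fixed feature order
    return vec
```

With three offsets, `feature_vector` returns a 15-dimensional vector per gene, which could then be passed to any supervised classifier. The choice of offsets is an illustrative assumption, not a value taken from the paper.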
Affiliation(s)
- Ranjeet Kumar Rout
  - National Institute of Technology Srinagar, Hazratbal, Jammu and Kashmir, India
- Saiyed Umer
  - Aliah University, Kolkata, West Bengal, India
- Monika Khandelwal
  - National Institute of Technology Srinagar, Hazratbal, Jammu and Kashmir, India
- Smitarani Pati
  - Dr. B R Ambedkar National Institute of Technology Jalandhar, Jalandhar, Punjab, India
- Saurav Mallik
  - Harvard T H Chan School of Public Health, Boston, United States
  - Department of Pharmacology and Toxicology, University of Arizona, Tucson, AZ, United States
  - *Correspondence: Saurav Mallik; Hong Qin
- Hong Qin
  - Department of Computer Science and Engineering, University of Tennessee at Chattanooga, Chattanooga, TN, United States