1
Sirocchi C, Urschler M, Pfeifer B. Feature graphs for interpretable unsupervised tree ensembles: centrality, interaction, and application in disease subtyping. BioData Min 2025; 18:15. [PMID: 39955586 PMCID: PMC11829558 DOI: 10.1186/s13040-025-00430-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/14/2024] [Accepted: 02/05/2025] [Indexed: 02/17/2025] Open
Abstract
Explainable and interpretable machine learning has emerged as essential in leveraging artificial intelligence within high-stakes domains such as healthcare to ensure transparency and trustworthiness. Feature importance analysis plays a crucial role in improving model interpretability by pinpointing the most relevant input features, particularly in disease subtyping applications, aimed at stratifying patients based on a small set of signature genes and biomarkers. While clustering methods, including unsupervised random forests, have demonstrated good performance, approaches for evaluating feature contributions in an unsupervised regime are notably scarce. To address this gap, we introduce a novel methodology to enhance the interpretability of unsupervised random forests by elucidating feature contributions through the construction of feature graphs, both over the entire dataset and individual clusters, that leverage parent-child node splits within the trees. Feature selection strategies to derive effective feature combinations from these graphs are presented and extensively evaluated on synthetic and benchmark datasets against state-of-the-art methods, standing out for performance, computational efficiency, reliability, versatility and ability to provide cluster-specific insights. In a disease subtyping application, clustering kidney cancer gene expression data over a feature subset selected with our approach reveals three patient groups with different survival outcomes. Cluster-specific analysis identifies distinctive feature contributions and interactions, essential for devising targeted interventions, conducting personalised risk assessments, and enhancing our understanding of the underlying molecular complexities.
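The idea of building a feature graph from parent-child splits can be illustrated with a minimal sketch. The nested-dict tree encoding, the function names, and the plain co-occurrence counting used as edge weight are all assumptions made for illustration; the abstract does not specify how the authors weight edges or compute centrality.

```python
from collections import Counter

# Hypothetical tree encoding (an assumption for illustration): each internal
# node is a dict {"feature": name, "children": [subtrees]}; leaves are None.
def parent_child_edges(node):
    """Yield (parent_feature, child_feature) for every split-to-split link."""
    if node is None:
        return
    for child in node["children"]:
        if child is not None:
            yield (node["feature"], child["feature"])
            yield from parent_child_edges(child)

def build_feature_graph(forest):
    """Count how often one feature is split directly above another."""
    edges = Counter()
    for tree in forest:
        edges.update(parent_child_edges(tree))
    return edges

def out_strength(edges):
    """Simple centrality proxy: total weight of edges leaving each feature."""
    strength = Counter()
    for (parent, _child), weight in edges.items():
        strength[parent] += weight
    return strength

# Toy forest of two trees over hypothetical gene features g1, g2, g3.
forest = [
    {"feature": "g1", "children": [
        {"feature": "g2", "children": [None, None]},
        {"feature": "g3", "children": [None, None]},
    ]},
    {"feature": "g1", "children": [
        {"feature": "g2", "children": [None, None]},
        None,
    ]},
]
graph = build_feature_graph(forest)
centrality = out_strength(graph)
```

In this toy forest the edge g1 → g2 appears in both trees, so it carries the largest weight. Restricting the forest to the trees or samples of one cluster would give a cluster-specific graph in the same way.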
Affiliation(s)
- Christel Sirocchi
  - Department of Pure and Applied Sciences, University of Urbino, Urbino, 61029, Italy
  - Biomedical Network Science Lab, Department Artificial Intelligence in Biomedical Engineering, Friedrich-Alexander Universität Erlangen-Nürnberg, Erlangen, 91052, Germany
- Martin Urschler
  - Institute for Medical Informatics, Statistics and Documentation, Medical University of Graz, Graz, 8036, Austria
- Bastian Pfeifer
  - Institute for Medical Informatics, Statistics and Documentation, Medical University of Graz, Graz, 8036, Austria
2
Zeng Y, Zhang Y, Xiao Z, Sui H. A multi-classification deep neural network for cancer type identification from high-dimension, small-sample and imbalanced gene microarray data. Sci Rep 2025; 15:5239. [PMID: 39939378 PMCID: PMC11822135 DOI: 10.1038/s41598-025-89475-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/22/2024] [Accepted: 02/05/2025] [Indexed: 02/14/2025] Open
Abstract
Gene microarray technology provides an efficient way to diagnose cancer. However, microarray gene expression data face the challenges of high dimensionality, small sample size, and multi-class imbalance. The coupling of these challenges leads to inaccurate results when using traditional feature selection and classification algorithms. Owing to their fast learning speed and good classification performance, deep neural networks such as the generative adversarial network have proven to be among the best classification algorithms, especially in the bioinformatics domain. However, they are limited to binary applications and are inefficient at processing high-dimensional sparse features. This paper proposes a multi-classification generative adversarial network model combined with feature bundling (MGAN-FB) to handle the coupling of high dimensionality, small sample size, and multi-class imbalance for gene microarray data classification at both the feature and algorithmic levels. At the feature level, a deep encoder structure combining a feature bundling (FB) mechanism and a squeeze-and-excite (SE) mechanism is designed for the generator, so that the sparsity, correlation and consequence of high-dimensional features are all taken into consideration for adaptive feature extraction. It achieves effective dimensionality reduction without transitional information loss. At the algorithmic level, a softmax module coupled with a multi-classifier is introduced into the discriminator, and a new objective function is designed for the proposed MGAN-FB model, considering encode loss, reconstruction loss, discrimination loss and multi-classification loss. We extend the generative adversarial framework from binary classification to the multi-classification field. Experiments on eight open-source gene microarray datasets, covering classification performance, running time and non-parametric tests, demonstrate that the proposed method has clear advantages over the 7 compared methods.
Affiliation(s)
- Yifu Zeng
  - Cyberspace Institute of Advanced Technology, Guangzhou University, Guangzhou, China
  - Department of Information Technology, The Second Affiliated Hospital of Fujian Medical University, Quanzhou, China
- Yixiang Zhang
  - Department of Infectious Diseases, The Second Affiliated Hospital of Fujian Medical University, Quanzhou, China
- Zikai Xiao
  - Cyberspace Institute of Advanced Technology, Guangzhou University, Guangzhou, China
- He Sui
  - College of Aeronautical Engineering, Civil Aviation University of China, Tianjin, 300300, China
  - Information Security Evaluation Center, Civil Aviation University of China, Tianjin, 300300, China
3
Yang P, Qiu H, Yang X, Wang L, Wang X. SAGL: A self-attention-based graph learning framework for predicting survival of colorectal cancer patients. Computer Methods and Programs in Biomedicine 2024; 249:108159. [PMID: 38583291 DOI: 10.1016/j.cmpb.2024.108159] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/25/2023] [Revised: 02/28/2024] [Accepted: 03/29/2024] [Indexed: 04/09/2024]
Abstract
BACKGROUND AND OBJECTIVE Colorectal cancer (CRC) is one of the most commonly diagnosed cancers worldwide. Accurate survival prediction for CRC patients plays a significant role in the formulation of treatment strategies. Recently, machine learning and deep learning approaches have been increasingly applied in cancer survival prediction. However, most existing methods inadequately represent and leverage the dependencies among features and fail to sufficiently mine and utilize the comorbidity patterns of CRC. To address these issues, we propose a self-attention-based graph learning (SAGL) framework to improve postoperative cancer-specific survival prediction for CRC patients. METHODS We present a novel method for constructing a dependency graph (DG) that reflects two types of dependencies: comorbidity-comorbidity dependencies, and dependencies between features related to patient characteristics and cancer treatments. This graph is subsequently refined by a disease comorbidity network, which offers a holistic view of the comorbidity patterns of CRC. A DG-guided self-attention mechanism is proposed to unearth novel dependencies beyond what the DG offers, thus augmenting CRC survival prediction. Finally, a representation is learned for each patient, and these representations are used for survival prediction. RESULTS The experimental results show that SAGL outperforms state-of-the-art methods on a real-world dataset, with the area under the receiver operating characteristic curve for 3- and 5-year survival prediction reaching 0.849±0.002 and 0.895±0.005, respectively. In addition, comparisons with different graph neural network-based variants demonstrate the advantages of our DG-guided self-attention graph learning framework. CONCLUSIONS Our study reveals the potential of DG-guided self-attention in optimizing feature graph learning, which can improve the performance of CRC survival prediction.
Affiliation(s)
- Ping Yang
  - School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, 611731, PR China
- Hang Qiu
  - School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, 611731, PR China; Big Data Research Center, University of Electronic Science and Technology of China, Chengdu, 611731, PR China
- Xulin Yang
  - School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, 611731, PR China
- Liya Wang
  - Big Data Research Center, University of Electronic Science and Technology of China, Chengdu, 611731, PR China
- Xiaodong Wang
  - Department of Gastrointestinal Surgery, West China Hospital, Sichuan University, Chengdu, 610041, PR China
4
Yang B, Wang L, Bao W. Identify Diabetes-related Targets based on ForgeNet_GPC. Curr Comput Aided Drug Des 2024; 20:1042-1054. [PMID: 38173214 DOI: 10.2174/0115734099258183230929173855] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/15/2023] [Revised: 08/06/2023] [Accepted: 08/15/2023] [Indexed: 01/05/2024]
Abstract
BACKGROUND Research on potential therapeutic targets and new mechanisms of action can greatly improve the efficiency of new drug development. AIMS Polygenic genetic diseases, such as diabetes, are caused by the interaction of multiple gene loci and environmental factors. OBJECTIVES In this study, a disease target identification algorithm based on protein recognition is proposed. MATERIALS AND METHODS In this method, the related and unrelated targets are collected from literature databases for treating diabetes. The transcribed proteins corresponding to each target are queried in order to construct a protein dataset. Six protein feature extraction algorithms (AAC, CKSAAGP, DDE, DPC, GAAP, and TPC) are utilized to obtain the feature vectors of each protein, which are merged into the full feature vectors. RESULTS A novel classifier (forgeNet_GPC) based on forgeNet and Gaussian process classifier (GPC) is proposed to classify the proteins. CONCLUSION In forgeNet_GPC, forgeNet is utilized to select the important features, and GPC is utilized to solve the classification problem. The experimental results reveal that forgeNet_GPC performs better than 22 classifiers in terms of ROC-AUC, PR-AUC, MCC, Youden Index, and Kappa.
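Two of the six descriptors named above, AAC (amino acid composition) and DPC (dipeptide composition), are simple enough to sketch, along with the merging step. This is a minimal illustration rather than the authors' pipeline: the function names are hypothetical, and the sketch assumes sequences of at least two residues drawn from the 20 standard amino acids.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def aac(sequence):
    """Amino acid composition: fraction of each residue (20 values)."""
    n = len(sequence)
    return [sequence.count(a) / n for a in AMINO_ACIDS]

def dpc(sequence):
    """Dipeptide composition: fraction of each adjacent pair (400 values)."""
    pairs = [sequence[i:i + 2] for i in range(len(sequence) - 1)]
    total = len(pairs)
    return [pairs.count(a + b) / total
            for a, b in product(AMINO_ACIDS, repeat=2)]

def full_feature_vector(sequence):
    """Concatenate the descriptors into one vector (the merging step)."""
    return aac(sequence) + dpc(sequence)

# Toy sequence, not a real protein.
vec = full_feature_vector("ACDA")
```

Each descriptor sums to 1 for a sequence made only of standard residues, and the merged vector here has 20 + 400 = 420 entries. The remaining descriptors (CKSAAGP, DDE, GAAP, TPC) would be appended to the vector in the same way.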
Affiliation(s)
- Bin Yang
  - School of Information Science and Engineering, Zaozhuang University, Zaozhuang, 277160, China
- Linlin Wang
  - School of Information Science and Engineering, Zaozhuang University, Zaozhuang, 277160, China
- Wenzheng Bao
  - School of Information and Electrical Engineering, Xuzhou University of Technology, Xuzhou, 221018, China
5
Wang H, Yao Z, Luo R, Liu J, Wang Z, Zhang G. LaCOme: Learning the latent convolutional patterns among transcriptomic features to improve classifications. Gene 2023; 862:147246. [PMID: 36736509 DOI: 10.1016/j.gene.2023.147246] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/23/2022] [Revised: 12/22/2022] [Accepted: 01/27/2023] [Indexed: 02/04/2023]
Abstract
OMIC is a novel approach that analyses entire genetic or molecular profiles in humans and other organisms. It involves identifying and quantifying biological molecules that contribute to a species' structure, function, and dynamics. Finding the secrets of OMIC is like deciphering the biochemical code, but building data-driven models to mine the hidden phenotypic trait information has been a research hotspot. Transcriptome analysis is a popular biological technology for characterizing living systems' overall health, including cells and tissues. Individual transcript expression levels are known to be correlated with those of other transcripts. Nevertheless, most computational studies do not fully exploit these inter-feature correlations. Differential expression analyses, for example, assume that the expression levels of the transcripts are independent. Thus, we propose extracting these inter-feature correlations using the convolutional neural network (CNN) and transforming the transcriptomic features into a new space of convolutional transcriptomic (LaCOme) features. On most transcriptomic datasets in use, a series of comprehensive experiments have demonstrated that engineered LaCOme features outperform the original transcriptomic features in classification performances. Based on experimental results, OMIC data from biological samples could be further enriched using CNN to enhance computational analysis results. Also, feature rough screening can be used to extract valuable information from OMIC, regardless of the algorithm used to select features. It may always be better to create a novel feature than to keep the original. Furthermore, we investigated the feasibility of the feature construction method through cross-validation and independent verification, hoping to develop a more efficient and effective method.
Affiliation(s)
- Hongyu Wang
  - Department of Nuclear Medicine, General Hospital of Northern Theater Command, Shenyang, Liaoning 110016, China; College of Software, Jilin University, Changchun, Jilin 130012, China
- Zhaomin Yao
  - Department of Nuclear Medicine, General Hospital of Northern Theater Command, Shenyang, Liaoning 110016, China; College of Medicine and Biological Information Engineering, Northeastern University, Shenyang, Liaoning 110167, China
- Renli Luo
  - College of Medicine and Biological Information Engineering, Northeastern University, Shenyang, Liaoning 110167, China
- Jiahao Liu
  - School of Mathematical Sciences, Chongqing Normal University, Chongqing 401331, China
- Zhiguo Wang
  - Department of Nuclear Medicine, General Hospital of Northern Theater Command, Shenyang, Liaoning 110016, China; College of Medicine and Biological Information Engineering, Northeastern University, Shenyang, Liaoning 110167, China
- Guoxu Zhang
  - Department of Nuclear Medicine, General Hospital of Northern Theater Command, Shenyang, Liaoning 110016, China; College of Medicine and Biological Information Engineering, Northeastern University, Shenyang, Liaoning 110167, China
6
Mohammed MA, Abdulkareem KH, Dinar AM, Zapirain BG. Rise of Deep Learning Clinical Applications and Challenges in Omics Data: A Systematic Review. Diagnostics (Basel) 2023; 13:664. [PMID: 36832152 PMCID: PMC9955380 DOI: 10.3390/diagnostics13040664] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Received: 12/24/2022] [Revised: 02/05/2023] [Accepted: 02/07/2023] [Indexed: 02/12/2023] Open
Abstract
This research aims to review and evaluate the most relevant scientific studies about deep learning (DL) models in the omics field. It also aims to realize the potential of DL techniques in omics data analysis fully by demonstrating this potential and identifying the key challenges that must be addressed. Numerous elements are essential for comprehending numerous studies by surveying the existing literature. For example, the clinical applications and datasets from the literature are essential elements. The published literature highlights the difficulties encountered by other researchers. In addition to looking for other studies, such as guidelines, comparative studies, and review papers, a systematic approach is used to search all relevant publications on omics and DL using different keyword variants. From 2018 to 2022, the search procedure was conducted on four Internet search engines: IEEE Xplore, Web of Science, ScienceDirect, and PubMed. These indexes were chosen because they offer enough coverage and linkages to numerous papers in the biological field. A total of 65 articles were added to the final list. The inclusion and exclusion criteria were specified. Of the 65 publications, 42 are clinical applications of DL in omics data. Furthermore, 16 out of 65 articles comprised the review publications based on single- and multi-omics data from the proposed taxonomy. Finally, only a small number of articles (7/65) were included in papers focusing on comparative analysis and guidelines. The use of DL in studying omics data presented several obstacles related to DL itself, preprocessing procedures, datasets, model validation, and testbed applications. Numerous relevant investigations were performed to address these issues. Unlike other review papers, our study distinctly reflects different observations on omics with DL model areas. 
We believe that the result of this study can be a useful guideline for practitioners who look for a comprehensive view of the role of DL in omics data analysis.
Affiliation(s)
- Mazin Abed Mohammed
  - College of Computer Science and Information Technology, University of Anbar, Anbar 31001, Iraq
  - eVIDA Lab, University of Deusto, 48007 Bilbao, Spain
- Karrar Hameed Abdulkareem
  - College of Agriculture, Al-Muthanna University, Samawah 66001, Iraq
  - College of Engineering, University of Warith Al-Anbiyaa, Karbala 56001, Iraq
- Ahmed M. Dinar
  - Computer Engineering Department, University of Technology- Iraq, Baghdad 19006, Iraq
7
Disease-related compound identification based on deeping learning method. Sci Rep 2022; 12:20594. [PMID: 36446871 PMCID: PMC9708143 DOI: 10.1038/s41598-022-24385-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/07/2022] [Accepted: 11/15/2022] [Indexed: 12/02/2022] Open
Abstract
Acute lung injury (ALI) is a serious respiratory disease that can lead to acute respiratory failure or death. It is closely related to the pathogenesis of novel coronavirus pneumonia (COVID-19). Many studies have shown that traditional Chinese medicine (TCM) is effective as an intervention, and that network pharmacology can play an important role. To construct the "disease-gene-target-drug" interaction network more accurately, a deep learning algorithm is used in this paper. Two ALI-related target genes (REAL and SATA3) are considered, and the active and inactive compounds of the two corresponding target genes are collected as training data, respectively. Molecular descriptors and molecular fingerprints are used to characterize each compound. A forest graph-embedded deep feedforward network (forgeNet) is trained on these data. The experimental results show that forgeNet performs better than support vector machines (SVM), random forest (RF), logistic regression (LR), Naive Bayes (NB), XGBoost, LightGBM and gcForest. forgeNet could identify 19 compounds in Erhuang decoction (EhD) and Dexamethasone (DXMS) more accurately.
8
Kumar R, Khatri A, Acharya V. Deep learning uncovers distinct behavior of rice network to pathogens response. iScience 2022; 25:104546. [PMID: 35754717 PMCID: PMC9218438 DOI: 10.1016/j.isci.2022.104546] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 01/20/2022] [Revised: 05/06/2022] [Accepted: 06/02/2022] [Indexed: 12/15/2022] Open
Abstract
Rice, apart from abiotic stress, is prone to attack from multiple pathogens. Predominantly, the two rice pathogens, bacterial Xanthomonas oryzae (Xoo) and hemibiotrophic fungus, Magnaporthe oryzae, are extensively well explored for more than the last decade. However, because of lack of holistic studies, we design a deep learning-based rice network model (DLNet) that has explored the quantitative differences resulting in the distinct rice network architecture. Validation studies on rice in response to biotic stresses show that DLNet outperforms other machine learning methods. The current finding indicates the compactness of the rice PTI network and the rise of independent modules in the rice ETI network, resulting in similar patterns of the plant immune response. The results also show more independent network modules and minimum structural disorderness in rice-M. oryzae as compared to the rice-Xoo model revealing the different adaptation strategies of the rice plant to evade pathogen effectors.
Affiliation(s)
- Ravi Kumar
  - Functional Genomics and Complex System Lab, Biotechnology Division, The Himalayan Centre for High-throughput Computational Biology (HiCHiCoB, A BIC Supported by DBT, India), CSIR-Institute of Himalayan Bioresource Technology (CSIR-IHBT), Palampur, Himachal Pradesh, India; Academy of Scientific and Innovative Research (AcSIR), Ghaziabad 201002, India
- Abhishek Khatri
  - Functional Genomics and Complex System Lab, Biotechnology Division, The Himalayan Centre for High-throughput Computational Biology (HiCHiCoB, A BIC Supported by DBT, India), CSIR-Institute of Himalayan Bioresource Technology (CSIR-IHBT), Palampur, Himachal Pradesh, India
- Vishal Acharya
  - Functional Genomics and Complex System Lab, Biotechnology Division, The Himalayan Centre for High-throughput Computational Biology (HiCHiCoB, A BIC Supported by DBT, India), CSIR-Institute of Himalayan Bioresource Technology (CSIR-IHBT), Palampur, Himachal Pradesh, India; Academy of Scientific and Innovative Research (AcSIR), Ghaziabad 201002, India
9
Yang B, Bao W, Chen B. Disease-Ligand Identification Based on Flexible Neural Tree. Front Microbiol 2022; 13:912145. [PMID: 35733966 PMCID: PMC9207514 DOI: 10.3389/fmicb.2022.912145] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/15/2022] [Accepted: 05/06/2022] [Indexed: 12/04/2022] Open
Abstract
To accurately screen the disease-related compounds of a traditional Chinese medicine prescription in network pharmacology research, a new virtual screening method based on the flexible neural tree (FNT) model, a hybrid evolutionary method and a negative sample selection algorithm is proposed. A novel hybrid evolutionary algorithm based on grammar-guided genetic programming and the salp swarm algorithm is proposed to infer the optimal FNT. For hypertension, diabetes, and coronavirus disease 2019 (COVID-19), disease-related compounds are collected from the up-to-date literature. The unrelated compounds are chosen by the negative sample selection algorithm. ECFP6, MACCS, Macrocycle, and RDKit are each used to numerically characterize the chemical structure of every compound collected. The experimental results show that the proposed method performs better than classical classifiers [Support Vector Machine (SVM), random forest (RF), AdaBoost, decision tree (DT), Gradient Boosting Decision Tree (GBDT), KNN, logistic regression (LR), and Naive Bayes (NB)], an up-to-date classifier (gcForest), and a deep learning method (forgeNet) in terms of AUC, ROC, TPR, FPR, Precision, Specificity, and F1. The MACCS method suits the largest number of classifiers. All methods perform poorly with the ECFP6 molecular descriptor.
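Several of the evaluation measures listed above (TPR, FPR, Precision, Specificity, F1) follow directly from the binary confusion matrix. The sketch below uses the standard definitions; the function name and toy labels are illustrative assumptions, and it assumes both classes occur in the labels and at least one positive prediction is made, so no denominator is zero.

```python
def binary_metrics(y_true, y_pred):
    """Compute confusion-matrix metrics for 0/1 labels and predictions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    tpr = tp / (tp + fn)            # true positive rate (recall)
    fpr = fp / (fp + tn)            # false positive rate
    precision = tp / (tp + fp)
    specificity = tn / (tn + fp)    # equals 1 - FPR
    f1 = 2 * precision * tpr / (precision + tpr)
    return {"TPR": tpr, "FPR": fpr, "Precision": precision,
            "Specificity": specificity, "F1": f1}

# Toy labels: 1 = disease-related compound, 0 = unrelated compound.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
metrics = binary_metrics(y_true, y_pred)
```

AUC is not covered here because it is computed from ranked prediction scores rather than from a single set of hard labels.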
Affiliation(s)
- Bin Yang
  - School of Information Science and Engineering, Zaozhuang University, Zaozhuang, China
- Wenzheng Bao
  - School of Information and Electrical Engineering, Xuzhou University of Technology, Xuzhou, China
  - *Correspondence: Wenzheng Bao
10
St Clair R, Teti M, Pavlovic M, Hahn W, Barenholtz E. Predicting residues involved in anti-DNA autoantibodies with limited neural networks. Med Biol Eng Comput 2022; 60:1279-1293. [PMID: 35303216 PMCID: PMC8932093 DOI: 10.1007/s11517-022-02539-7] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 02/28/2021] [Accepted: 01/10/2022] [Indexed: 11/30/2022]
Abstract
Computer-aided rational vaccine design (RVD) and synthetic pharmacology are rapidly developing fields that leverage existing datasets for developing compounds of interest. Computational proteomics utilizes algorithms and models to probe proteins for functional prediction. A potentially strong target for computational approach is autoimmune antibodies, which are the result of broken tolerance in the immune system where it cannot distinguish “self” from “non-self” resulting in attack of its own structures (proteins and DNA, mainly). The information on structure, function, and pathogenicity of autoantibodies may assist in engineering RVD against autoimmune diseases. Current computational approaches exploit large datasets curated with extensive domain knowledge, most of which include the need for many resources and have been applied indirectly to problems of interest for DNA, RNA, and monomer protein binding. We present a novel method for discovering potential binding sites. We employed long short-term memory (LSTM) models trained on FASTA primary sequences to predict protein binding in DNA-binding hydrolytic antibodies (abzymes). We also employed CNN models applied to the same dataset for comparison with LSTM. While the CNN model outperformed the LSTM on the primary task of binding prediction, analysis of internal model representations of both models showed that the LSTM models recovered sub-sequences that were strongly correlated with sites known to be involved in binding. These results demonstrate that analysis of internal processes of LSTM models may serve as a powerful tool for primary sequence analysis.
Affiliation(s)
- Rachel St Clair
  - Center for Complex Systems and Brain Sciences, Florida Atlantic University, Boca Raton, USA
- Michael Teti
  - Center for Complex Systems and Brain Sciences, Florida Atlantic University, Boca Raton, USA
- Mirjana Pavlovic
  - Department of Electrical Engineering and Computer Science, Florida Atlantic University, Boca Raton, USA
- William Hahn
  - Center for Complex Systems and Brain Sciences, Florida Atlantic University, Boca Raton, USA
- Elan Barenholtz
  - Center for Complex Systems and Brain Sciences, Florida Atlantic University, Boca Raton, USA
11
Yang B, Bao W, Wang J. Active disease-related compound identification based on capsule network. Brief Bioinform 2022; 23:bbab462. [PMID: 35057581 PMCID: PMC8690041 DOI: 10.1093/bib/bbab462] [Citation(s) in RCA: 12] [Impact Index Per Article: 4.0] [Received: 09/09/2021] [Revised: 09/30/2021] [Accepted: 10/07/2021] [Indexed: 01/03/2023] Open
Abstract
Pneumonia, especially coronavirus disease 2019 (COVID-19), can lead to serious acute lung injury, acute respiratory distress syndrome, multiple organ failure and even death. Developing high-efficiency, low-toxicity and targeted drugs based on the pathogenesis of the coronavirus is therefore an urgent task. In this paper, a novel disease-related compound identification model based on the capsule network (CapsNet) is proposed. Using pneumonia-related keywords, the prescriptions and active components related to the pharmacological mechanism of the disease are collected and extracted to construct the training set. The features of each component are extracted and fed to the input layer of the capsule network. CapsNet is trained and used to identify the pneumonia-related compounds in Qingre Jiedu injection. The experimental results show that CapsNet can identify disease-related compounds more accurately than SVM, RF, gcForest and forgeNet.
Affiliation(s)
- Bin Yang
  - School of Information Science and Engineering, Zaozhuang University, Zaozhuang, China 277160
- Wenzheng Bao
  - School of Information and Electrical Engineering, Xuzhou University of Technology, Xuzhou, China 221018
- Jinglong Wang
  - College of Food Science and Pharmaceutical Engineering, Zaozhuang University, Zaozhuang 277160, China
12
Yang B. Gene Regulatory Network Identification based on Forest Graph-embedded Deep Feedforward Network. 2021 6th International Conference on Cloud Computing and Internet of Things 2021. [DOI: 10.1145/3493287.3493297] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/01/2023]
13
Tan K, Huang W, Liu X, Hu J, Dong S. A Hierarchical Graph Convolution Network for Representation Learning of Gene Expression Data. IEEE J Biomed Health Inform 2021; 25:3219-3229. [PMID: 33449889 DOI: 10.1109/jbhi.2021.3052008] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 11/09/2022]
Abstract
The curse of dimensionality, which is caused by high-dimensionality and low-sample-size, is a major challenge in gene expression data analysis. However, the real situation is even worse: labelling data is laborious and time-consuming, so only a small part of the limited samples will be labelled. Having such few labelled samples further increases the difficulty of training deep learning models. Interpretability is an important requirement in biomedicine. Many existing deep learning methods are trying to provide interpretability, but rarely apply to gene expression data. Recent semi-supervised graph convolution network methods try to address these problems by smoothing the label information over a graph. However, to the best of our knowledge, these methods only utilize graphs in either the feature space or sample space, which restrict their performance. We propose a transductive semi-supervised representation learning method called a hierarchical graph convolution network (HiGCN) to aggregate the information of gene expression data in both feature and sample spaces. HiGCN first utilizes external knowledge to construct a feature graph and a similarity kernel to construct a sample graph. Then, two spatial-based GCNs are used to aggregate information on these graphs. To validate the model's performance, synthetic and real datasets are provided to lend empirical support. Compared with two recent models and three traditional models, HiGCN learns better representations of gene expression data, and these representations improve the performance of downstream tasks, especially when the model is trained on a few labelled samples. Important features can be extracted from our model to provide reliable interpretability.
14
Shen C, Luo J, Ouyang W, Ding P, Chen X. IDDkin: Network-based influence deep diffusion model for enhancing prediction of kinase inhibitors. Bioinformatics 2020; 36:5481-5491. [PMID: 33367525 DOI: 10.1093/bioinformatics/btaa1058] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Received: 07/09/2020] [Revised: 11/09/2020] [Accepted: 12/10/2020] [Indexed: 01/01/2023] Open
Abstract
MOTIVATION Protein kinases have been the focus of drug discovery research for many years because they play a causal role in many human diseases. Understanding the binding profile of kinase inhibitors is a prerequisite for drug discovery, and traditional methods of predicting kinase inhibitors are time-consuming and inefficient. Calculation-based predictive methods provide a relatively low-cost and high-efficiency approach to the rapid development and effective understanding of the binding profile of kinase inhibitors. In particular, the continuous improvement of network pharmacology methods provides unprecedented opportunities for drug discovery; network-based computational methods can aggregate effective information from heterogeneous sources and have become a new way of predicting the binding profile of kinase inhibitors. RESULTS In this study, we proposed a network-based influence deep diffusion model, named IDDkin, for enhancing the prediction of kinase inhibitors. IDDkin uses deep graph convolutional networks, graph attention networks and adaptive weighting methods to diffuse the effective information of heterogeneous networks. The updated kinase and compound representations are used to predict potential compound-kinase pairs. The experimental results show that the performance of IDDkin is superior to the comparison methods, including the state-of-the-art kinase inhibitor prediction method and the classic model widely used in relationship prediction. In experiments conducted to verify its generalizability and in case studies, the IDDkin model also shows excellent performance. All of these results demonstrate the powerful predictive ability of the IDDkin model in the field of kinase inhibitors. AVAILABILITY AND IMPLEMENTATION Source code and data can be downloaded from https://github.com/CS-BIO/IDDkin. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Cong Shen
  - College of Computer Science and Electronic Engineering, Hunan University, Changsha, 410083, China
- Jiawei Luo
  - College of Computer Science and Electronic Engineering, Hunan University, Changsha, 410083, China
- Wenjue Ouyang
  - College of Computer Science and Electronic Engineering, Hunan University, Changsha, 410083, China
- Pingjian Ding
  - School of Computer Science, University of South China, Hengyang, 421001, China
- Xiangtao Chen
  - College of Computer Science and Electronic Engineering, Hunan University, Changsha, 410083, China