1
|
Chen L, Zhang S, Zhou B. Herb-disease association prediction model based on network consistency projection. Sci Rep 2025; 15:3328. [PMID: 39865145 PMCID: PMC11770172 DOI: 10.1038/s41598-025-87521-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/11/2024] [Accepted: 01/20/2025] [Indexed: 01/28/2025] Open
Abstract
A growing number of biological and clinical reports indicate the usefulness of herbs in the treatment of complex human diseases, providing an essential supplement to modern medicine. As with drugs, experimental validation to identify the diseases related to a given herb is both expensive and time-consuming. Such validation is even more difficult because each herb contains several components. An alternative is to design computational models to predict herb-disease associations (HDAs). Nevertheless, only a few computational models have been developed for HDA prediction. In this study, we make full use of several properties of herbs and diseases, collected from the public database HERB, to design a model named HDAPM-NCP for predicting HDAs. Based on these properties, six herb kernels and five disease kernels are constructed, which are further fused into one unified herb kernel and one unified disease kernel. These kernels and the herb-disease adjacency matrix are fed into network consistency projection to quantify the association strength of herb-disease pairs. The cross-validation results show that HDAPM-NCP performs well and outperforms two previous models. The ablation experiments demonstrate the contribution of each module in the model. Finally, we analyze the strengths and weaknesses of the model, identifying the herb-disease pairs for which HDAPM-NCP yields reliable or unsatisfactory predictions, and a case study shows that HDAPM-NCP can discover latent HDAs.
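The projection step mentioned in this abstract can be illustrated with a short sketch. The abstract does not give the scoring formula, so the code assumes a formulation of network consistency projection that is common in association-prediction work: the herb's kernel row is projected onto the disease's association column, the herb's association row is projected onto the disease's kernel column, and the sum is normalized by the norms of the two kernel vectors. The function name `ncp_scores` and the toy matrices are hypothetical, and the fused kernels of HDAPM-NCP are replaced by small hand-written similarity matrices.

```python
import math


def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def _norm(v):
    return math.sqrt(_dot(v, v))


def ncp_scores(herb_sim, disease_sim, adj):
    """Score every herb-disease pair with an assumed NCP formulation.

    herb_sim:    n x n herb similarity (kernel) matrix
    disease_sim: m x m disease similarity (kernel) matrix
    adj:         n x m 0/1 matrix of known herb-disease associations
    """
    n_herbs, n_diseases = len(adj), len(adj[0])
    scores = [[0.0] * n_diseases for _ in range(n_herbs)]
    for i in range(n_herbs):
        herb_row = herb_sim[i]
        assoc_row = adj[i]
        row_norm = _norm(assoc_row)
        for j in range(n_diseases):
            assoc_col = [adj[k][j] for k in range(n_herbs)]
            disease_col = [disease_sim[k][j] for k in range(n_diseases)]
            col_norm = _norm(assoc_col)
            # Herb-space projection: herb similarities onto known herbs of disease j.
            herb_proj = _dot(herb_row, assoc_col) / col_norm if col_norm > 0 else 0.0
            # Disease-space projection: known diseases of herb i onto disease similarities.
            disease_proj = _dot(assoc_row, disease_col) / row_norm if row_norm > 0 else 0.0
            denom = _norm(herb_row) + _norm(disease_col)
            scores[i][j] = (herb_proj + disease_proj) / denom if denom > 0 else 0.0
    return scores


# Toy data (hypothetical): herb 1 has no known disease but is similar to herb 0.
herb_sim = [[1.0, 0.8], [0.8, 1.0]]
disease_sim = [[1.0, 0.0], [0.0, 1.0]]
adj = [[1, 0], [0, 0]]
scores = ncp_scores(herb_sim, disease_sim, adj)
```

In this toy case herb 1 receives a nonzero score for disease 0 only because it resembles herb 0, which is the kind of latent association the abstract refers to.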
Affiliation(s)
- Lei Chen
- College of Information Engineering, Shanghai Maritime University, Shanghai, 201306, People's Republic of China.
| | - Shiyi Zhang
- College of Information Engineering, Shanghai Maritime University, Shanghai, 201306, People's Republic of China
| | - Bo Zhou
- School of Basic Medical Sciences, Shanghai University of Medicine and Health Sciences, Shanghai, 201318, China
| |
|
2
|
Gu Y, Zheng S, Zhang B, Kang H, Jiang R, Li J. Deep multiple instance learning on heterogeneous graph for drug-disease association prediction. Comput Biol Med 2025; 184:109403. [PMID: 39577348 DOI: 10.1016/j.compbiomed.2024.109403] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/09/2024] [Revised: 11/05/2024] [Accepted: 11/08/2024] [Indexed: 11/24/2024]
Abstract
Drug repositioning offers promising prospects for accelerating drug discovery by identifying potential drug-disease associations (DDAs) for existing drugs and diseases. Previous methods have generated meta-path-augmented node or graph embeddings for DDA prediction in drug-disease heterogeneous networks. However, these approaches rarely develop end-to-end frameworks for path instance-level representation learning as well as the further feature selection and aggregation. By leveraging the abundant topological information in path instances, more fine-grained and interpretable predictions can be achieved. To this end, we introduce deep multiple instance learning into drug repositioning by proposing a novel method called MilGNet. MilGNet employs a heterogeneous graph neural network (HGNN)-based encoder to learn drug and disease node embeddings. Treating each drug-disease pair as a bag, we designed a special quadruplet meta-path form and implemented a pseudo meta-path generator in MilGNet to obtain multiple meta-path instances based on network topology. Additionally, a bidirectional instance encoder enhances the representation of meta-path instances. Finally, MilGNet utilizes a multi-scale interpretable predictor to aggregate bag embeddings with an attention mechanism, providing predictions at both the bag and instance levels for accurate and explainable predictions. Comprehensive experiments on five benchmarks demonstrate that MilGNet significantly outperforms ten advanced methods. Notably, three case studies on one drug (Methotrexate) and two diseases (Renal Failure and Mismatch Repair Cancer Syndrome) highlight MilGNet's potential for discovering new indications, therapies, and generating rational meta-path instances to investigate possible treatment mechanisms. The source code is available at https://github.com/gu-yaowen/MilGNet.
Affiliation(s)
- Yaowen Gu
- Institute of Medical Information, Chinese Academy of Medical Sciences and Peking Union Medical College (CAMS&PUMC), Beijing, 100020, China; Department of Chemistry, New York University, NY, 10027, USA.
| | - Si Zheng
- Institute of Medical Information, Chinese Academy of Medical Sciences and Peking Union Medical College (CAMS&PUMC), Beijing, 100020, China; Institute for Artificial Intelligence, Department of Computer Science and Technology, BNRist, Tsinghua University, Beijing, 100084, China
| | - Bowen Zhang
- Beijing StoneWise Technology Co Ltd., Beijing, 100080, China
| | - Hongyu Kang
- Institute of Medical Information, Chinese Academy of Medical Sciences and Peking Union Medical College (CAMS&PUMC), Beijing, 100020, China
| | - Rui Jiang
- Ministry of Education Key Laboratory of Bioinformatics, Bioinformatics Division at the Beijing National Research Center for Information Science and Technology, Center for Synthetic and Systems Biology, Department of Automation, Tsinghua University, Beijing, 100084, China
| | - Jiao Li
- Institute of Medical Information, Chinese Academy of Medical Sciences and Peking Union Medical College (CAMS&PUMC), Beijing, 100020, China.
| |
|
3
|
Ma Q, Chen L, Feng K, Guo W, Huang T, Cai YD. Exploring Prognostic Gene Factors in Breast Cancer via Machine Learning. Biochem Genet 2024; 62:5022-5050. [PMID: 38383836 DOI: 10.1007/s10528-024-10712-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/12/2023] [Accepted: 01/21/2024] [Indexed: 02/23/2024]
Abstract
Breast cancer remains the most prevalent cancer in women. To date, its underlying molecular mechanisms have not been fully uncovered. Determining gene factors that correlate specific gene expression with tumor staging is important for improving our understanding of breast cancer. However, the knowledge in this regard is still far from complete. Thus, this study aimed to address these knowledge gaps by analyzing existing gene expression profile data from 3149 breast cancer samples, where each sample was represented by the expression of 19,644 genes and classified into Nottingham histological grade (NHG) classes (Grade 1, 2, and 3). To this end, a machine learning-based framework was designed. First, the profile data were analyzed by using seven feature ranking algorithms to evaluate the importance of features (genes). Seven feature lists were generated, each of which sorted features in accordance with feature importance evaluated from a particular aspect. Then, the incremental feature selection method was applied to each list to determine essential features for classification and to build efficient classifiers. Consequently, overlapping genes, such as AURKA, CBX2, and MYBL2, which were identified as important by multiple feature ranking algorithms, were deemed potentially related to breast cancer malignancy and prognosis. In addition, the study formulated classification rules to reflect specific gene expression patterns for the three NHG classes. Some genes and rules were analyzed and supported by recent literature, providing new references for studying breast cancer.
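Incremental feature selection, as named in this abstract, can be sketched as a loop over growing prefixes of a ranked feature list. The sketch below is an illustration under stated assumptions, not the study's pipeline: the evaluator is a toy leave-one-out 1-nearest-neighbour accuracy, the four-sample dataset and the ranking `[0, 2, 1]` are made up, and the function names are hypothetical.

```python
def loo_1nn_accuracy(X, y, feature_idx):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier
    restricted to the given feature columns (toy evaluator)."""
    correct = 0
    for i, xi in enumerate(X):
        best_j, best_d = None, float("inf")
        for j, xj in enumerate(X):
            if i == j:
                continue
            d = sum((xi[f] - xj[f]) ** 2 for f in feature_idx)
            if d < best_d:
                best_j, best_d = j, d
        correct += y[best_j] == y[i]
    return correct / len(X)


def incremental_feature_selection(ranked_features, evaluate, step=1):
    """Evaluate the top-k features for k = step, 2*step, ... and keep the
    smallest prefix that reaches the best score."""
    best_subset, best_score = [], float("-inf")
    for k in range(step, len(ranked_features) + 1, step):
        subset = ranked_features[:k]
        score = evaluate(subset)
        if score > best_score:  # strict '>' keeps the shorter prefix on ties
            best_subset, best_score = subset, score
    return best_subset, best_score


# Toy data (hypothetical): feature 0 separates the classes, features 1 and 2 are noise.
X = [[0.1, 5.0, 0.3], [0.2, 1.0, 0.9], [0.9, 4.0, 0.1], [1.0, 2.0, 0.8]]
y = [0, 0, 1, 1]
ranked = [0, 2, 1]  # assumed output of some feature ranking algorithm

best_subset, best_score = incremental_feature_selection(
    ranked, lambda subset: loo_1nn_accuracy(X, y, subset)
)
```

Passing the evaluator as a callback keeps the loop independent of the classifier, so the same function could be run once per ranked list, as the abstract describes for its seven lists.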
Affiliation(s)
- QingLan Ma
- School of Life Sciences, Shanghai University, Shanghai, 200444, China
| | - Lei Chen
- College of Information Engineering, Shanghai Maritime University, Shanghai, 201306, China
| | - KaiYan Feng
- Department of Computer Science, Guangdong AIB Polytechnic College, Guangzhou, 510507, China
| | - Wei Guo
- Key Laboratory of Stem Cell Biology, Shanghai Jiao Tong University School of Medicine (SJTUSM) & Shanghai Institutes for Biological Sciences (SIBS), Chinese Academy of Sciences (CAS), Shanghai, 200030, China
| | - Tao Huang
- Bio-Med Big Data Center, CAS Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, 200031, China.
- CAS Key Laboratory of Tissue Microenvironment and Tumor, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, 200031, China.
| | - Yu-Dong Cai
- School of Life Sciences, Shanghai University, Shanghai, 200444, China.
| |
|
4
|
Wu W, He T, Hao X, Xu K, Zeng J, Gu J, Chen L. Machine learning based method for analyzing vibration and noise in large cruise ships. PLoS One 2024; 19:e0307835. [PMID: 39052593 PMCID: PMC11271874 DOI: 10.1371/journal.pone.0307835] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/03/2024] [Accepted: 07/09/2024] [Indexed: 07/27/2024] Open
Abstract
Cruise ships are special passenger ships that transport passengers to various ports and place importance on comfort. High comfort can attract many passengers and generate substantial profits. Vibration and noise are the most important indicators for assessing the comfort of cruise ships. Existing methods for analyzing vibration and noise data have shown limitations in uncovering essential information and discerning critical disparities in vibration and noise levels across different ship districts. Meanwhile, the rapid development of machine learning presents an opportunity to leverage sophisticated algorithms for a more insightful examination of vibration and noise aboard cruise ships. This study designed a machine learning-driven approach to analyze vibration and noise data. Drawing data from China's first large-scale cruise ship, encompassing 127 noise samples, this study set up a classification task in which decks were assigned as labels and frequencies served as features. Essential information was extracted by investigating this problem. Several machine learning algorithms, including feature ranking, selection, and classification algorithms, were adopted in this method. One or two essential noise frequencies related to each of the decks, except the 10th deck, were obtained, which were partly validated by traditional statistical methods. Such findings are helpful in reducing and controlling vibration and noise in cruise ships. Furthermore, the study developed a classifier to distinguish noise samples, which uses random forest as the classification algorithm with eight optimal frequency features identified by LightGBM. This classifier yielded a Matthews correlation coefficient of 0.3415. This study offers a new direction for investigating vibration and noise in ships.
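Because the classification task here has several deck labels, the reported Matthews correlation coefficient is presumably the multi-class form; the abstract does not say which variant was used, so that is an assumption. A minimal sketch of the multi-class generalization (which reduces to the familiar binary formula for two classes) follows; the function name and the deck labels in the example are hypothetical.

```python
import math
from collections import Counter


def multiclass_mcc(y_true, y_pred):
    """Multi-class Matthews correlation coefficient computed from label lists."""
    s = len(y_true)                                   # number of samples
    c = sum(t == p for t, p in zip(y_true, y_pred))   # correctly classified samples
    true_counts = Counter(y_true)                     # t_k: occurrences of each true class
    pred_counts = Counter(y_pred)                     # p_k: times each class was predicted
    labels = set(true_counts) | set(pred_counts)
    sum_pt = sum(pred_counts[k] * true_counts[k] for k in labels)
    sum_pp = sum(v * v for v in pred_counts.values())
    sum_tt = sum(v * v for v in true_counts.values())
    denom = math.sqrt((s * s - sum_pp) * (s * s - sum_tt))
    # By convention, return 0 when one side is constant and the ratio is undefined.
    return (c * s - sum_pt) / denom if denom > 0 else 0.0


# Hypothetical deck labels for illustration only.
perfect = multiclass_mcc(["deck3", "deck5", "deck8", "deck3"],
                         ["deck3", "deck5", "deck8", "deck3"])
partial = multiclass_mcc([1, 1, 0, 0], [1, 0, 0, 0])
```

The coefficient ranges from -1 to 1, with 0 corresponding to chance-level agreement, which gives some context for reading a value such as 0.3415.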
Affiliation(s)
- Wenwei Wu
- China Ship Scientific Research Center, Wuxi, Jiangsu Province, China
| | - Tao He
- China Ship Scientific Research Center, Wuxi, Jiangsu Province, China
| | - Xiaying Hao
- China Ship Scientific Research Center, Wuxi, Jiangsu Province, China
| | - Kaiwei Xu
- China Ship Scientific Research Center, Wuxi, Jiangsu Province, China
| | - Ji Zeng
- College of Ocean Science and Engineering, Shanghai Maritime University, Shanghai, China
| | - Jiahui Gu
- College of Information Engineering, Shanghai Maritime University, Shanghai, China
| | - Lei Chen
- College of Information Engineering, Shanghai Maritime University, Shanghai, China
| |
|
5
|
Ma Q, Shen Y, Guo W, Feng K, Huang T, Cai Y. Machine Learning Reveals Impacts of Smoking on Gene Profiles of Different Cell Types in Lung. Life (Basel) 2024; 14:502. [PMID: 38672772 PMCID: PMC11051039 DOI: 10.3390/life14040502] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/08/2024] [Revised: 04/03/2024] [Accepted: 04/10/2024] [Indexed: 04/28/2024] Open
Abstract
Smoking significantly elevates the risk of lung diseases such as chronic obstructive pulmonary disease (COPD) and lung cancer. This risk is attributed to the harmful chemicals in tobacco smoke that damage lung tissue and impair lung function. Current research on the impact of smoking on gene expression in specific lung cells is limited. This study addresses this gap by analyzing gene expression profiles at the single-cell level from 43,539 lung endothelial cells, 234,349 lung epithelial cells, 189,843 lung immune cells, and 16,031 lung stromal cells using advanced machine learning techniques. The data, categorized by different lung cell types, were classified into three smoking states: active smoker, former smoker, and never smoker. Each cell sample encompassed 28,024 feature genes. Employing an incremental feature selection method within a computational framework, several specific genes have been identified as potential markers of smoking status in different lung cell types. These include B2M, EEF1A1, and TPT1 in lung endothelial cells; FTL and MT-ATP8 in lung epithelial cells; HLA-B and HLA-C in lung immune cells; and HSP90B1 and LCN2 in lung stroma cells. Additionally, this study developed quantitative rules for representing the gene expression patterns related to smoking. This research highlights the potential of machine learning in oncology, enhancing our molecular understanding of smoking's harm and laying the groundwork for future mechanism-based studies.
Affiliation(s)
- Qinglan Ma
- School of Life Sciences, Shanghai University, Shanghai 200444, China;
| | - Yulong Shen
- Department of Radiotherapy, Strategic Support Force Medical Center, Beijing 100101, China;
| | - Wei Guo
- Key Laboratory of Stem Cell Biology, Shanghai Jiao Tong University School of Medicine (SJTUSM) & Shanghai Institutes for Biological Sciences (SIBS), Chinese Academy of Sciences (CAS), Shanghai 200030, China;
| | - Kaiyan Feng
- Department of Computer Science, Guangdong AIB Polytechnic College, Guangzhou 510507, China;
| | - Tao Huang
- Bio-Med Big Data Center, CAS Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai 200031, China
- CAS Key Laboratory of Tissue Microenvironment and Tumor, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai 200031, China
| | - Yudong Cai
- School of Life Sciences, Shanghai University, Shanghai 200444, China;
| |
|
6
|
Ren J, Gao Q, Zhou X, Chen L, Guo W, Feng K, Huang T, Cai YD. Identification of key gene expression associated with quality of life after recovery from COVID-19. Med Biol Eng Comput 2024; 62:1031-1048. [PMID: 38123886 DOI: 10.1007/s11517-023-02988-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/06/2023] [Accepted: 11/30/2023] [Indexed: 12/23/2023]
Abstract
Post-acute sequelae of COVID-19 (PASC) is a persistent complication of severe acute respiratory syndrome coronavirus 2 infection that includes symptoms, such as fatigue, cognitive impairment, and respiratory distress. These symptoms severely affect the quality of life of patients after their recovery from COVID-19. In this study, a group of machine learning algorithms analyzed the whole blood RNA-seq data from patients with different PASC levels. The purpose of this analysis was to identify the gene markers associated with PASC and the special expression patterns for different PASC levels. By comparing the quality of life of patients after the acute phase of COVID-19 and before the disease, samples in the dataset were divided into three groups, namely, "Better," "The Same," and "Worse." Each patient was represented by the expression levels of 58,929 genes. The machine learning-based workflow included six feature-ranking algorithms, incremental feature selection (IFS), and four classification algorithms. The feature ranking algorithms were in charge of assessing feature importance, whereas IFS with classification algorithms were used to extract essential genes and to construct efficient classifiers and classification rules. The expression of top genes in the results was associated with the immune response to viral infection, which is supported by the published literature. For example, patients with low CCDC18 expression and high CPED1 expression had good quality of life, whereas those with low CDC16 expression had poor quality of life.
Affiliation(s)
- JingXin Ren
- School of Life Sciences, Shanghai University, Shanghai, 200444, China
| | - Qian Gao
- Department of Pharmacy, Shanghai Children's Medical Center, School of Medicine, Shanghai Jiao Tong University, Shanghai, 200127, China
| | - XianChao Zhou
- Center for Single-Cell Omics, School of Public Health, Shanghai Jiao Tong University School of Medicine, Shanghai, 200025, China
| | - Lei Chen
- College of Information Engineering, Shanghai Maritime University, Shanghai, 201306, China
| | - Wei Guo
- Key Laboratory of Stem Cell Biology, Shanghai Jiao Tong University School of Medicine (SJTUSM) & Shanghai Institutes for Biological Sciences (SIBS), Chinese Academy of Sciences (CAS), Shanghai, 200030, China
| | - KaiYan Feng
- Department of Computer Science, Guangdong AIB Polytechnic College, Guangzhou, 510507, China
| | - Tao Huang
- Bio-Med Big Data Center, CAS Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, 200031, China.
- CAS Key Laboratory of Tissue Microenvironment and Tumor, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, 200031, China.
| | - Yu-Dong Cai
- School of Life Sciences, Shanghai University, Shanghai, 200444, China.
| |
|
7
|
Ren J, Zhou X, Huang K, Chen L, Guo W, Feng K, Huang T, Cai YD. Identification of key genes associated with persistent immune changes and secondary immune activation responses induced by influenza vaccination after COVID-19 recovery by machine learning methods. Comput Biol Med 2024; 169:107883. [PMID: 38157776 DOI: 10.1016/j.compbiomed.2023.107883] [Citation(s) in RCA: 10] [Impact Index Per Article: 10.0] [Received: 10/06/2023] [Revised: 11/27/2023] [Accepted: 12/18/2023] [Indexed: 01/03/2024]
Abstract
COVID-19 is hypothesized to exert enduring effects on the immune systems of patients, leading to alterations in immune-related gene expression. This study aimed to scrutinize the persistent implications of SARS-CoV-2 infection on gene expression and its influence on subsequent immune activation responses. We designed a machine learning-based approach to analyze transcriptomic data from both healthy individuals and patients who had recovered from COVID-19. Patients were categorized based on their influenza vaccination status and then compared with healthy controls. The first sample set encompassed 86 blood samples from healthy controls and 72 blood samples from recovered COVID-19 patients prior to influenza vaccination. The second sample set included 123 blood samples from healthy controls and 106 blood samples from recovered COVID-19 patients who had been vaccinated against influenza. For each sample, the dataset captured expression levels of 17,060 genes. These two sample sets were first analyzed by seven feature ranking algorithms, yielding seven feature lists for each dataset. Then, each list was fed into the incremental feature selection method, incorporating three classic classification algorithms, to extract essential genes and classification rules and to build efficient classifiers. The genes and rules were analyzed in this study. The main findings were that NEXN and ZNF354A were highly expressed in recovered COVID-19 patients, whereas MKI67 and GZMB were highly expressed in patients with secondary immune activation post-COVID-19 recovery. These pivotal genes could provide valuable insights for future health monitoring of COVID-19 patients and guide the creation of continued treatment regimens.
Affiliation(s)
- Jingxin Ren
- School of Life Sciences, Shanghai University, Shanghai, 200444, China.
| | - XianChao Zhou
- Center for Single-Cell Omics, School of Public Health, Shanghai Jiao Tong University School of Medicine, Shanghai, 200025, China.
| | - Ke Huang
- School of Life Science and Technology, Shanghai Tech University, Shanghai, 201210, China.
| | - Lei Chen
- College of Information Engineering, Shanghai Maritime University, Shanghai, 201306, China.
| | - Wei Guo
- Key Laboratory of Stem Cell Biology, Shanghai Jiao Tong University School of Medicine (SJTUSM) & Shanghai Institutes for Biological Sciences (SIBS), Chinese Academy of Sciences (CAS), Shanghai, 200030, China.
| | - KaiYan Feng
- Department of Computer Science, Guangdong AIB Polytechnic College, Guangzhou, 510507, China.
| | - Tao Huang
- Bio-Med Big Data Center, CAS Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, 200031, China; CAS Key Laboratory of Tissue Microenvironment and Tumor, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, 200031, China.
| | - Yu-Dong Cai
- School of Life Sciences, Shanghai University, Shanghai, 200444, China.
| |
|
8
|
Chen L, Zhang C, Xu J. PredictEFC: a fast and efficient multi-label classifier for predicting enzyme family classes. BMC Bioinformatics 2024; 25:50. [PMID: 38291384 PMCID: PMC10829269 DOI: 10.1186/s12859-024-05665-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/29/2023] [Accepted: 01/22/2024] [Indexed: 02/01/2024] Open
Abstract
BACKGROUND Enzymes play an irreplaceable and important role in maintaining the lives of living organisms. The Enzyme Commission (EC) number of an enzyme indicates its essential functions. Correct identification of the first digit (family class) of the EC number for a given enzyme has been a hot topic over the past twenty years. Several previous methods adopted functional domain composition to represent enzymes. However, this leads to the curse of dimensionality, thereby reducing the efficiency of the methods. On the other hand, most previous methods can only deal with enzymes belonging to one family class. In fact, several enzymes belong to two or more family classes. RESULTS In this study, a fast and efficient multi-label classifier, named PredictEFC, was designed. To construct this classifier, a novel feature extraction scheme was designed for processing functional domain information of enzymes, which counts the distribution of each functional domain entry across seven family classes in the training dataset. Based on this scheme, each training or test enzyme was encoded into a 7-dimensional vector by fusing its functional domain information with the above statistical results. Random k-labelsets (RAKEL) was adopted to build the classifier, where random forest was selected as the base classification algorithm. The two tenfold cross-validation results on the training dataset showed that the accuracy of PredictEFC can reach 0.8493 and 0.8370. The independent tests on two datasets yielded accuracy values of 0.9118 and 0.8777. CONCLUSION The performance of PredictEFC was slightly lower than that of the classifier directly using functional domain composition. However, its efficiency was sharply improved: the running time was less than one-tenth of that of the classifier directly using functional domain composition. In addition, the utility of PredictEFC was superior to that of classifiers using traditional dimensionality reduction methods and of some previous methods, and this classifier can be transplanted for predicting enzyme family classes of other species. Finally, a web server available at http://124.221.158.221/ was set up for easy usage.
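The feature extraction scheme described in this abstract (per-domain class distributions fused into a 7-dimensional vector) might look roughly like the sketch below. The abstract does not specify how the distributions are fused, so averaging the normalized distributions of an enzyme's domains is an assumption made for illustration; the domain identifiers, class indices, and function names are likewise hypothetical.

```python
from collections import defaultdict

N_CLASSES = 7  # seven enzyme family classes, as stated in the abstract


def domain_class_counts(training_set):
    """Count, for each functional domain, how often it occurs in training
    enzymes of each family class. training_set holds (domains, classes) pairs;
    an enzyme may carry several class labels (multi-label)."""
    counts = defaultdict(lambda: [0] * N_CLASSES)
    for domains, classes in training_set:
        for d in domains:
            for c in classes:
                counts[d][c] += 1
    return counts


def encode_enzyme(domains, counts):
    """Encode an enzyme as a 7-dimensional vector: the mean of the normalized
    class distributions of its known domains (assumed fusion rule)."""
    vec = [0.0] * N_CLASSES
    used = 0
    for d in domains:
        if d not in counts:
            continue  # domain never seen in training: contributes nothing
        total = sum(counts[d])
        for c in range(N_CLASSES):
            vec[c] += counts[d][c] / total
        used += 1
    return [v / used for v in vec] if used else vec


# Hypothetical training enzymes: (functional domains, family class indices 0..6).
training = [
    (["D1"], [0]),
    (["D1", "D2"], [0, 3]),
    (["D2"], [3]),
]
counts = domain_class_counts(training)
vec = encode_enzyme(["D1"], counts)
unseen = encode_enzyme(["D9"], counts)
```

Whatever the exact fusion rule, the point made in the abstract carries over: the vector length is fixed at seven regardless of how many distinct domains exist, which is what avoids the high dimensionality of raw domain composition.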
Affiliation(s)
- Lei Chen
- College of Information Engineering, Shanghai Maritime University, Shanghai, 201306, People's Republic of China.
| | - Chenyu Zhang
- College of Information Engineering, Shanghai Maritime University, Shanghai, 201306, People's Republic of China
| | - Jing Xu
- College of Information Engineering, Shanghai Maritime University, Shanghai, 201306, People's Republic of China
| |
|
9
|
Zhou B, Ran B, Chen L. A GraphSAGE-based model with fingerprints only to predict drug-drug interactions. MATHEMATICAL BIOSCIENCES AND ENGINEERING : MBE 2024; 21:2922-2942. [PMID: 38454713 DOI: 10.3934/mbe.2024130] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/09/2024]
Abstract
Drugs are an effective way to treat various diseases. Some diseases are so complicated that the effect of a single drug is limited, which has led to the emergence of combination drug therapy. The use of multiple drugs to treat these diseases can improve drug efficacy, but it can also bring adverse effects. Thus, it is essential to determine drug-drug interactions (DDIs). Recently, deep learning algorithms have become popular for designing DDI prediction models. However, most deep learning-based models need several types of drug properties, which causes application problems for drugs without these properties. In this study, a new deep learning-based model was designed to predict DDIs. For wide applicability, drugs were first represented by commonly used properties, referred to as fingerprint features. Then, these features were fused with the drug interaction network by a type of graph convolutional network method, GraphSAGE, yielding high-level drug features. The inner product was adopted to score the strength of drug pairs. The model was evaluated by 10-fold cross-validation, resulting in an AUROC of 0.9704 and an AUPR of 0.9727. Such performance was better than that of the previous model that directly used drug fingerprint features and was competitive with some other previous models that used more drug properties. Furthermore, the ablation tests indicated the importance of the main parts of the model, and we analyzed the strengths and limitations of the model for drugs with different degrees in the network. This model identified some novel DDIs that may bring expected benefits, such as the combination of PEA and cannabinol, which may produce better effects. DDIs that may cause unexpected side effects were also discovered, such as the combined use of WIN 55,212-2 and cannabinol. These DDIs can provide novel insights for treating complex diseases or avoiding adverse drug events.
Affiliation(s)
- Bo Zhou
- Institute of Wound Prevention and Treatment, Shanghai University of Medicine and Health Sciences, Shanghai 201318, China
- School of Basic Medical Sciences, Shanghai University of Medicine and Health Sciences, Shanghai 201318, China
| | - Bing Ran
- College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China
| | - Lei Chen
- College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China
| |
|
10
|
Ren JX, Chen L, Guo W, Feng KY, Cai YD, Huang T. Patterns of Gene Expression Profiles Associated with Colorectal Cancer in Colorectal Mucosa by Using Machine Learning Methods. Comb Chem High Throughput Screen 2024; 27:2921-2934. [PMID: 37957897 DOI: 10.2174/0113862073266300231026103844] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/08/2023] [Revised: 09/11/2023] [Accepted: 09/30/2023] [Indexed: 11/15/2023]
Abstract
BACKGROUND Colorectal cancer (CRC) has a very high incidence and lethality rate and is one of the most dangerous cancer types. Timely diagnosis can effectively reduce the incidence of colorectal cancer. Changes in para-cancerous tissues may serve as an early signal for tumorigenesis. Comparison of the differences in gene expression between para-cancerous and normal mucosa can help in the diagnosis of CRC and understanding the mechanisms of development. OBJECTIVES This study aimed to identify specific genes at the level of gene expression, which are expressed in normal mucosa and may be predictive of CRC risk. METHODS A machine learning approach was used to analyze transcriptomic data in 459 samples of normal colonic mucosal tissue from 322 CRC cases and 137 non-CRC, in which each sample contained 28,706 gene expression levels. The genes were ranked using four ranking methods based on importance estimation (LASSO, LightGBM, MCFS, and mRMR) and four classification algorithms (decision tree [DT], K-nearest neighbor [KNN], random forest [RF], and support vector machine [SVM]) were combined with incremental feature selection [IFS] methods to construct a prediction model with excellent performance. RESULT The top-ranked genes, namely, HOXD12, CDH1, and S100A12, were associated with tumorigenesis based on previous studies. CONCLUSION This study summarized four sets of quantitative classification rules based on the DT algorithm, providing clues for understanding the microenvironmental changes caused by CRC. According to the rules, the effect of CRC on normal mucosa can be determined.
Affiliation(s)
- Jing Xin Ren
- School of Life Sciences, Shanghai University, Shanghai, 200444, China
| | - Lei Chen
- College of Information Engineering, Shanghai Maritime University, Shanghai, 201306, China
| | - Wei Guo
- Key Laboratory of Stem Cell Biology, Shanghai Jiao Tong University School of Medicine (SJTUSM) & Shanghai Institutes for Biological Sciences (SIBS), Chinese Academy of Sciences (CAS), Shanghai, 200030, China
| | - Kai Yan Feng
- Department of Computer Science, Guangdong AIB Polytechnic College, Guangzhou, 510507, China
| | - Yu-Dong Cai
- School of Life Sciences, Shanghai University, Shanghai, 200444, China
| | - Tao Huang
- Bio-Med Big Data Center, CAS Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, 200031, China
- CAS Key Laboratory of Tissue Microenvironment and Tumor, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, 200031, China
| |
|
11
|
Chen L, Qu R, Liu X. Improved multi-label classifiers for predicting protein subcellular localization. MATHEMATICAL BIOSCIENCES AND ENGINEERING : MBE 2024; 21:214-236. [PMID: 38303420 DOI: 10.3934/mbe.2024010] [Citation(s) in RCA: 9] [Impact Index Per Article: 9.0] [Indexed: 02/03/2024]
Abstract
Protein functions are closely related to their subcellular locations. At present, the prediction of protein subcellular locations is one of the most important problems in protein science. The evident defects of traditional methods make it urgent to design methods with high efficiency and low costs. To date, many computational methods have been proposed. However, this problem is far from being completely solved. Recently, some multi-label classifiers have been proposed to identify subcellular locations of human, animal, Gram-negative bacterial and eukaryotic proteins. These classifiers adopted protein features derived from gene ontology information. Although they provided good performance, they can be further improved by adopting more powerful machine learning algorithms. In this study, four improved multi-label classifiers were set up to identify subcellular locations of the above four protein types. The random k-labelsets (RAKEL) algorithm was used to tackle proteins with multiple locations, and random forest was used as the basic prediction engine. All classifiers were evaluated with the jackknife test, which indicated their high performance. Comparisons with previous classifiers further confirmed the superiority of the proposed classifiers.
Affiliation(s)
- Lei Chen
  - College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China
- Ruyun Qu
  - College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China
- Xintong Liu
  - College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China
|
12
|
Lin X, Ma Q, Chen L, Guo W, Huang Z, Huang T, Cai YD. Identifying genes associated with resistance to KRAS G12C inhibitors via machine learning methods. Biochim Biophys Acta Gen Subj 2023; 1867:130484. [PMID: 37805078 DOI: 10.1016/j.bbagen.2023.130484] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 06/29/2023] [Revised: 10/02/2023] [Accepted: 10/04/2023] [Indexed: 10/09/2023]
Abstract
BACKGROUND: Targeted therapy has revolutionized cancer treatment, greatly improving patient outcomes and quality of life. Lung cancer, specifically non-small cell lung cancer, is frequently driven by the G12C mutation at the KRAS locus. The development of KRAS inhibitors has been a breakthrough in the field of cancer research, given the crucial role of KRAS mutations in driving tumor growth and progression. However, over half of patients with cancer bypass inhibition show limited response to treatment. The mechanisms underlying tumor cell resistance to this treatment remain poorly understood. METHODS: To address this gap in knowledge, we conducted a study that aimed to elucidate the differences between tumor cells that respond positively to KRAS (G12C) inhibitor therapy and those that do not. Specifically, we analyzed single-cell gene expression profiles from KRAS G12C-mutant tumor cell models (H358, H2122, and SW1573) treated with the KRAS G12C inhibitor ARS-1620, which contained 4297 cells that continued to proliferate under treatment and 3315 cells that became quiescent. Each cell was represented by the expression levels of 8687 genes. We then designed an innovative machine learning-based framework, incorporating seven feature ranking algorithms and four classification algorithms, to identify essential genes and establish quantitative rules. RESULTS: Our analysis identified some top-ranked genes, including H2AFZ, CKS1B, TUBA1B, RRM2, and BIRC5, that are known to be associated with the progression of multiple cancers. CONCLUSION: These genes were relevant to tumor cell resistance to targeted therapy. This study provides important insights into the molecular mechanisms underlying tumor cell resistance to KRAS inhibitor treatment.
Affiliation(s)
- Xiandong Lin
  - Laboratory of Radiation Oncology and Radiobiology, Clinical Oncology School of Fujian Medical University and Fujian Cancer Hospital, Fuzhou 350014, China
- QingLan Ma
  - School of Life Sciences, Shanghai University, Shanghai 200444, China
- Lei Chen
  - College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China
- Wei Guo
  - Key Laboratory of Stem Cell Biology, Shanghai Jiao Tong University School of Medicine (SJTUSM) & Shanghai Institutes for Biological Sciences (SIBS), Chinese Academy of Sciences (CAS), Shanghai 200030, China
- Zhiyi Huang
  - College of Chemistry, Fuzhou University, Fuzhou 350000, China
- Tao Huang
  - Bio-Med Big Data Center, CAS Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai 200031, China
  - CAS Key Laboratory of Tissue Microenvironment and Tumor, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai 200031, China
- Yu-Dong Cai
  - School of Life Sciences, Shanghai University, Shanghai 200444, China
|
13
|
Chen L, Zhao X. PCDA-HNMP: Predicting circRNA-disease association using heterogeneous network and meta-path. MATHEMATICAL BIOSCIENCES AND ENGINEERING : MBE 2023; 20:20553-20575. [PMID: 38124565 DOI: 10.3934/mbe.2023909] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Indexed: 12/23/2023]
Abstract
A growing number of experimental studies have shown that circular RNAs (circRNAs) play important regulatory roles in human diseases through interactions with related microRNAs (miRNAs). CircRNAs have become new potential disease biomarkers and therapeutic targets. Predicting circRNA-disease associations (CDAs) is of great significance for exploring the pathogenesis of complex diseases, and can improve disease diagnosis and promote targeted therapy. However, determining CDAs through traditional clinical trials is usually time-consuming and expensive. Computational methods are now alternative ways to predict CDAs. In this study, a new computational method, named PCDA-HNMP, was designed. To obtain informative features of circRNAs and diseases, a heterogeneous network was first constructed, which defined circRNAs, mRNAs, miRNAs and diseases as nodes and the associations between them as edges. Then, a deep analysis was conducted on the heterogeneous network by extracting meta-paths connecting to circRNAs (diseases), thereby mining hidden associations between various circRNAs (diseases). These associations constituted the meta-path-induced networks for circRNAs and diseases. The features of circRNAs and diseases were derived from the aforementioned networks via mashup. In addition, miRNA-disease associations (mDAs) were employed to improve the model's performance. miRNA features were obtained from the meta-path-induced networks on miRNAs and circRNAs, which were constructed from the meta-paths connecting miRNAs and circRNAs in the heterogeneous network. A concatenation operation was adopted to build the features of CDAs and mDAs. These representations of CDAs and mDAs were fed into XGBoost to set up the model. Five-fold cross-validation yielded an area under the curve (AUC) of 0.9846, which was better than those of some existing state-of-the-art methods. Employing mDAs enhanced the model's performance, and the importance analysis of the meta-path-induced networks showed that networks produced by meta-paths containing validated CDAs provided the most important contributions.
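The idea of a meta-path-induced network described above can be illustrated by counting path instances along a sequence of node types. The sketch below is an assumption-laden toy, not the PCDA-HNMP implementation: the node names, the edge sets, and the helper functions `invert` and `count_metapath_instances` are all hypothetical, mRNA nodes are omitted, and the later mashup and XGBoost stages are not reproduced.

```python
from collections import defaultdict
from typing import Dict, Sequence, Set, Tuple

Adjacency = Dict[str, Set[str]]


def invert(adj: Adjacency) -> Adjacency:
    """Reverse every edge so a relation can be walked in the other direction."""
    rev: Dict[str, Set[str]] = defaultdict(set)
    for src, dsts in adj.items():
        for dst in dsts:
            rev[dst].add(src)
    return dict(rev)


def count_metapath_instances(
    relations: Dict[Tuple[str, str], Adjacency],
    metapath: Sequence[str],
    start: str,
) -> Dict[str, int]:
    """Count paths from `start` that follow the node-type sequence `metapath`.

    The result maps each reachable end node to its number of path instances,
    which can serve as an edge weight in a meta-path-induced network.
    """
    counts: Dict[str, int] = {start: 1}
    for edge_type in zip(metapath, metapath[1:]):
        adj = relations[edge_type]
        nxt: Dict[str, int] = defaultdict(int)
        for node, n_paths in counts.items():
            for neighbour in adj.get(node, ()):
                nxt[neighbour] += n_paths
        counts = dict(nxt)
    return counts


# Toy heterogeneous network (assumed edges, for illustration only).
circ_mir: Adjacency = {
    "circA": {"miR-1", "miR-2"},
    "circB": {"miR-2"},
    "circC": {"miR-3"},
}
mir_dis: Adjacency = {"miR-1": {"D1"}, "miR-2": {"D1", "D2"}, "miR-3": {"D2"}}

relations = {
    ("circRNA", "miRNA"): circ_mir,
    ("miRNA", "circRNA"): invert(circ_mir),
    ("miRNA", "disease"): mir_dis,
}

# circRNA-miRNA-circRNA: circRNAs linked to circA through shared miRNAs.
circ_links = count_metapath_instances(relations, ("circRNA", "miRNA", "circRNA"), "circA")
# circRNA-miRNA-disease: diseases reachable from circA through its miRNAs.
dis_links = count_metapath_instances(relations, ("circRNA", "miRNA", "disease"), "circA")
print(circ_links, dis_links)
```

Running the same count from every circRNA gives a weighted circRNA-circRNA network for that meta-path, and each distinct meta-path produces its own network.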
Affiliation(s)
- Lei Chen
  - College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China
- Xiaoyu Zhao
  - College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China
|
14
|
Yang Y, Zhang Y, Ren J, Feng K, Li Z, Huang T, Cai Y. Identification of Colon Immune Cell Marker Genes Using Machine Learning Methods. Life (Basel) 2023; 13:1876. [PMID: 37763280 PMCID: PMC10532943 DOI: 10.3390/life13091876] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/30/2023] [Revised: 08/24/2023] [Accepted: 09/04/2023] [Indexed: 09/29/2023] Open
Abstract
Immune cell infiltration that occurs at the site of colon tumors influences the course of cancer. Different immune cell compositions in the microenvironment lead to different immune responses and different therapeutic effects. This study analyzed single-cell RNA sequencing data from normal colon with the aim of screening genetic markers of 25 candidate immune cell types and revealing quantitative differences between them. The dataset contains 25 classes of immune cells, 41,650 cells in total, and each cell is represented by the expression levels of 22,164 genes. They were fed into a machine learning-based workflow. Five feature ranking algorithms (least absolute shrinkage and selection operator, light gradient boosting machine, Monte Carlo feature selection, minimum redundancy maximum relevance, and random forest) were first used to analyze the importance of gene features, yielding five feature lists. Then, incremental feature selection and two classification algorithms (decision tree and random forest) were combined to filter the most important genetic markers from each list. For different immune cell subtypes, their marker genes, such as KLRB1 in CD4 T cells, RPL30 in B cell IGA plasma cells, and JCHAIN in IgG-producing B cells, were identified. They were confirmed to be differentially expressed in different immune cells and involved in immune processes. In addition, quantitative rules were summarized by using the decision tree algorithm to distinguish candidate immune cell types. These results provide a reference for exploring the cell composition of the colon cancer microenvironment and for clinical immunotherapy.
Affiliation(s)
- Yong Yang
  - Qianwei Hospital of Jilin Province, Changchun 130012, China
- Yuhang Zhang
  - Channing Division of Network Medicine, Brigham and Women’s Hospital, Harvard Medical School, Boston, MA 02115, USA
- Jingxin Ren
  - School of Life Sciences, Shanghai University, Shanghai 200444, China
- Kaiyan Feng
  - Department of Computer Science, Guangdong AIB Polytechnic College, Guangzhou 510507, China
- Zhandong Li
  - College of Biological and Food Engineering, Jilin Engineering Normal University, Changchun 130052, China
- Tao Huang
  - Bio-Med Big Data Center, CAS Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai 200031, China
  - CAS Key Laboratory of Tissue Microenvironment and Tumor, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai 200031, China
- Yudong Cai
  - School of Life Sciences, Shanghai University, Shanghai 200444, China
|