1
|
Chen L, Lu Y, Xu J, Zhou B. Prediction of drug's anatomical therapeutic chemical (ATC) code by constructing biological profiles of ATC codes. BMC Bioinformatics 2025; 26:86. [PMID: 40119265 PMCID: PMC11927162 DOI: 10.1186/s12859-025-06102-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/09/2024] [Accepted: 03/04/2025] [Indexed: 03/24/2025] Open
Abstract
BACKGROUND The Anatomical Therapeutic Chemical (ATC) classification system, proposed and maintained by the World Health Organization, is among the most widely used drug classification schemes. Recently, it has become a key research focus in drug repositioning. Computational models often pair drugs with ATC codes to explore drug-ATC code associations. However, the limited information available for ATC codes constrains these models, leaving significant room for improvement. RESULTS This study presents an inference method to identify highly related target proteins, structural features, and side effects for each ATC code, constructing comprehensive biological profiles. Association networks for target proteins, structural features, and side effects are established, and a random walk with restart algorithm is applied to these networks to extract raw associations. A permutation test is then conducted to exclude false positives, yielding robust biological profiles for ATC codes. These profiles are used to construct new ATC code kernels, which are integrated with ATC code kernels from the existing model PDATC-NCPMKL. The recommendation matrix is subsequently generated using the procedures of PDATC-NCPMKL. Cross-validation results demonstrate that the new model achieves AUROC and AUPR values exceeding 0.96. CONCLUSION The proposed model outperforms PDATC-NCPMKL and other previous models. Analysis of the contributions of the newly added ATC code kernels confirms the value of biological profiles in enhancing the prediction of drug-ATC code associations.
Collapse
Affiliation(s)
- Lei Chen
- College of Information Engineering, Shanghai Maritime University, Shanghai, 201306, People's Republic of China.
| | - Yiwen Lu
- College of Information Engineering, Shanghai Maritime University, Shanghai, 201306, People's Republic of China
| | - Jing Xu
- College of Information Engineering, Shanghai Maritime University, Shanghai, 201306, People's Republic of China
| | - Bo Zhou
- School of Basic Medical Sciences, Shanghai University of Medicine and Health Sciences, Shanghai, 201318, People's Republic of China
| |
Collapse
|
2
|
Zhang W, Tian Q, Cao Y, Fan W, Jiang D, Wang Y, Li Q, Wei XY. GraphATC: advancing multilevel and multi-label anatomical therapeutic chemical classification via atom-level graph learning. Brief Bioinform 2025; 26:bbaf194. [PMID: 40285359 PMCID: PMC12031726 DOI: 10.1093/bib/bbaf194] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/03/2024] [Revised: 03/06/2025] [Accepted: 04/07/2025] [Indexed: 04/29/2025] Open
Abstract
The accurate categorization of compounds within the anatomical therapeutic chemical (ATC) system is fundamental for drug development and fundamental research. Although this area has garnered significant research focus for over a decade, the majority of prior studies have concentrated solely on the Level 1 labels defined by the World Health Organization (WHO), neglecting the labels of the remaining four levels. This narrow focus fails to address the true nature of the task as a multilevel, multi-label classification challenge. Moreover, existing benchmarks like Chen-2012 and ATC-SMILES have become outdated, lacking the incorporation of new drugs or updated properties of existing ones that have emerged in recent years and have been integrated into the WHO ATC system. To tackle these shortcomings, we present a comprehensive approach in this paper. Firstly, we systematically cleanse and enhance the drug dataset, expanding it to encompass all five levels through a rigorous cross-resource validation process involving KEGG, PubChem, ChEMBL, ChemSpider, and ChemicalBook. This effort culminates in the creation of a novel benchmark termed ATC-GRAPH. Secondly, we extend the classification task to encompass Level 2 and introduce graph-based learning techniques to provide more accurate representations of drug molecular structures. This approach not only facilitates the modeling of Polymers, Macromolecules, and Multi-Component drugs more precisely but also enhances the overall fidelity of the classification process. The efficacy of our proposed framework is validated through extensive experiments, establishing a new state-of-the-art methodology. To facilitate the replication of this study, we have made the benchmark dataset, source code, and web server openly accessible.
Collapse
Affiliation(s)
- Wengyu Zhang
- Department of Computer Science, Sichuan University, Chengdu 610065, China
- Department of Computing, The Hong Kong Polytechnic University, Kowloon, Hong Kong
| | - Qi Tian
- Department of Computer Science, Sichuan University, Chengdu 610065, China
| | - Yi Cao
- Department of Computer Science, Sichuan University, Chengdu 610065, China
| | - Wenqi Fan
- Department of Computing, The Hong Kong Polytechnic University, Kowloon, Hong Kong
| | | | - Yaowei Wang
- Peng Cheng Laboratory, Shenzhen 518000, China
| | - Qing Li
- Department of Computing, The Hong Kong Polytechnic University, Kowloon, Hong Kong
| | - Xiao-Yong Wei
- Department of Computer Science, Sichuan University, Chengdu 610065, China
- Department of Computing, The Hong Kong Polytechnic University, Kowloon, Hong Kong
| |
Collapse
|
3
|
Chen L, Gu J, Zhou B. PMiSLocMF: predicting miRNA subcellular localizations by incorporating multi-source features of miRNAs. Brief Bioinform 2024; 25:bbae386. [PMID: 39154195 PMCID: PMC11330342 DOI: 10.1093/bib/bbae386] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/24/2024] [Revised: 07/04/2024] [Accepted: 07/23/2024] [Indexed: 08/19/2024] Open
Abstract
The microRNAs (miRNAs) play crucial roles in several biological processes. It is essential for a deeper insight into their functions and mechanisms by detecting their subcellular localizations. The traditional methods for determining miRNAs subcellular localizations are expensive. The computational methods are alternative ways to quickly predict miRNAs subcellular localizations. Although several computational methods have been proposed in this regard, the incomplete representations of miRNAs in these methods left the room for improvement. In this study, a novel computational method for predicting miRNA subcellular localizations, named PMiSLocMF, was developed. As lots of miRNAs have multiple subcellular localizations, this method was a multi-label classifier. Several properties of miRNA, such as miRNA sequences, miRNA functional similarity, miRNA-disease, miRNA-drug, and miRNA-mRNA associations were adopted for generating informative miRNA features. To this end, powerful algorithms [node2vec and graph attention auto-encoder (GATE)] and one newly designed scheme were adopted to process above properties, producing five feature types. All features were poured into self-attention and fully connected layers to make predictions. The cross-validation results indicated the high performance of PMiSLocMF with accuracy higher than 0.83, average area under the receiver operating characteristic curve (AUC) and area under the precision-recall curve (AUPR) exceeding 0.90 and 0.77, respectively. Such performance was better than all previous methods based on the same dataset. Further tests proved that using all feature types can improve the performance of PMiSLocMF, and GATE and self-attention layer can help enhance the performance. Finally, we deeply analyzed the influence of miRNA associations with diseases, drugs, and mRNAs on PMiSLocMF. The dataset and codes are available at https://github.com/Gu20201017/PMiSLocMF.
Collapse
Affiliation(s)
- Lei Chen
- College of Information Engineering, Shanghai Maritime University, 1550 Haigang Avenue, Pudong New District, Shanghai 201306, China
| | - Jiahui Gu
- College of Information Engineering, Shanghai Maritime University, 1550 Haigang Avenue, Pudong New District, Shanghai 201306, China
| | - Bo Zhou
- School of Basic Medical Sciences, Shanghai University of Medicine and Health Sciences, 279 Zhouzhu Road, Pudong New District, Shanghai 201318, China
| |
Collapse
|
4
|
Fu K, Du C, Wang S, He H. Multi-View Multi-Label Fine-Grained Emotion Decoding From Human Brain Activity. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2024; 35:9026-9040. [PMID: 36346867 DOI: 10.1109/tnnls.2022.3217767] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/16/2023]
Abstract
Decoding emotional states from human brain activity play an important role in the brain-computer interfaces. Existing emotion decoding methods still have two main limitations: one is only decoding a single emotion category from a brain activity pattern and the decoded emotion categories are coarse-grained, which is inconsistent with the complex emotional expression of humans; the other is ignoring the discrepancy of emotion expression between the left and right hemispheres of the human brain. In this article, we propose a novel multi-view multi-label hybrid model for fine-grained emotion decoding (up to 80 emotion categories) which can learn the expressive neural representations and predict multiple emotional states simultaneously. Specifically, the generative component of our hybrid model is parameterized by a multi-view variational autoencoder, in which we regard the brain activity of left and right hemispheres and their difference as three distinct views and use the product of expert mechanism in its inference network. The discriminative component of our hybrid model is implemented by a multi-label classification network with an asymmetric focal loss. For more accurate emotion decoding, we first adopt a label-aware module for emotion-specific neural representation learning and then model the dependency of emotional states by a masked self-attention mechanism. Extensive experiments on two visually evoked emotional datasets show the superiority of our method.
Collapse
|
5
|
Xu J, Ruan X, Yang J, Hu B, Li S, Hu J. SME-MFP: A novel spatiotemporal neural network with multiangle initialization embedding toward multifunctional peptides prediction. Comput Biol Chem 2024; 109:108033. [PMID: 38412804 DOI: 10.1016/j.compbiolchem.2024.108033] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/11/2023] [Revised: 01/09/2024] [Accepted: 02/17/2024] [Indexed: 02/29/2024]
Abstract
As a promising alternative to conventional antibiotic drugs in the biomedical field, functional peptide has been widely used in disease treatment owing to its low toxicity, high absorption rate, and biological activity. Recently, several machine learning methods have been developed for functional peptide prediction. However, the main research heavily relies on statistical features and few consider multifunctional peptide identification. So, we propose SME-MFP, a novel predictor in the imbalanced multi-label functional peptide datasets. First, we employ physicochemical and evolutionary information to represent the peptide sequence's initialization features from multiple perspectives. Second, the features are fused and then put into spatial feature extractors, where the residual connection and multiscale convolutional neural network extract more discriminative features of different lengths' peptide sequences. Besides, we also design AFT-based temporal feature extractors to fully capture the global interactions of the sequences. Finally, devising a new loss to replace the traditional cross entropy loss to settle the class imbalance problems. The results show that our framework not only enhances the model's ability to capture sequence features effectively, but also accuracy improves by 3.89% over existing methods on public peptide datasets.
Collapse
Affiliation(s)
- Jing Xu
- State Key Laboratory of Public Big Data, Guizhou University, Guiyang 550025, China
| | - Xiaoli Ruan
- State Key Laboratory of Public Big Data, Guizhou University, Guiyang 550025, China.
| | - Jing Yang
- State Key Laboratory of Public Big Data, Guizhou University, Guiyang 550025, China
| | - Bingqi Hu
- State Key Laboratory of Public Big Data, Guizhou University, Guiyang 550025, China
| | - Shaobo Li
- State Key Laboratory of Public Big Data, Guizhou University, Guiyang 550025, China
| | - Jianjun Hu
- Department of Computer Science and Engineering, University of South Carolina, Columbia 29208, USA
| |
Collapse
|
6
|
Chen L, Xu J, Zhou Y. PDATC-NCPMKL: Predicting drug's Anatomical Therapeutic Chemical (ATC) codes based on network consistency projection and multiple kernel learning. Comput Biol Med 2024; 169:107862. [PMID: 38150886 DOI: 10.1016/j.compbiomed.2023.107862] [Citation(s) in RCA: 6] [Impact Index Per Article: 6.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/18/2023] [Revised: 11/19/2023] [Accepted: 12/17/2023] [Indexed: 12/29/2023]
Abstract
The development and discovery of new drugs is time-consuming and needs lots of human and material resources. Therefore, discovery of novel effects of existing drugs is an important alternative way, which can accelerate the process of designing "new" drugs. The anatomical Therapeutic Chemical (ATC) classification system recommended by World Health Organization (WHO) is a basic research area in this regard. A novel ATC code of an existing drug suggests its novel effects. Some computational models have been proposed, which can predict the drug-ATC code associations. However, their performance is not very high. There still exist spaces for improvement. In this study, a new recommendation system (named PDATC-NCPMKL), which incorporated network consistency projection and multi-kernel learning, was designed to identify drug-ATC code associations. For drugs or ATC codes, several kernels were constructed, which were fused by a multiple kernel learning method and an additional kernel integration scheme. To enhance the performance, the drug-ATC code association adjacency matrix was reformulated by a variant of weighted K nearest known neighbors (WKNKN). The reformulated adjacency matrix, drug and ATC code kernels were fed into network consistency projection to generate the association score matrix. The proposed recommendation system was tested on the ATC codes at the second, third and fourth levels in drug ATC classification system using ten-fold cross-validation. The results indicated that all AUROC and AUPR values were close to or exceeded 0.96. Such performance was higher than some existing computational models. Some additional tests were conducted to prove the utility of adjacency matrix reformulation and to analyze the importance of drug and ATC code kernels.
Collapse
Affiliation(s)
- Lei Chen
- College of Information Engineering, Shanghai Maritime University, Shanghai, 201306, China.
| | - Jing Xu
- College of Information Engineering, Shanghai Maritime University, Shanghai, 201306, China.
| | - Yubin Zhou
- Department of Thoracic Surgery, Sichuan Provincial People's Hospital, University of Electronic Science and Technology of China, Chengdu, 610072, China.
| |
Collapse
|
7
|
Chen L, Zhang C, Xu J. PredictEFC: a fast and efficient multi-label classifier for predicting enzyme family classes. BMC Bioinformatics 2024; 25:50. [PMID: 38291384 PMCID: PMC10829269 DOI: 10.1186/s12859-024-05665-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/29/2023] [Accepted: 01/22/2024] [Indexed: 02/01/2024] Open
Abstract
BACKGROUND Enzymes play an irreplaceable and important role in maintaining the lives of living organisms. The Enzyme Commission (EC) number of an enzyme indicates its essential functions. Correct identification of the first digit (family class) of the EC number for a given enzyme is a hot topic in the past twenty years. Several previous methods adopted functional domain composition to represent enzymes. However, it would lead to dimension disaster, thereby reducing the efficiency of the methods. On the other hand, most previous methods can only deal with enzymes belonging to one family class. In fact, several enzymes belong to two or more family classes. RESULTS In this study, a fast and efficient multi-label classifier, named PredictEFC, was designed. To construct this classifier, a novel feature extraction scheme was designed for processing functional domain information of enzymes, which counting the distribution of each functional domain entry across seven family classes in the training dataset. Based on this scheme, each training or test enzyme was encoded into a 7-dimenion vector by fusing its functional domain information and above statistical results. Random k-labelsets (RAKEL) was adopted to build the classifier, where random forest was selected as the base classification algorithm. The two tenfold cross-validation results on the training dataset shown that the accuracy of PredictEFC can reach 0.8493 and 0.8370. The independent test on two datasets indicated the accuracy values of 0.9118 and 0.8777. CONCLUSION The performance of PredictEFC was slightly lower than the classifier directly using functional domain composition. However, its efficiency was sharply improved. The running time was less than one-tenth of the time of the classifier directly using functional domain composition. In additional, the utility of PredictEFC was superior to the classifiers using traditional dimensionality reduction methods and some previous methods, and this classifier can be transplanted for predicting enzyme family classes of other species. Finally, a web-server available at http://124.221.158.221/ was set up for easy usage.
Collapse
Affiliation(s)
- Lei Chen
- College of Information Engineering, Shanghai Maritime University, Shanghai, 201306, People's Republic of China.
| | - Chenyu Zhang
- College of Information Engineering, Shanghai Maritime University, Shanghai, 201306, People's Republic of China
| | - Jing Xu
- College of Information Engineering, Shanghai Maritime University, Shanghai, 201306, People's Republic of China
| |
Collapse
|
8
|
Zhou B, Ran B, Chen L. A GraphSAGE-based model with fingerprints only to predict drug-drug interactions. MATHEMATICAL BIOSCIENCES AND ENGINEERING : MBE 2024; 21:2922-2942. [PMID: 38454713 DOI: 10.3934/mbe.2024130] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 03/09/2024]
Abstract
Drugs are an effective way to treat various diseases. Some diseases are so complicated that the effect of a single drug for such diseases is limited, which has led to the emergence of combination drug therapy. The use multiple drugs to treat these diseases can improve the drug efficacy, but it can also bring adverse effects. Thus, it is essential to determine drug-drug interactions (DDIs). Recently, deep learning algorithms have become popular to design DDI prediction models. However, most deep learning-based models need several types of drug properties, inducing the application problems for drugs without these properties. In this study, a new deep learning-based model was designed to predict DDIs. For wide applications, drugs were first represented by commonly used properties, referred to as fingerprint features. Then, these features were perfectly fused with the drug interaction network by a type of graph convolutional network method, GraphSAGE, yielding high-level drug features. The inner product was adopted to score the strength of drug pairs. The model was evaluated by 10-fold cross-validation, resulting in an AUROC of 0.9704 and AUPR of 0.9727. Such performance was better than the previous model which directly used drug fingerprint features and was competitive compared with some other previous models that used more drug properties. Furthermore, the ablation tests indicated the importance of the main parts of the model, and we analyzed the strengths and limitations of a model for drugs with different degrees in the network. This model identified some novel DDIs that may bring expected benefits, such as the combination of PEA and cannabinol that may produce better effects. DDIs that may cause unexpected side effects have also been discovered, such as the combined use of WIN 55,212-2 and cannabinol. These DDIs can provide novel insights for treating complex diseases or avoiding adverse drug events.
Collapse
Affiliation(s)
- Bo Zhou
- Institute of Wound Prevention and Treatment, Shanghai University of Medicine and Health Sciences, Shanghai 201318, China
- School of Basic Medical Sciences, Shanghai University of Medicine and Health Sciences, Shanghai 201318, China
| | - Bing Ran
- College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China
| | - Lei Chen
- College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China
| |
Collapse
|
9
|
Chen L, Qu R, Liu X. Improved multi-label classifiers for predicting protein subcellular localization. MATHEMATICAL BIOSCIENCES AND ENGINEERING : MBE 2024; 21:214-236. [PMID: 38303420 DOI: 10.3934/mbe.2024010] [Citation(s) in RCA: 9] [Impact Index Per Article: 9.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 02/03/2024]
Abstract
Protein functions are closely related to their subcellular locations. At present, the prediction of protein subcellular locations is one of the most important problems in protein science. The evident defects of traditional methods make it urgent to design methods with high efficiency and low costs. To date, lots of computational methods have been proposed. However, this problem is far from being completely solved. Recently, some multi-label classifiers have been proposed to identify subcellular locations of human, animal, Gram-negative bacterial and eukaryotic proteins. These classifiers adopted the protein features derived from gene ontology information. Although they provided good performance, they can be further improved by adopting more powerful machine learning algorithms. In this study, four improved multi-label classifiers were set up for identification of subcellular locations of the above four protein types. The random k-labelsets (RAKEL) algorithm was used to tackle proteins with multiple locations, and random forest was used as the basic prediction engine. All classifiers were tested by jackknife test, indicating their high performance. Comparisons with previous classifiers further confirmed the superiority of the proposed classifiers.
Collapse
Affiliation(s)
- Lei Chen
- College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China
| | - Ruyun Qu
- College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China
| | - Xintong Liu
- College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China
| |
Collapse
|
10
|
Chen L, Chen Y. RMTLysPTM: recognizing multiple types of lysine PTM sites by deep analysis on sequences. Brief Bioinform 2023; 25:bbad450. [PMID: 38066710 PMCID: PMC10783864 DOI: 10.1093/bib/bbad450] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/01/2023] [Revised: 10/24/2023] [Accepted: 11/15/2023] [Indexed: 12/18/2023] Open
Abstract
Post-translational modification (PTM) occurs after a protein is translated from ribonucleic acid. It is an important living creature life phenomenon because it is implicated in almost all cellular processes. Identification of PTM sites from a given protein sequence is a hot topic in bioinformatics. Lots of computational methods have been proposed, and they provide good performance. However, most previous methods can only tackle one PTM type. Few methods consider multiple PTM types. In this study, a multi-label classification model, named RMTLysPTM, was developed to recognize four types of lysine (K) PTM sites, including acetylation, crotonylation, methylation and succinylation. The surrounding sites of a lysine site were selected to constitute a peptide segment, representing the lysine at the center. Deep analysis was conducted to count the distribution of 2-residues with fixed location across the four types of lysine PTM sites. By aggregating the distribution information of 2-residues in one peptide segment, the peptide segment was encoded by informative features. Furthermore, a prediction engine that can precisely capture the traits of the above representations was designed to recognize the types of lysine PTM sites. The cross-validation results on two datasets (Qiu and CPLM training datasets) suggested that the model had extremely high performance and RMTLysPTM had strong generalization ability by testing it on protein Q16778 and CPLM testing datasets. The model was found to be generally superior to all previous models and those using popular methods and features. A web server was set up for RMTLysPTM, and it can be accessed at http://119.3.127.138/.
Collapse
Affiliation(s)
- Lei Chen
- College of Information Engineering, Shanghai Maritime University, Shanghai 201306, People’s Republic of China
| | - Yuwei Chen
- College of Information Engineering, Shanghai Maritime University, Shanghai 201306, People’s Republic of China
| |
Collapse
|
11
|
He C, Qu Y, Yin J, Zhao Z, Ma R, Duan L. Cross-view contrastive representation learning approach to predicting DTIs via integrating multi-source information. Methods 2023; 218:176-188. [PMID: 37586602 DOI: 10.1016/j.ymeth.2023.08.006] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/29/2023] [Revised: 07/26/2023] [Accepted: 08/08/2023] [Indexed: 08/18/2023] Open
Abstract
Drug-target interaction (DTI) prediction serves as the foundation of new drug findings and drug repositioning. For drugs/targets, the sequence data contains the biological structural information, while the heterogeneous network contains the biochemical functional information. These two types of information describe different aspects of drugs and targets. Due to the complexity of DTI machinery, it is necessary to learn the representation from multiple perspectives. We hereby try to design a way to leverage information from multi-source data to the maximum extent and find a strategy to fuse them. To address the above challenges, we propose a model, named MOVE (short for integrating multi-source information for predicting DTI via cross-view contrastive learning), for learning comprehensive representations of each drug and target from multi-source data. MOVE extracts information from the sequence view and the network view, then utilizes a fusion module with auxiliary contrastive learning to facilitate the fusion of representations. Experimental results on the benchmark dataset demonstrate that MOVE is effective in DTI prediction.
Collapse
Affiliation(s)
- Chengxin He
- School of Computer Science, Sichuan University, Chengdu 610065, China; Med-X Center for Informatics, Sichuan University, Chengdu 610065, China
| | - Yuening Qu
- School of Computer Science, Sichuan University, Chengdu 610065, China
| | - Jin Yin
- The West China Biomedical Big Data Center, West China Hospital, Sichuan University, Chengdu 610065, China
| | - Zhenjiang Zhao
- School of Computer Science, Sichuan University, Chengdu 610065, China
| | - Runze Ma
- School of Computer Science, Sichuan University, Chengdu 610065, China
| | - Lei Duan
- School of Computer Science, Sichuan University, Chengdu 610065, China; Med-X Center for Informatics, Sichuan University, Chengdu 610065, China.
| |
Collapse
|
12
|
Zhao H, Duan G, Ni P, Yan C, Li Y, Wang J. RNPredATC: A Deep Residual Learning-Based Model With Applications to the Prediction of Drug-ATC Code Association. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2023; 20:2712-2723. [PMID: 34110998 DOI: 10.1109/tcbb.2021.3088256] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/12/2023]
Abstract
The Anatomical Therapeutic Chemical (ATC) classification system, designated by the World Health Organization Collaborating Center (WHOCC), has been widely used in drug screening, repositioning, and similarity research. The ATC classification system assigns different codes to drugs according to the organ or system on which they act and/or their therapeutic and chemical characteristics. Correctly identifying the potential ATC codes for drugs can accelerate drug development and reduce the cost of experiments. Several classifiers have been proposed in this regard. However, they lack of ability to learn basic features from sparsely known drug-ATC code associations. Therefore, there is an urgent need for novel computational methods to precisely predict potential drug-ATC code associations in multiple levels of the ATC classification system based on known associations between drugs and ATC codes. In this paper, we provide a novel end-to-end model, so-called RNPredATC, to predict potential drug-ATC code associations in five ATC classification levels. RNPredATC can extract dense feature vectors from sparsely known drug-ATC code associations and reduce the impact from the degradation problem by a novel deep residual learning. We extensively compare our method with some state-of-the-art methods, including NetPredATC, SPACE, and some multi-label-based methods. Our experimental results show that RNPredATC achieves better performances in five-fold and ten-fold cross validations. Furthermore, the visualization analysis of hidden layers and case studies of predicted associations at the fifth ATC classification level confirm that RNPredATC can effectively identify the potential ATC codes of drugs.
Collapse
|
13
|
Wu C, Chen L. A model with deep analysis on a large drug network for drug classification. MATHEMATICAL BIOSCIENCES AND ENGINEERING : MBE 2023; 20:383-401. [PMID: 36650771 DOI: 10.3934/mbe.2023018] [Citation(s) in RCA: 25] [Impact Index Per Article: 12.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/17/2023]
Abstract
Drugs are an important means to treat various diseases. They are classified into several classes to indicate their properties and effects. Those in the same class always share some important features. The Kyoto Encyclopedia of Genes and Genomes (KEGG) DRUG recently reported a new drug classification system that classifies drugs into 14 classes. Correct identification of the class for any possible drug-like compound is helpful to roughly determine its effects for a particular type of disease. Experiments could be conducted to confirm such latent effects, thus accelerating the procedures for discovering novel drugs. In this study, this classification system was investigated. A classification model was proposed to assign one of the classes in the system to any given drug for the first time. Different from traditional fingerprint features, which indicated essential drug properties alone and were very popular in investigating drug-related problems, drugs were represented by novel features derived from a large drug network via a well-known network embedding algorithm called Node2vec. These features abstracted the drug associations generated from their essential properties, and they could overview each drug with all drugs as background. As class sizes were of great differences, synthetic minority over-sampling technique (SMOTE) was employed to tackle the imbalance problem. A balanced dataset was fed into the support vector machine to build the model. The 10-fold cross-validation results suggested the excellent performance of the model. This model was also superior to models using other drug features, including those generated by another network embedding algorithm and fingerprint features. Furthermore, this model provided more balanced performance across all classes than that without SMOTE.
Collapse
Affiliation(s)
- Chenhao Wu
- College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China
| | - Lei Chen
- College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China
| |
Collapse
|
14
|
Yi W, Sun A, Liu M, Liu X, Zhang W, Dai Q. Comparative Study on Feature Selection in Protein Structure and Function Prediction. COMPUTATIONAL AND MATHEMATICAL METHODS IN MEDICINE 2022; 2022:1650693. [PMID: 36267316 PMCID: PMC9578875 DOI: 10.1155/2022/1650693] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 07/27/2022] [Accepted: 09/14/2022] [Indexed: 11/18/2022]
Abstract
Many effective methods extract and fuse different protein features to study the relationship between protein sequence, structure, and function, but different methods have preferences in solving the research of protein structure and function, which requires selecting valuable and contributing features to design more effective prediction methods. This work mainly focused on the feature selection methods in the study of protein structure and function, and systematically compared and analyzed the efficiency of different feature selection methods in the prediction of protein structures, protein disorders, protein molecular chaperones, and protein solubility. The results show that the feature selection method based on nonlinear SVM performs best in protein structure prediction, protein solubility prediction, protein molecular chaperone prediction, and protein solubility prediction. After selection, the accuracy of features is improved by 13.16% ~71%, especially the Kmer features and PSSM features of proteins.
Collapse
Affiliation(s)
- Wenjing Yi
- College of Life Sciences, Zhejiang Sci-Tech University, Hangzhou 310018, China
| | - Ao Sun
- College of Informatics Science and Technology, Zhejiang Sci-Tech University, Hangzhou 310018, China
| | - Manman Liu
- College of Informatics Science and Technology, Zhejiang Sci-Tech University, Hangzhou 310018, China
| | - Xiaoqing Liu
- College of Sciences, Hangzhou Dianzi University, Hangzhou 310018, China
| | - Wei Zhang
- College of Informatics Science and Technology, Zhejiang Sci-Tech University, Hangzhou 310018, China
| | - Qi Dai
- College of Life Sciences, Zhejiang Sci-Tech University, Hangzhou 310018, China
| |
Collapse
|
15
|
Liu X, Yi W, Xi B, Dai Q. Identification of Drug-Disease Associations Using a Random Walk with Restart Method and Supervised Learning. COMPUTATIONAL AND MATHEMATICAL METHODS IN MEDICINE 2022; 2022:7035634. [PMID: 36262874 PMCID: PMC9576438 DOI: 10.1155/2022/7035634] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 07/24/2022] [Accepted: 09/23/2022] [Indexed: 11/17/2022]
Abstract
Drug-disease correlations play an important role in revealing the mechanism of disease, finding new indications of available drugs, or drug repositioning. A variety of computational approaches were proposed to find drug-disease correlations and achieve good performances. However, these methods used a variety of network information, but integrated networks were rarely used. In addition, the role of known drug-disease association data has not been fully played. In this work, we designed a combination algorithm of random walk and supervised learning to find the drug-disease correlations. We used an integrated network to update the model and selected a gene set as the start of random walk based on the known drug-disease correlations data. The experimental results show that the proposed method can effectively find the correlation between drugs and diseases, and the prediction accuracy is 82.7%. We found that there are 8 pairs of drug-disease relationships that have not yet been reported, and 5 of them have pharmacodynamic effects on Parkinson's disease. We also found that a key linkage between Parkinson's disease and phenylhexol, a drug for the treatment of Parkinson's disease α-synuclein and tau protein, provides a useful exploration for the effectiveness of the treatment of Parkinson's disease.
Collapse
Affiliation(s)
- Xiaoqing Liu
- College of Sciences, Hangzhou Dianzi University, Hangzhou 310018, China
| | - Wenjing Yi
- College of Life Sciences, Zhejiang Sci-Tech University, Hangzhou 310018, China
| | - Baohang Xi
- College of Life Sciences, Zhejiang Sci-Tech University, Hangzhou 310018, China
| | - Qi Dai
- College of Life Sciences, Zhejiang Sci-Tech University, Hangzhou 310018, China
| |
Collapse
|
16
|
Liu Z, Meng M, Ding S, Zhou X, Feng K, Huang T, Cai YD. Identification of methylation signatures and rules for predicting the severity of SARS-CoV-2 infection with machine learning methods. Front Microbiol 2022; 13:1007295. [PMID: 36212830 PMCID: PMC9537378 DOI: 10.3389/fmicb.2022.1007295] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/30/2022] [Accepted: 09/01/2022] [Indexed: 11/17/2022] Open
Abstract
Patients infected with SARS-CoV-2 at various severities have different clinical manifestations and treatments. Mild or moderate patients usually recover with conventional medical treatment, but severe patients require prompt professional treatment. Thus, stratifying infected patients for targeted treatment is meaningful. A computational workflow was designed in this study to identify key blood methylation features and rules that can distinguish the severity of SARS-CoV-2 infection. First, the methylation features in the expression profile were deeply analyzed by a Monte Carlo feature selection method. A feature list was generated. Next, this ranked feature list was fed into the incremental feature selection method to determine the optimal features for different classification algorithms, thereby further building optimal classifiers. These selected key features were analyzed by functional enrichment to detect their biofunctional information. Furthermore, a set of rules were set up by a white-box algorithm, decision tree, to uncover different methylation patterns on various severity of SARS-CoV-2 infection. Some genes (PARP9, MX1, IRF7), corresponding to essential methylation sites, and rules were validated by published academic literature. Overall, this study contributes to revealing potential expression features and provides a reference for patient stratification. The physicians can prioritize and allocate health and medical resources for COVID-19 patients based on their predicted severe clinical outcomes.
Collapse
Affiliation(s)
- Zhiyang Liu
- School of Life Sciences, Changchun Sci-Tech University, Changchun, China
| | - Mei Meng
- State Key Laboratory of Oncogenes and Related Genes, Center for Single-Cell Omics, School of Public Health, Shanghai Jiao Tong University School of Medicine, Shanghai, China
| | - ShiJian Ding
- School of Life Sciences, Shanghai University, Shanghai, China
| | - XiaoChao Zhou
- State Key Laboratory of Oncogenes and Related Genes, Center for Single-Cell Omics, School of Public Health, Shanghai Jiao Tong University School of Medicine, Shanghai, China
| | - KaiYan Feng
- Department of Computer Science, Guangdong AIB Polytechnic College, Guangzhou, China
| | - Tao Huang
- Bio-Med Big Data Center, CAS Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, China
- CAS Key Laboratory of Tissue Microenvironment and Tumor, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, China
- *Correspondence: Tao Huang,
| | - Yu-Dong Cai
- School of Life Sciences, Shanghai University, Shanghai, China
- Yu-Dong Cai,
| |
Collapse
|
17
|
Subcellular Localization Prediction of Human Proteins Using Multifeature Selection Methods. BIOMED RESEARCH INTERNATIONAL 2022; 2022:3288527. [PMID: 36132086 PMCID: PMC9484878 DOI: 10.1155/2022/3288527] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 03/25/2022] [Accepted: 08/30/2022] [Indexed: 11/25/2022]
Abstract
Subcellular localization attempts to assign proteins to one of the cell compartments that performs specific biological functions. Finding the link between proteins, biological functions, and subcellular localization is an effective way to investigate the general organization of living cells in a systematic manner. However, determining the subcellular localization of proteins by traditional experimental approaches is difficult. Here, protein–protein interaction networks, functional enrichment on gene ontology and pathway, and a set of proteins having confirmed subcellular localization were applied to build prediction models for human protein subcellular localizations. To build an effective predictive model, we employed a variety of robust machine learning algorithms, including Boruta feature selection, minimum redundancy maximum relevance, Monte Carlo feature selection, and LightGBM. Then, the incremental feature selection method with random forest and support vector machine was used to discover the essential features. Furthermore, 38 key features were determined by integrating results of different feature selection methods, which may provide critical insights into the subcellular location of proteins. Their biological functions of subcellular localizations were discussed according to recent publications. In summary, our computational framework can help advance the understanding of subcellular localization prediction techniques and provide a new perspective to investigate the patterns of protein subcellular localization and their biological importance.
Collapse
|
18
|
Lu S, Wang H, Zhang J. Identification of uveitis-associated functions based on the feature selection analysis of gene ontology and Kyoto Encyclopedia of Genes and Genomes pathway enrichment scores. Front Mol Neurosci 2022; 15:1007352. [PMID: 36157069 PMCID: PMC9493498 DOI: 10.3389/fnmol.2022.1007352] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/30/2022] [Accepted: 08/22/2022] [Indexed: 11/13/2022] Open
Abstract
Uveitis is a typical type of eye inflammation affecting the middle layer of eye (i.e., uvea layer) and can lead to blindness in middle-aged and young people. Therefore, a comprehensive study determining the disease susceptibility and the underlying mechanisms for uveitis initiation and progression is urgently needed for the development of effective treatments. In the present study, 108 uveitis-related genes are collected on the basis of literature mining, and 17,560 other human genes are collected from the Ensembl database, which are treated as non-uveitis genes. Uveitis- and non-uveitis-related genes are then encoded by gene ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment scores based on the genes and their neighbors in STRING, resulting in 20,681 GO term features and 297 KEGG pathway features. Subsequently, we identify functions and biological processes that can distinguish uveitis-related genes from other human genes by using an integrated feature selection method, which incorporate feature filtering method (Boruta) and four feature importance assessment methods (i.e., LASSO, LightGBM, MCFS, and mRMR). Some essential GO terms and KEGG pathways related to uveitis, such as GO:0001841 (neural tube formation), has04612 (antigen processing and presentation in human beings), and GO:0043379 (memory T cell differentiation), are identified. The plausibility of the association of mined functional features with uveitis is verified on the basis of the literature. Overall, several advanced machine learning methods are used in the current study to uncover specific functions of uveitis and provide a theoretical foundation for the clinical treatment of uveitis.
Collapse
Affiliation(s)
- Shiheng Lu
- Department of Ophthalmology, Shanghai Eye Disease Prevention and Treatment Center, Shanghai Eye Hospital, Shanghai, China
- Department of Ophthalmology, Shanghai General Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, China
- Shanghai Key Laboratory of Ocular Fundus Diseases, Shanghai, China
- Shanghai Engineering Center for Visual Science and Photomedicine, Shanghai, China
- National Clinical Research Center for Eye Diseases, Shanghai, China
- Shanghai Engineering Research Center for Precise Diagnosis and Treatment of Eye Diseases, Shanghai, China
- *Correspondence: Shiheng Lu,
| | - Hui Wang
- Department of Orthopedics, Shanghai Yangpu Hospital of Traditional Chinese Medicine, Shanghai, China
| | - Jian Zhang
- Department of Ophthalmology, Shanghai General Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, China
- Shanghai Key Laboratory of Ocular Fundus Diseases, Shanghai, China
- Shanghai Engineering Center for Visual Science and Photomedicine, Shanghai, China
- National Clinical Research Center for Eye Diseases, Shanghai, China
- Shanghai Engineering Research Center for Precise Diagnosis and Treatment of Eye Diseases, Shanghai, China
- Jian Zhang,
| |
Collapse
|
19
|
Li Z, Wang D, Guo W, Zhang S, Chen L, Zhang YH, Lu L, Pan X, Huang T, Cai YD. Identification of cortical interneuron cell markers in mouse embryos based on machine learning analysis of single-cell transcriptomics. Front Neurosci 2022; 16:841145. [PMID: 35911980 PMCID: PMC9337837 DOI: 10.3389/fnins.2022.841145] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/24/2021] [Accepted: 06/28/2022] [Indexed: 11/13/2022] Open
Abstract
Mammalian cortical interneurons (CINs) could be classified into more than two dozen cell types that possess diverse electrophysiological and molecular characteristics, and participate in various essential biological processes in the human neural system. However, the mechanism to generate diversity in CINs remains controversial. This study aims to predict CIN diversity in mouse embryo by using single-cell transcriptomics and the machine learning methods. Data of 2,669 single-cell transcriptome sequencing results are employed. The 2,669 cells are classified into three categories, caudal ganglionic eminence (CGE) cells, dorsal medial ganglionic eminence (dMGE) cells, and ventral medial ganglionic eminence (vMGE) cells, corresponding to the three regions in the mouse subpallium where the cells are collected. Such transcriptomic profiles were first analyzed by the minimum redundancy and maximum relevance method. A feature list was obtained, which was further fed into the incremental feature selection, incorporating two classification algorithms (random forest and repeated incremental pruning to produce error reduction), to extract key genes and construct powerful classifiers and classification rules. The optimal classifier could achieve an MCC of 0.725, and category-specified prediction accuracies of 0.958, 0.760, and 0.737 for the CGE, dMGE, and vMGE cells, respectively. The related genes and rules may provide helpful information for deepening the understanding of CIN diversity.
Collapse
Affiliation(s)
- Zhandong Li
- College of Biological and Food Engineering, Jilin Engineering Normal University, Changchun, China
| | - Deling Wang
- State Key Laboratory of Oncology in South China, Department of Radiology, Collaborative Innovation Center for Cancer Medicine, Sun Yat-sen University Cancer Center, Guangzhou, China
| | - Wei Guo
- Key Laboratory of Stem Cell Biology, Shanghai Jiao Tong University School of Medicine, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, China
| | - Shiqi Zhang
- Department of Biostatistics, University of Copenhagen, Copenhagen, Denmark
| | - Lei Chen
- College of Information Engineering, Shanghai Maritime University, Shanghai, China
| | - Yu-Hang Zhang
- Channing Division of Network Medicine, Harvard Medical School, Brigham and Women’s Hospital, Boston, MA, United States
| | - Lin Lu
- Department of Radiology, Columbia University Irving Medical Center, New York, NY, United States
| | - XiaoYong Pan
- Key Laboratory of System Control and Information Processing, Ministry of Education of China, Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, Shanghai, China
| | - Tao Huang
- CAS Key Laboratory of Tissue Microenvironment and Tumor, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, China
- *Correspondence: Tao Huang,
| | - Yu-Dong Cai
- School of Life Sciences, Shanghai University, Shanghai, China
- Yu-Dong Cai,
| |
Collapse
|
20
|
Yan C, Suo Z, Wang J, Zhang G, Luo H. DACPGTN: Drug ATC Code Prediction Method Based on Graph Transformer Network for Drug Discovery. Front Pharmacol 2022; 13:907676. [PMID: 35721178 PMCID: PMC9198367 DOI: 10.3389/fphar.2022.907676] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/30/2022] [Accepted: 05/06/2022] [Indexed: 11/21/2022] Open
Abstract
The Anatomical Therapeutic Chemical (ATC) classification system is a drug classification scheme proposed by the World Health Organization, which is widely used for drug screening, repositioning, and similarity research. The ATC system assigns different ATC codes to drugs based on their anatomy, pharmacological, therapeutics and chemical properties. Predicting the ATC code of a given drug helps to understand the indication and potential toxicity of the drug, thus promoting its use in the therapeutic phase and accelerating its development. In this article, we propose an end-to-end model DACPGTN to predict the ATC code for the given drug. DACPGTN constructs composite features of drugs, diseases and targets by applying diverse biomedical information. Inspired by the application of Graph Transformer Network, we learn potential novel interactions among drugs diseases and targets from the known interactions to construct drug-target-disease heterogeneous networks containing comprehensive interaction information. Based on the constructed composite features and learned heterogeneous networks, we employ graph convolution network to generate the embedding of drug nodes, which are further used for the multi-label learning tasks in drug discovery. Experiments on the benchmark datasets demonstrate that the proposed DACPGTN model can achieve better prediction performance than the existing methods. The source codes of our method are available at https://github.com/Szhgege/DACPGTN.
Collapse
Affiliation(s)
- Chaokun Yan
- School of Computer and Information Engineering, Henan University, Kaifeng, China.,Henan Key Laboratory of Big Data Analysis and Processing, Henan University, Kaifeng, China
| | - Zhihao Suo
- School of Computer and Information Engineering, Henan University, Kaifeng, China.,Henan Key Laboratory of Big Data Analysis and Processing, Henan University, Kaifeng, China
| | - Jianlin Wang
- School of Computer and Information Engineering, Henan University, Kaifeng, China.,Henan Key Laboratory of Big Data Analysis and Processing, Henan University, Kaifeng, China
| | - Ge Zhang
- School of Computer and Information Engineering, Henan University, Kaifeng, China.,Henan Key Laboratory of Big Data Analysis and Processing, Henan University, Kaifeng, China
| | - Huimin Luo
- School of Computer and Information Engineering, Henan University, Kaifeng, China.,Henan Key Laboratory of Big Data Analysis and Processing, Henan University, Kaifeng, China
| |
Collapse
|
21
|
Li H, Zhang S, Chen L, Pan X, Li Z, Huang T, Cai YD. Identifying Functions of Proteins in Mice With Functional Embedding Features. Front Genet 2022; 13:909040. [PMID: 35651937 PMCID: PMC9149260 DOI: 10.3389/fgene.2022.909040] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/31/2022] [Accepted: 04/28/2022] [Indexed: 12/02/2022] Open
Abstract
In current biology, exploring the biological functions of proteins is important. Given the large number of proteins in some organisms, exploring their functions one by one through traditional experiments is impossible. Therefore, developing quick and reliable methods for identifying protein functions is necessary. Considerable accumulation of protein knowledge and recent developments on computer science provide an alternative way to complete this task, that is, designing computational methods. Several efforts have been made in this field. Most previous methods have adopted the protein sequence features or directly used the linkage from a protein–protein interaction (PPI) network. In this study, we proposed some novel multi-label classifiers, which adopted new embedding features to represent proteins. These features were derived from functional domains and a PPI network via word embedding and network embedding, respectively. The minimum redundancy maximum relevance method was used to assess the features, generating a feature list. Incremental feature selection, incorporating RAndom k-labELsets to construct multi-label classifiers, used such list to construct two optimum classifiers, corresponding to two key measurements: accuracy and exact match. These two classifiers had good performance, and they were superior to classifiers that used features extracted by traditional methods.
Collapse
Affiliation(s)
- Hao Li
- College of Biological and Food Engineering, Jilin Engineering Normal University, Changchun, China
| | - ShiQi Zhang
- Department of Biostatistics, University of Copenhagen, Copenhagen, Denmark
| | - Lei Chen
- College of Information Engineering, Shanghai Maritime University, Shanghai, China
| | - Xiaoyong Pan
- Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, and Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai, China
| | - ZhanDong Li
- College of Biological and Food Engineering, Jilin Engineering Normal University, Changchun, China
| | - Tao Huang
- Bio-Med Big Data Center, CAS Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, China.,CAS Key Laboratory of Tissue Microenvironment and Tumor, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, China
| | - Yu-Dong Cai
- School of Life Sciences, Shanghai University, Shanghai, China
| |
Collapse
|
22
|
Li Z, Pan X, Cai YD. Identification of Type 2 Diabetes Biomarkers From Mixed Single-Cell Sequencing Data With Feature Selection Methods. Front Bioeng Biotechnol 2022; 10:890901. [PMID: 35721855 PMCID: PMC9201257 DOI: 10.3389/fbioe.2022.890901] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/07/2022] [Accepted: 04/04/2022] [Indexed: 11/18/2022] Open
Abstract
Diabetes is the most common disease and a major threat to human health. Type 2 diabetes (T2D) makes up about 90% of all cases. With the development of high-throughput sequencing technologies, more and more fundamental pathogenesis of T2D at genetic and transcriptomic levels has been revealed. The recent single-cell sequencing can further reveal the cellular heterogenicity of complex diseases in an unprecedented way. With the expectation on the molecular essence of T2D across multiple cell types, we investigated the expression profiling of more than 1,600 single cells (949 cells from T2D patients and 651 cells from normal controls) and identified the differential expression profiling and characteristics at the transcriptomics level that can distinguish such two groups of cells at the single-cell level. The expression profile was analyzed by several machine learning algorithms, including Monte Carlo feature selection, support vector machine, and repeated incremental pruning to produce error reduction (RIPPER). On one hand, some T2D-associated genes (MTND4P24, MTND2P28, and LOC100128906) were discovered. On the other hand, we revealed novel potential pathogenic mechanisms in a rule manner. They are induced by newly recognized genes and neglected by traditional bulk sequencing techniques. Particularly, the newly identified T2D genes were shown to follow specific quantitative rules with diabetes prediction potentials, and such rules further indicated several potential functional crosstalks involved in T2D.
Collapse
Affiliation(s)
- Zhandong Li
- College of Biological and Food Engineering, Jilin Engineering Normal University, Changchun, China
| | - Xiaoyong Pan
- Key Laboratory of System Control and Information Processing, Institute of Image Processing and Pattern Recognition, Ministry of Education of China, Shanghai Jiao Tong University, Shanghai, China
| | - Yu-Dong Cai
- School of Life Sciences, Shanghai University, Shanghai, China
- *Correspondence: Yu-Dong Cai,
| |
Collapse
|
23
|
Ran B, Chen L, Li M, Han Y, Dai Q. Drug-Drug Interactions Prediction Using Fingerprint Only. COMPUTATIONAL AND MATHEMATICAL METHODS IN MEDICINE 2022; 2022:7818480. [PMID: 35586666 PMCID: PMC9110191 DOI: 10.1155/2022/7818480] [Citation(s) in RCA: 19] [Impact Index Per Article: 6.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 03/16/2022] [Accepted: 04/21/2022] [Indexed: 12/27/2022]
Abstract
Combination drug therapy is an efficient way to treat complicated diseases. Drug-drug interaction (DDI) is an important research topic in this therapy as patient safety is a problem when two or more drugs are taken at the same time. Traditionally, in vitro experiments and clinical trials are common ways to determine DDIs. However, these methods cannot meet the requirements of large-scale tests. It is an alternative way to develop computational methods for predicting DDIs. Although several previous methods have been proposed, they always need several types of drug information, limiting their applications. In this study, we proposed a simple computational method to predict DDIs. In this method, drugs were represented by their fingerprint features, which are most widely used in investigating drug-related problems. These features were refined by three models, including addition, subtraction, and Hadamard models, to generate the representation of DDIs. The powerful classification algorithm, random forest, was picked up to build the classifier. The results of two types of tenfold cross-validation on the classifier indicated good performance for discovering novel DDIs among known drugs and acceptable performance for identifying DDIs between known drugs and unknown drugs or among unknown drugs. Although the classifier adopted a sample scheme to represent DDIs, it was still superior to other methods, which adopted features generated by some advanced computer algorithms. Furthermore, a user-friendly web-server, named DDIPF (http://106.14.164.77:5004/DDIPF/), was developed to implement the classifier.
Collapse
Affiliation(s)
- Bing Ran
- College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China
| | - Lei Chen
- College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China
| | - Meijing Li
- College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China
| | - Yujuan Han
- College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China
| | - Qi Dai
- College of Life Sciences, Zhejiang Sci-Tech University, Hangzhou 310018, China
| |
Collapse
|
24
|
Li Z, Guo W, Zeng T, Yin J, Feng K, Huang T, Cai YD. Detecting Brain Structure-Specific Methylation Signatures and Rules for Alzheimer's Disease. Front Neurosci 2022; 16:895181. [PMID: 35585924 PMCID: PMC9108872 DOI: 10.3389/fnins.2022.895181] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/13/2022] [Accepted: 04/11/2022] [Indexed: 01/01/2023] Open
Abstract
Alzheimer's disease (AD) is a progressive disease that leads to irreversible behavioral changes, erratic emotions, and loss of motor skills. These conditions make people with AD hard or almost impossible to take care of. Multiple internal and external pathological factors may affect or even trigger the initiation and progression of AD. DNA methylation is one of the most effective regulatory roles during AD pathogenesis, and pathological methylation alterations may be potentially different in the various brain structures of people with AD. Although multiple loci associated with AD initiation and progression have been identified, the spatial distribution patterns of AD-associated DNA methylation in the brain have not been clarified. According to the systematic methylation profiles on different structural brain regions, we applied multiple machine learning algorithms to investigate such profiles. First, the profile on each brain region was analyzed by the Boruta feature filtering method. Some important methylation features were extracted and further analyzed by the max-relevance and min-redundancy method, resulting in a feature list. Then, the incremental feature selection method, incorporating some classification algorithms, adopted such list to identify candidate AD-associated loci at methylation with structural specificity, establish a group of quantitative rules for revealing the effects of DNA methylation in various brain regions (i.e., four brain structures) on AD pathogenesis. Furthermore, some efficient classifiers based on essential methylation sites were proposed to identify AD samples. Results revealed that methylation alterations in different brain structures have different contributions to AD pathogenesis. This study further illustrates the complex pathological mechanisms of AD.
Collapse
Affiliation(s)
- ZhanDong Li
- College of Food Engineering, Jilin Engineering Normal University, Changchun, China
| | - Wei Guo
- Key Laboratory of Stem Cell Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai Jiao Tong University School of Medicine, Shanghai, China
| | - Tao Zeng
- Bio-Med Big Data Center, CAS Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, Chinese Academy of Sciences, University of Chinese Academy of Sciences, Shanghai, China
| | - Jie Yin
- Cancer Institute, Second Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, China
- Department of Human Genetics, Institute of Genetics, Zhejiang University School of Medicine, Zhejiang University, Hangzhou, China
| | - KaiYan Feng
- Department of Computer Science, Guangdong AIB Polytechnic College, Guangzhou, China
| | - Tao Huang
- Bio-Med Big Data Center, CAS Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, Chinese Academy of Sciences, University of Chinese Academy of Sciences, Shanghai, China
- CAS Key Laboratory of Tissue Microenvironment and Tumor, Shanghai Institute of Nutrition and Health, Chinese Academy of Sciences, University of Chinese Academy of Sciences, Shanghai, China
| | - Yu-Dong Cai
- School of Life Sciences, Shanghai University, Shanghai, China
| |
Collapse
|
25
|
Similarity-Based Method with Multiple-Feature Sampling for Predicting Drug Side Effects. COMPUTATIONAL AND MATHEMATICAL METHODS IN MEDICINE 2022; 2022:9547317. [PMID: 35401786 PMCID: PMC8993545 DOI: 10.1155/2022/9547317] [Citation(s) in RCA: 15] [Impact Index Per Article: 5.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 07/20/2021] [Revised: 09/18/2021] [Accepted: 03/15/2022] [Indexed: 12/23/2022]
Abstract
Drugs can treat different diseases but also bring side effects. Undetected and unaccepted side effects for approved drugs can greatly harm the human body and bring huge risks for pharmaceutical companies. Traditional experimental methods used to determine the side effects have several drawbacks, such as low efficiency and high cost. One alternative to achieve this purpose is to design computational methods. Previous studies modeled a binary classification problem by pairing drugs and side effects; however, their classifiers can only extract one feature from each type of drug association. The present work proposed a novel multiple-feature sampling scheme that can extract several features from one type of drug association. Thirteen classification algorithms were employed to construct classifiers with features yielded by such scheme. Their performance was greatly improved compared with that of the classifiers that use the features yielded by the original scheme. Best performance was observed for the classifier based on random forest with MCC of 0.8661, AUROC of 0.969, and AUPR of 0.977. Finally, one key parameter in the multiple-feature sampling scheme was analyzed.
Collapse
|
26
|
Li X, Lu L, Chen L. Identification of protein functions in mouse with a label space partition method. MATHEMATICAL BIOSCIENCES AND ENGINEERING : MBE 2022; 19:3820-3842. [PMID: 35341276 DOI: 10.3934/mbe.2022176] [Citation(s) in RCA: 25] [Impact Index Per Article: 8.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/14/2023]
Abstract
Protein is very important for almost all living creatures because it participates in most complicated and essential biological processes. Determining the functions of given proteins is one of the most essential problems in protein science. Such determination can be conducted through traditional experiments. However, the experimental methods are always time-consuming and of high costs. In recent years, computational methods give useful aids for identification of protein functions. This study presented a new multi-label classifier for identifying functions of mouse proteins. Due to the number of functional types, which were termed as labels in the classification procedure, a label space partition method was employed to divide labels into some partitions. On each partition, a multi-label classifier was constructed. The classifiers based on all partitions were integrated in the proposed classifier. The cross-validation results proved that the proposed classifier was of good performance. Classifiers with label partition were superior to those without label partition or with random label partition.
Collapse
Affiliation(s)
- Xuan Li
- College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China
| | - Lin Lu
- Department of Radiology, Columbia University Medical Center, New York 10032, USA
| | - Lei Chen
- College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China
| |
Collapse
|
27
|
Ding S, Li H, Zhang YH, Zhou X, Feng K, Li Z, Chen L, Huang T, Cai YD. Identification of Pan-Cancer Biomarkers Based on the Gene Expression Profiles of Cancer Cell Lines. Front Cell Dev Biol 2021; 9:781285. [PMID: 34917619 PMCID: PMC8669964 DOI: 10.3389/fcell.2021.781285] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/22/2021] [Accepted: 11/16/2021] [Indexed: 12/12/2022] Open
Abstract
There are many types of cancers. Although they share some hallmarks, such as proliferation and metastasis, they are still very different from many perspectives. They grow on different organ or tissues. Does each cancer have a unique gene expression pattern that makes it different from other cancer types? After the Cancer Genome Atlas (TCGA) project, there are more and more pan-cancer studies. Researchers want to get robust gene expression signature from pan-cancer patients. But there is large variance in cancer patients due to heterogeneity. To get robust results, the sample size will be too large to recruit. In this study, we tried another approach to get robust pan-cancer biomarkers by using the cell line data to reduce the variance. We applied several advanced computational methods to analyze the Cancer Cell Line Encyclopedia (CCLE) gene expression profiles which included 988 cell lines from 20 cancer types. Two feature selection methods, including Boruta, and max-relevance and min-redundancy methods, were applied to the cell line gene expression data one by one, generating a feature list. Such list was fed into incremental feature selection method, incorporating one classification algorithm, to extract biomarkers, construct optimal classifiers and decision rules. The optimal classifiers provided good performance, which can be useful tools to identify cell lines from different cancer types, whereas the biomarkers (e.g. NCKAP1, TNFRSF12A, LAMB2, FKBP9, PFN2, TOM1L1) and rules identified in this work may provide a meaningful and precise reference for differentiating multiple types of cancer and contribute to the personalized treatment of tumors.
Collapse
Affiliation(s)
- ShiJian Ding
- School of Life Sciences, Shanghai University, Shanghai, China
| | - Hao Li
- College of Food Engineering, Jilin Engineering Normal University, Changchun, China
| | - Yu-Hang Zhang
- Channing Division of Network Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, United States
| | - XianChao Zhou
- Center for Single-Cell Omics, School of Public Health, Shanghai Jiao Tong University School of Medicine, Shanghai, China
| | - KaiYan Feng
- Department of Computer Science, Guangdong AIB Polytechnic College, Guangzhou, China
| | - ZhanDong Li
- College of Food Engineering, Jilin Engineering Normal University, Changchun, China
| | - Lei Chen
- College of Information Engineering, Shanghai Maritime University, Shanghai, China
| | - Tao Huang
- CAS Key Laboratory of Computational Biology, Bio-Med Big Data Center, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, China.,CAS Key Laboratory of Tissue Microenvironment and Tumor, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, China
| | - Yu-Dong Cai
- School of Life Sciences, Shanghai University, Shanghai, China
| |
Collapse
|
28
|
iMPT-FDNPL: Identification of Membrane Protein Types with Functional Domains and a Natural Language Processing Approach. COMPUTATIONAL AND MATHEMATICAL METHODS IN MEDICINE 2021; 2021:7681497. [PMID: 34671418 PMCID: PMC8523280 DOI: 10.1155/2021/7681497] [Citation(s) in RCA: 29] [Impact Index Per Article: 7.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 08/22/2021] [Revised: 09/15/2021] [Accepted: 09/27/2021] [Indexed: 12/20/2022]
Abstract
Membrane protein is an important kind of proteins. It plays essential roles in several cellular processes. Based on the intramolecular arrangements and positions in a cell, membrane proteins can be divided into several types. It is reported that the types of a membrane protein are highly related to its functions. Determination of membrane protein types is a hot topic in recent years. A plenty of computational methods have been proposed so far. Some of them used functional domain information to encode proteins. However, this procedure was still crude. In this study, we designed a novel feature extraction scheme to obtain informative features of proteins from their functional domain information. Such scheme termed domains as words and proteins, represented by its domains, as sentences. The natural language processing approach, word2vector, was applied to access the features of domains, which were further refined to protein features. Based on these features, RAndom k-labELsets with random forest as the base classifier was employed to build the multilabel classifier, namely, iMPT-FDNPL. The tenfold cross-validation results indicated the good performance of such classifier. Furthermore, such classifier was superior to other classifiers based on features derived from functional domains via one-hot scheme or derived from other properties of proteins, suggesting the effectiveness of protein features generated by the proposed scheme.
Collapse
|
29
|
Identification of Novel Choroidal Neovascularization-Related Genes Using Laplacian Heat Diffusion Algorithm. BIOMED RESEARCH INTERNATIONAL 2021; 2021:2295412. [PMID: 34532497 PMCID: PMC8440095 DOI: 10.1155/2021/2295412] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 05/09/2021] [Accepted: 08/20/2021] [Indexed: 11/20/2022]
Abstract
Choroidal neovascularization (CNV) is a type of eye disease that can cause vision loss. In recent years, many studies have attempted to investigate the major pathological processes and molecular pathogenic mechanisms of CNV. Because many diseases are related to genes, the genes associated with CNV need to be identified. In this study, we proposed a network-based approach for identifying novel CNV-associated genes. To execute such method, we first employed a protein-protein interaction network reported in STRING. Then, we applied a network diffusion algorithm, Laplacian heat diffusion, on this network by selecting validated CNV-related genes as the seed nodes. As a result, some novel genes that had unknown but strong relationships with validated genes were identified. Furthermore, we used a screening procedure to extract the most essential genes. Eleven latent CNV-related genes were finally obtained. Extensive analyses were performed to confirm that these genes are novel CNV-related genes.
Collapse
|
30
|
Chen L, Zhou X, Zeng T, Pan X, Zhang YH, Huang T, Fang Z, Cai YD. Recognizing Pattern and Rule of Mutation Signatures Corresponding to Cancer Types. Front Cell Dev Biol 2021; 9:712931. [PMID: 34513841 PMCID: PMC8427289 DOI: 10.3389/fcell.2021.712931] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/21/2021] [Accepted: 07/02/2021] [Indexed: 11/20/2022] Open
Abstract
Cancer has been generally defined as a cluster of systematic malignant pathogenesis involving abnormal cell growth. Genetic mutations derived from environmental factors and inherited genetics trigger the initiation and progression of cancers. Although several well-known factors affect cancer, mutation features and rules that affect cancers are relatively unknown due to limited related studies. In this study, a computational investigation on mutation profiles of cancer samples in 27 types was given. These profiles were first analyzed by the Monte Carlo Feature Selection (MCFS) method. A feature list was thus obtained. Then, the incremental feature selection (IFS) method adopted such list to extract essential mutation features related to 27 cancer types, find out 207 mutation rules and construct efficient classifiers. The top 37 mutation features corresponding to different cancer types were discussed. All the qualitatively analyzed gene mutation features contribute to the distinction of different types of cancers, and most of such mutation rules are supported by recent literature. Therefore, our computational investigation could identify potential biomarkers and prediction rules for cancers in the mutation signature level.
Collapse
Affiliation(s)
- Lei Chen
- School of Life Sciences, Shanghai University, Shanghai, China.,College of Information Engineering, Shanghai Maritime University, Shanghai, China
| | - Xianchao Zhou
- School of Life Sciences and Technology, ShanghaiTech University, Shanghai, China.,Center for Single-Cell Omics, School of Public Health, Shanghai Jiao Tong University School of Medicine, Shanghai, China
| | - Tao Zeng
- CAS Key Laboratory of Computational Biology, Bio-Med Big Data Center, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, China
| | - Xiaoyong Pan
- Key Laboratory of System Control and Information Processing, Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, Ministry of Education of China, Shanghai, China
| | - Yu-Hang Zhang
- Channing Division of Network Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, United States
| | - Tao Huang
- CAS Key Laboratory of Computational Biology, Bio-Med Big Data Center, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, China.,Key Laboratory of Tissue Microenvironment and Tumor, Shanghai Institute of Nutrition and Health, Chinese Academy of Sciences, Shanghai, China
| | - Zhaoyuan Fang
- Zhejiang University-University of Edinburgh Institute, Zhejiang University School of Medicine, Haining, China
| | - Yu-Dong Cai
- School of Life Sciences, Shanghai University, Shanghai, China
| |
Collapse
|
31
|
Wang X, Liu M, Zhang Y, He S, Qin C, Li Y, Lu T. Deep fusion learning facilitates anatomical therapeutic chemical recognition in drug repurposing and discovery. Brief Bioinform 2021; 22:6342939. [PMID: 34368838 DOI: 10.1093/bib/bbab289] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/21/2021] [Revised: 07/03/2021] [Accepted: 07/06/2021] [Indexed: 01/17/2023] Open
Abstract
The advent of large-scale biomedical data and computational algorithms provides new opportunities for drug repurposing and discovery. It is of great interest to find an appropriate data representation and modeling method to facilitate these studies. The anatomical therapeutic chemical (ATC) classification system, proposed by the World Health Organization (WHO), is an essential source of information for drug repurposing and discovery. Besides, computational methods are applied to predict drug ATC classification. We conducted a systematic review of ATC computational prediction studies and revealed the differences in data sets, data representation, algorithm approaches, and evaluation metrics. We then proposed a deep fusion learning (DFL) framework to optimize the ATC prediction model, namely DeepATC. The methods based on graph convolutional network, inferring biological network and multimodel attentive fusion network were applied in DeepATC to extract the molecular topological information and low-dimensional representation from the molecular graph and heterogeneous biological networks. The results indicated that DeepATC achieved superior model performance with area under the curve (AUC) value at 0.968. Furthermore, the DFL framework was performed for the transcriptome data-based ATC prediction, as well as another independent task that is significantly relevant to drug discovery, namely drug-target interaction. The DFL-based model achieved excellent performance in the above-extended validation task, suggesting that the idea of aggregating the heterogeneous biological network and node's (molecule or protein) self-topological features will bring inspiration for broader drug repurposing and discovery research.
Collapse
Affiliation(s)
- Xiting Wang
- Life Science School, Beijing University of Chinese Medicine, Beijing, China
| | - Meng Liu
- Chinese Medicine School, Beijing University of Chinese Medicine, Beijing, China
| | - Yiling Zhang
- Beijing University of Chinese Medicine, Beijing, China
| | - Shuangshuang He
- Chinese Medicine School, Beijing University of Chinese Medicine, Beijing, China
| | - Caimeng Qin
- School of Life Sciences, Beijing University of Chinese Medicine and Institute of Biophysics, Chinese Academy of Sciences, Beijing, China
| | - Yu Li
- Chinese Medicine School, Beijing University of Chinese Medicine, Beijing, China
| | - Tao Lu
- Integrative Medicine Center in School of Life Sciences, Beijing University of Chinese Medicine, Beijing, China
| |
Collapse
|
32
|
Analysis of the Sequence Characteristics of Antifreeze Protein. Life (Basel) 2021; 11:life11060520. [PMID: 34204983 PMCID: PMC8226703 DOI: 10.3390/life11060520] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/09/2021] [Revised: 05/27/2021] [Accepted: 05/31/2021] [Indexed: 12/31/2022] Open
Abstract
Antifreeze protein (AFP) is a proteinaceous compound with improved antifreeze ability and binding ability to ice to prevent its growth. As a surface-active material, a small number of AFPs have a tremendous influence on the growth of ice. Therefore, identifying novel AFPs is important to understand protein–ice interactions and create novel ice-binding domains. To date, predicting AFPs is difficult due to their low sequence similarity for the ice-binding domain and the lack of common features among different AFPs. Here, a computational engine was developed to predict the features of AFPs and reveal the most important 39 features for AFP identification, such as antifreeze-like/N-acetylneuraminic acid synthase C-terminal, insect AFP motif, C-type lectin-like, and EGF-like domain. With this newly presented computational method, a group of previously confirmed functional AFP motifs was screened out. This study has identified some potential new AFP motifs and contributes to understanding biological antifreeze mechanisms.
Collapse
|
33
|
Wang L, Li J, Liu J, Chang M. RAMRSGL: A Robust Adaptive Multinomial Regression Model for Multicancer Classification. COMPUTATIONAL AND MATHEMATICAL METHODS IN MEDICINE 2021; 2021:5584684. [PMID: 34122617 PMCID: PMC8172296 DOI: 10.1155/2021/5584684] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 02/22/2021] [Accepted: 05/12/2021] [Indexed: 11/17/2022]
Abstract
In view of the challenges of the group Lasso penalty methods for multicancer microarray data analysis, e.g., dividing genes into groups in advance and biological interpretability, we propose a robust adaptive multinomial regression with sparse group Lasso penalty (RAMRSGL) model. By adopting the overlapping clustering strategy, affinity propagation clustering is employed to obtain each cancer gene subtype, which explores the group structure of each cancer subtype and merges the groups of all subtypes. In addition, the data-driven weights based on noise are added to the sparse group Lasso penalty, combining with the multinomial log-likelihood function to perform multiclassification and adaptive group gene selection simultaneously. The experimental results on acute leukemia data verify the effectiveness of the proposed method.
Collapse
Affiliation(s)
- Lei Wang
- Department of Basic Science Teaching, Henan Polytechnic Institute, Nanyang, 473000 Henan, China
| | - Juntao Li
- College of Mathematics and Information Science, Henan Normal University, Xinxiang, 453007 Henan, China
| | - Juanfang Liu
- College of Mathematics and Information Science, Henan Normal University, Xinxiang, 453007 Henan, China
| | - Mingming Chang
- College of Mathematics and Information Science, Henan Normal University, Xinxiang, 453007 Henan, China
| |
Collapse
|
34
|
Chen L, Li Z, Zeng T, Zhang YH, Li H, Huang T, Cai YD. Predicting gene phenotype by multi-label multi-class model based on essential functional features. Mol Genet Genomics 2021; 296:905-918. [PMID: 33914130 DOI: 10.1007/s00438-021-01789-8] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/26/2021] [Accepted: 04/13/2021] [Indexed: 12/19/2022]
Abstract
Phenotype is one of the most significant concepts in genetics, which is used to describe all the characteristics of a research object that can be observed. Considering that phenotype reflects the integrated features of genotype and environment factors, it is hard to define phenotype characteristics, even difficult to predict unknown phenotypes. Restricted by current biological techniques, it is still quite expensive and time-consuming to obtain sufficient structural information of large-scale phenotype-associated genes/proteins. Various bioinformatics methods have been presented to solve such problem, and researchers have confirmed the efficacy and prediction accuracy of functional network-based prediction. But general functional descriptions have highly complicated inner structures for phenotype prediction. To further address this issue and improve the efficacy of phenotype prediction on more than ten kinds of phenotypes, we first extract functional enrichment features from GO and KEGG, and then use node2vec to learn functional embedding features of genes from a gene-gene network. All these features are analyzed by some feature selection methods (Boruta, minimum redundancy maximum relevance) to generate a feature list. Such list is fed into the incremental feature selection, incorporating some multi-label classifiers built by RAkEL and some classic base classifiers, to build an optimum multi-label multi-class classification model for phenotype prediction. According to recent researches, our method has indeed identified many literature-supported genes/proteins and their associated phenotypes, and even some candidate genes with re-assigned new phenotypes, which provide a new computational tool for the accurate and effective phenotypic prediction.
Collapse
Affiliation(s)
- Lei Chen
- School of Life Sciences, Shanghai University, Shanghai, 200444, People's Republic of China.,College of Information Engineering, Shanghai Maritime University, Shanghai, 201306, People's Republic of China
| | - Zhandong Li
- College of Food Engineering, Jilin Engineering Normal University, Changchun, 130052, People's Republic of China
| | - Tao Zeng
- CAS Key Laboratory of Computational Biology, Bio-Med Big Data Center, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, 200031, People's Republic of China
| | - Yu-Hang Zhang
- Channing Division of Network Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
| | - Hao Li
- College of Food Engineering, Jilin Engineering Normal University, Changchun, 130052, People's Republic of China
| | - Tao Huang
- Key Laboratory of Tissue Microenvironment and Tumor, Shanghai Institute of Nutrition and Health, Chinese Academy of Sciences, Shanghai, 200031, People's Republic of China.
| | - Yu-Dong Cai
- School of Life Sciences, Shanghai University, Shanghai, 200444, People's Republic of China.
| |
Collapse
|
35
|
Zhang YH, Li Z, Zeng T, Chen L, Li H, Gamarra M, Mansour RF, Escorcia-Gutierrez J, Huang T, Cai YD. Investigating gene methylation signatures for fetal intolerance prediction. PLoS One 2021; 16:e0250032. [PMID: 33886611 PMCID: PMC8062050 DOI: 10.1371/journal.pone.0250032] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/04/2021] [Accepted: 03/29/2021] [Indexed: 11/29/2022] Open
Abstract
Pregnancy is a complicated and long procedure during one or more offspring development inside a woman. A short period of oxygen shortage after birth is quite normal for most babies and does not threaten their health. However, if babies have to suffer from a long period of oxygen shortage, then this condition is an indication of pathological fetal intolerance, which probably causes their death. The identification of the pathological fetal intolerance from the physical oxygen shortage is one of the important clinical problems in obstetrics for a long time. The clinical syndromes typically manifest five symptoms that indicate that the baby may suffer from fetal intolerance. At present, liquid biopsy combined with high-throughput sequencing or mass spectrum techniques provides a quick approach to detect real-time alteration in the peripheral blood at multiple levels with the rapid development of molecule sequencing technologies. Gene methylation is functionally correlated with gene expression; thus, the combination of gene methylation and expression information would help in screening out the key regulators for the pathogenesis of fetal intolerance. We combined gene methylation and expression features together and screened out the optimal features, including gene expression or methylation signatures, for fetal intolerance prediction for the first time. In addition, we applied various computational methods to construct a comprehensive computational pipeline to identify the potential biomarkers for fetal intolerance dependent on the liquid biopsy samples. We set up qualitative and quantitative computational models for the prediction for fetal intolerance during pregnancy. Moreover, we provided a new prospective for the detailed pathological mechanism of fetal intolerance. This work can provide a solid foundation for further experimental research and contribute to the application of liquid biopsy in antenatal care.
Collapse
Affiliation(s)
- Yu-Hang Zhang
- School of Life Sciences, Shanghai University, Shanghai, China
- Channing Division of Network Medicine, Brigham and Women’s Hospital, Harvard Medical School, Boston, MA, United States of America
| | - Zhandong Li
- College of Food Engineering, Jilin Engineering Normal University, Changchun, China
| | - Tao Zeng
- Bio-Med Big Data Center, CAS Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, China
| | - Lei Chen
- College of Information Engineering, Shanghai Maritime University, Shanghai, China
| | - Hao Li
- College of Food Engineering, Jilin Engineering Normal University, Changchun, China
| | - Margarita Gamarra
- Department of Computational Science and Electronic, Universidad de la Costa, CUC, Barranquilla, Colombia
| | - Romany F. Mansour
- Department of Mathematics, Faculty of Science, New Valley University, El-Kharga, Egypt
| | - José Escorcia-Gutierrez
- Electronic and Telecommunicacions Program, Universidad Autónoma del Caribe, Barranquilla, Colombia
- * E-mail: (JEG); (TH); (YDC)
| | - Tao Huang
- Bio-Med Big Data Center, CAS Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, China
- CAS Key Laboratory of Tissue Microenvironment and Tumor, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, China
- * E-mail: (JEG); (TH); (YDC)
| | - Yu-Dong Cai
- School of Life Sciences, Shanghai University, Shanghai, China
- * E-mail: (JEG); (TH); (YDC)
| |
Collapse
|
36
|
Identifying Infliximab- (IFX-) Responsive Blood Signatures for the Treatment of Rheumatoid Arthritis. BIOMED RESEARCH INTERNATIONAL 2021. [DOI: 10.1155/2021/5556784] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/18/2022]
Abstract
Rheumatoid arthritis (RA) is a severe chronic pathogenic inflammatory abnormality that damages small joints. Comprehensive diagnosis and treatment procedures for RA have been established because of its severe symptoms and relatively high morbidity. Medication and surgery are the two major therapeutic approaches. Infliximab (IFX) is a novel biological agent applied for the treatment of RA. IFX improves physical functions and benefits the achievement of clinical remission even under discontinuous medication. However, not all patients react to IFX, and distinguishing IFX-sensitive and IFX-resistant patients is quite difficult. Thus, how to predict the therapeutic effects of IFX on patients with RA is one of the urgent translational medicine problems in the clinical treatment of RA. In this study, we present a novel computational method for the identification of the applicable and substantial blood gene signatures of IFX sensitivity by liquid biopsy, which may assist in the establishment of a clinical drug sensitivity test standard for RA and contribute to the revelation of unique IFX-associated pharmacological mechanisms.
Collapse
|
37
|
Yuan F, Li Z, Chen L, Zeng T, Zhang YH, Ding S, Huang T, Cai YD. Identifying the Signatures and Rules of Circulating Extracellular MicroRNA for Distinguishing Cancer Subtypes. Front Genet 2021; 12:651610. [PMID: 33767734 PMCID: PMC7985347 DOI: 10.3389/fgene.2021.651610] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/10/2021] [Accepted: 02/10/2021] [Indexed: 12/24/2022] Open
Abstract
Cancer is one of the most threatening diseases to humans. It can invade multiple significant organs, including lung, liver, stomach, pancreas, and even brain. The identification of cancer biomarkers is one of the most significant components of cancer studies as the foundation of clinical cancer diagnosis and related drug development. During the large-scale screening for cancer prevention and early diagnosis, obtaining cancer-related tissues is impossible. Thus, the identification of cancer-associated circulating biomarkers from liquid biopsy targeting has been proposed and has become the most important direction for research on clinical cancer diagnosis. Here, we analyzed pan-cancer extracellular microRNA profiles by using multiple machine-learning models. The extracellular microRNA profiles on 11 cancer types and non-cancer were first analyzed by Boruta to extract important microRNAs. Selected microRNAs were then evaluated by the Max-Relevance and Min-Redundancy feature selection method, resulting in a feature list, which were fed into the incremental feature selection method to identify candidate circulating extracellular microRNA for cancer recognition and classification. A series of quantitative classification rules was also established for such cancer classification, thereby providing a solid research foundation for further biomarker exploration and functional analyses of tumorigenesis at the level of circulating extracellular microRNA.
Collapse
Affiliation(s)
- Fei Yuan
- School of Life Sciences, Shanghai University, Shanghai, China
- Department of Science and Technology, Binzhou Medical University Hospital, Binzhou, China
| | - Zhandong Li
- College of Food Engineering, Jilin Engineering Normal University, Changchun, China
| | - Lei Chen
- College of Information Engineering, Shanghai Maritime University, Shanghai, China
| | - Tao Zeng
- Bio-Med Big Data Center, CAS Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, China
| | - Yu-Hang Zhang
- Channing Division of Network Medicine, Brigham and Women’s Hospital, Harvard Medical School, Boston, MA, United States
| | - Shijian Ding
- School of Life Sciences, Shanghai University, Shanghai, China
| | - Tao Huang
- Bio-Med Big Data Center, CAS Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, China
- CAS Key Laboratory of Tissue Microenvironment and Tumor, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, China
| | - Yu-Dong Cai
- School of Life Sciences, Shanghai University, Shanghai, China
| |
Collapse
|
38
|
Identifying the Immunological Gene Signatures of Immune Cell Subtypes. BIOMED RESEARCH INTERNATIONAL 2021. [DOI: 10.1155/2021/6639698] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 12/17/2022]
Abstract
The immune system is a complicated defensive system that comprises multiple functional cells and molecules acting against endogenous and exogenous pathogenic factors. Identifying immune cell subtypes and recognizing their unique immunological functions are difficult because of the complicated cellular components and immunological functions of the immune system. With the development of transcriptomics and high-throughput sequencing, the gene expression profiling of immune cells can provide a new strategy to explore the immune cell subtyping. On the basis of the new profiling data of mouse immune cell gene expression from the Immunological Genome Project (ImmGen), a novel computational pipeline was applied to identify different immune cell subtypes, including αβ T cells, B cells, γδ T cells, and innate lymphocytes. First, the profiling data was analyzed by a powerful feature selection method, Monte-Carlo Feature Selection, resulting in a feature list and some informative features. For the list, the two-stage incremental feature selection method, incorporating random forest as the classification algorithm, was applied to extract essential gene signatures and build an efficient classifier. On the other hand, a rule learning scheme was applied on the informative features to construct quantitative expression rules. A group of gene signatures was found as qualitatively related to the biological processes of four immune cell subtypes. The quantitative expression rules can efficiently cluster immune cells. This work provides a novel computational tool for immune cell quantitative subtyping and biomarker recognition.
Collapse
|
39
|
Peng X, Chen L, Zhou JP. Identification of Carcinogenic Chemicals with Network Embedding and Deep Learning Methods. Curr Bioinform 2021. [DOI: 10.2174/1574893615999200414084317] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/21/2023]
Abstract
Background:
Cancer is the second leading cause of human death in the world. To date,
many factors have been confirmed to be the cause of cancer. Among them, carcinogenic chemicals
have been widely accepted as the important ones. Traditional methods for detecting carcinogenic
chemicals are of low efficiency and high cost.
Objective:
The aim of this study was to design an efficient computational method for the
identification of carcinogenic chemicals.
Methods:
A new computational model was proposed for detecting carcinogenic chemicals. As a
data-driven model, carcinogenic and non-carcinogenic chemicals were obtained from Carcinogenic
Potency Database (CPDB). These chemicals were represented by features extracted from five
chemical networks, representing five types of chemical associations, via a network embedding
method, Mashup. Obtained features were fed into a powerful deep learning method, recurrent
neural network, to build the model.
Results:
The jackknife test on such model provided the F-measure of 0.971 and AUROC of 0.971.
Conclusion:
The proposed model was quite effective and was superior to the models with
traditional machine learning algorithms, classic chemical encoding schemes or direct usage of
chemical associations.
Collapse
Affiliation(s)
- Xuefei Peng
- College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China
| | - Lei Chen
- College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China
| | - Jian-Peng Zhou
- College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China
| |
Collapse
|
40
|
Zhang YH, Li Z, Zeng T, Chen L, Li H, Huang T, Cai YD. Detecting the Multiomics Signatures of Factor-Specific Inflammatory Effects on Airway Smooth Muscles. Front Genet 2021; 11:599970. [PMID: 33519902 PMCID: PMC7838645 DOI: 10.3389/fgene.2020.599970] [Citation(s) in RCA: 27] [Impact Index Per Article: 6.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/28/2020] [Accepted: 12/14/2020] [Indexed: 12/19/2022] Open
Abstract
Smooth muscles are a specific muscle subtype that is widely identified in the tissues of internal passageways. This muscle subtype has the capacity for controlled or regulated contraction and relaxation. Airway smooth muscles are a unique type of smooth muscles that constitute the effective, adjustable, and reactive wall that covers most areas of the entire airway from the trachea to lung tissues. Infection with SARS-CoV-2, which caused the world-wide COVID-19 pandemic, involves airway smooth muscles and their surrounding inflammatory environment. Therefore, airway smooth muscles and related inflammatory factors may play an irreplaceable role in the initiation and progression of several severe diseases. Many previous studies have attempted to reveal the potential relationships between interleukins and airway smooth muscle cells only on the omics level, and the continued existence of numerous false-positive optimal genes/transcripts cannot reflect the actual effective biological mechanisms underlying interleukin-based activation effects on airway smooth muscles. Here, on the basis of newly presented machine learning-based computational approaches, we identified specific regulatory factors and a series of rules that contribute to the activation and stimulation of airway smooth muscles by IL-13, IL-17, or the combination of both interleukins on the epigenetic and/or transcriptional levels. The detected discriminative factors (genes) and rules can contribute to the identification of potential regulatory mechanisms linking airway smooth muscle tissues and inflammatory factors and help reveal specific pathological factors for diseases associated with airway smooth muscle inflammation on multiomics levels.
Collapse
Affiliation(s)
- Yu-Hang Zhang
- School of Life Sciences, Shanghai University, Shanghai, China
- Channing Division of Network Medicine, Brigham and Women’s Hospital, Harvard Medical School, Boston, MA, United States
| | - Zhandong Li
- College of Food Engineering, Jilin Engineering Normal University, Changchun, China
| | - Tao Zeng
- Bio-Med Big Data Center, CAS Key Laboratory of Computational Biology, CAS-MPG Partner Institute for Computational Biology, Shanghai Institute of Nutrition and Health, Chinese Academy of Sciences, Shanghai, China
| | - Lei Chen
- College of Information Engineering, Shanghai Maritime University, Shanghai, China
| | - Hao Li
- College of Food Engineering, Jilin Engineering Normal University, Changchun, China
| | - Tao Huang
- Key Laboratory of Tissue Microenvironment and Tumor, Shanghai Institute of Nutrition and Health, Chinese Academy of Sciences, Shanghai, China
| | - Yu-Dong Cai
- School of Life Sciences, Shanghai University, Shanghai, China
| |
Collapse
|
41
|
Zhang YH, Li H, Zeng T, Chen L, Li Z, Huang T, Cai YD. Identifying Transcriptomic Signatures and Rules for SARS-CoV-2 Infection. Front Cell Dev Biol 2021; 8:627302. [PMID: 33505977 PMCID: PMC7829664 DOI: 10.3389/fcell.2020.627302] [Citation(s) in RCA: 51] [Impact Index Per Article: 12.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/09/2020] [Accepted: 12/14/2020] [Indexed: 12/26/2022] Open
Abstract
The world-wide Coronavirus Disease 2019 (COVID-19) pandemic was triggered by the widespread of a new strain of coronavirus named as severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). Multiple studies on the pathogenesis of SARS-CoV-2 have been conducted immediately after the spread of the disease. However, the molecular pathogenesis of the virus and related diseases has still not been fully revealed. In this study, we attempted to identify new transcriptomic signatures as candidate diagnostic models for clinical testing or as therapeutic targets for vaccine design. Using the recently reported transcriptomics data of upper airway tissue with acute respiratory illnesses, we integrated multiple machine learning methods to identify effective qualitative biomarkers and quantitative rules for the distinction of SARS-CoV-2 infection from other infectious diseases. The transcriptomics data was first analyzed by Boruta so that important features were selected, which were further evaluated by the minimum redundancy maximum relevance method. A feature list was produced. This list was fed into the incremental feature selection, incorporating some classification algorithms, to extract qualitative biomarker genes and construct quantitative rules. Also, an efficient classifier was built to identify patients infected with SARS-COV-2. The findings reported in this study may help in revealing the potential pathogenic mechanisms of COVID-19 and finding new targets for vaccine design.
Collapse
Affiliation(s)
- Yu-Hang Zhang
- School of Life Sciences, Shanghai University, Shanghai, China
- Channing Division of Network Medicine, Brigham and Women’s Hospital, Harvard Medical School, Boston, MA, United States
| | - Hao Li
- College of Food Engineering, Jilin Engineering Normal University, Changchun, China
| | - Tao Zeng
- Bio-Med Big Data Center, CAS Key Laboratory of Computational Biology, CAS-MPG Partner Institute for Computational Biology, Shanghai Institute of Nutrition and Health, Chinese Academy of Sciences, Shanghai, China
| | - Lei Chen
- College of Information Engineering, Shanghai Maritime University, Shanghai, China
| | - Zhandong Li
- College of Food Engineering, Jilin Engineering Normal University, Changchun, China
| | - Tao Huang
- Shanghai Institute of Nutrition and Health, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, China
| | - Yu-Dong Cai
- School of Life Sciences, Shanghai University, Shanghai, China
| |
Collapse
|
42
|
Zhu L, Yang X, Zhu R, Yu L. Identifying Discriminative Biological Function Features and Rules for Cancer-Related Long Non-coding RNAs. Front Genet 2021; 11:598773. [PMID: 33391350 PMCID: PMC7772407 DOI: 10.3389/fgene.2020.598773] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/25/2020] [Accepted: 11/23/2020] [Indexed: 01/17/2023] Open
Abstract
Cancer has been a major public health problem worldwide for many centuries. Cancer is a complex disease associated with accumulative genetic mutations, epigenetic aberrations, chromosomal instability, and expression alteration. Increasing lines of evidence suggest that many non-coding transcripts, which are termed as non-coding RNAs, have important regulatory roles in cancer. In particular, long non-coding RNAs (lncRNAs) play crucial roles in tumorigenesis. Cancer-related lncRNAs serve as oncogenic factors or tumor suppressors. Although many lncRNAs are identified as potential regulators in tumorigenesis by using traditional experimental methods, they are time consuming and expensive considering the tremendous amount of lncRNAs needed. Thus, effective and fast approaches to recognize tumor-related lncRNAs should be developed. The proposed approach should help us understand not only the mechanisms of lncRNAs that participate in tumorigenesis but also their satisfactory performance in distinguishing cancer-related lncRNAs. In this study, we utilized a decision tree (DT), a type of rule learning algorithm, to investigate cancer-related lncRNAs with functional annotation contents [gene ontology (GO) terms and KEGG pathways] of their co-expressed genes. Cancer-related and other lncRNAs encoded by the key enrichment features of GO and KEGG filtered by feature selection methods were used to build an informative DT, which further induced several decision rules. The rules provided not only a new tool for identifying cancer-related lncRNAs but also connected the lncRNAs and cancers with the combinations of GO terms. Results provided new directions for understanding cancer-related lncRNAs.
Collapse
Affiliation(s)
- Liucun Zhu
- School of Life Sciences, Shanghai University, Shanghai, China
| | - Xin Yang
- School of Life Sciences, Shanghai University, Shanghai, China
| | - Rui Zhu
- School of Life Sciences, Shanghai University, Shanghai, China
| | - Lei Yu
- Department of Medical Oncology, Shanghai Concord Medical Cancer Center, Shanghai, China
| |
Collapse
|
43
|
iMPTCE-Hnetwork: A Multilabel Classifier for Identifying Metabolic Pathway Types of Chemicals and Enzymes with a Heterogeneous Network. COMPUTATIONAL AND MATHEMATICAL METHODS IN MEDICINE 2021; 2021:6683051. [PMID: 33488764 PMCID: PMC7803417 DOI: 10.1155/2021/6683051] [Citation(s) in RCA: 25] [Impact Index Per Article: 6.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 11/19/2020] [Revised: 12/16/2020] [Accepted: 12/19/2020] [Indexed: 12/16/2022]
Abstract
Metabolic pathway is an important type of biological pathways. It produces essential molecules and energies to maintain the life of living organisms. Each metabolic pathway consists of a chain of chemical reactions, which always need enzymes to participate in. Thus, chemicals and enzymes are two major components for each metabolic pathway. Although several metabolic pathways have been uncovered, the metabolic pathway system is still far from complete. Some hidden chemicals or enzymes are not discovered in a certain metabolic pathway. Besides the traditional experiments to detect hidden chemicals or enzymes, an alternative pipeline is to design efficient computational methods. In this study, we proposed a powerful multilabel classifier, called iMPTCE-Hnetwork, to uniformly assign chemicals and enzymes to metabolic pathway types reported in KEGG. Such classifier adopted the embedding features derived from a heterogeneous network, which defined chemicals and enzymes as nodes and the interactions between chemicals and enzymes as edges, through a powerful network embedding algorithm, Mashup. The popular RAndom k-labELsets (RAKEL) algorithm was employed to construct the classifier, which incorporated the support vector machine (polynomial kernel) as the basic classifier. The ten-fold cross-validation results indicated that such a classifier had good performance with accuracy higher than 0.800 and exact match higher than 0.750. Several comparisons were done to indicate the superiority of the iMPTCE-Hnetwork.
Collapse
|
44
|
Zhang YH, Li Z, Zeng T, Pan X, Chen L, Liu D, Li H, Huang T, Cai YD. Distinguishing Glioblastoma Subtypes by Methylation Signatures. Front Genet 2020; 11:604336. [PMID: 33329750 PMCID: PMC7732602 DOI: 10.3389/fgene.2020.604336] [Citation(s) in RCA: 27] [Impact Index Per Article: 5.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/09/2020] [Accepted: 11/02/2020] [Indexed: 11/13/2022] Open
Abstract
Glioblastoma, also called glioblastoma multiform (GBM), is the most aggressive cancer that initiates within the brain. GBM is produced in the central nervous system. Cancer cells in GBM are similar to stem cells. Several different schemes for GBM stratification exist. These schemes are based on intertumoral molecular heterogeneity, preoperative images, and integrated tumor characteristics. Although the formation of glioblastoma is remarkably related to gene methylation, GBM has been poorly classified by epigenetics. To classify glioblastoma subtypes on the basis of different degrees of genes' methylation, we adopted several powerful machine learning algorithms to identify numerous methylation features (sites) associated with the classification of GBM. The features were first analyzed by an excellent feature selection method, Monte Carlo feature selection (MCFS), resulting in a feature list. Then, such list was fed into the incremental feature selection (IFS), incorporating one classification algorithm, to extract essential sites. These sites can be annotated onto coding genes, such as CXCR4, TBX18, SP5, and TMEM22, and enriched in relevant biological functions related to GBM classification (e.g., subtype-specific functions). Representative functions, such as nervous system development, intrinsic plasma membrane component, calcium ion binding, systemic lupus erythematosus, and alcoholism, are potential pathogenic functions that participate in the initiation and progression of glioblastoma and its subtypes. With these sites, an efficient model can be built to classify the subtypes of glioblastoma.
Collapse
Affiliation(s)
- Yu-Hang Zhang
- School of Life Sciences, Shanghai University, Shanghai, China
- Channing Division of Network Medicine, Brigham and Women’s Hospital, Harvard Medical School, Boston, MA, United States
| | - Zhandong Li
- College of Food Engineering, Jilin Engineering Normal University, Changchun, China
| | - Tao Zeng
- Shanghai Research Center for Brain Science and Brain-Inspired Intelligence, Shanghai, China
| | - Xiaoyong Pan
- Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, and Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai, China
| | - Lei Chen
- College of Information Engineering, Shanghai Maritime University, Shanghai, China
| | - Dejing Liu
- Shanghai Institute of Nutrition and Health, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, China
| | - Hao Li
- College of Food Engineering, Jilin Engineering Normal University, Changchun, China
| | - Tao Huang
- Shanghai Institute of Nutrition and Health, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, China
| | - Yu-Dong Cai
- School of Life Sciences, Shanghai University, Shanghai, China
| |
Collapse
|
45
|
Chen L, Li Z, Zeng T, Zhang YH, Liu D, Li H, Huang T, Cai YD. Identifying Robust Microbiota Signatures and Interpretable Rules to Distinguish Cancer Subtypes. Front Mol Biosci 2020; 7:604794. [PMID: 33330634 PMCID: PMC7672214 DOI: 10.3389/fmolb.2020.604794] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/10/2020] [Accepted: 10/15/2020] [Indexed: 12/11/2022] Open
Abstract
Cancer can be generally defined as a cluster of systematic diseases triggered by abnormal cell proliferation and growth. With the development of biological sciences and biotechnologies, the etiology of cancer is partially revealed, including some of the most substantial pathogenic factors [either endogenous (genetics) or exogenous (environmental)]. However, some remaining factors that contribute to the tumorigenesis but have not been analyzed and discussed in detail remain. For instance, some typical correlations between microorganisms and tumorigenesis have been reported already, but previous studies are just sporadic studies on single microorganism–cancer subtype pairs and do not explain and validate the specific contribution of microbiome on tumorigenesis. On the basis of the systematic microbiome analyses of blood and cancer-associated tissues in cancer patients/controls in public domain, we performed interpretable analyses. We identified several core regulatory microorganisms that contribute to the classification of multiple tumor subtypes and established quantitative predictive models for interpretable prediction by using multiple machine learning methods. We also compared the optimal features (microorganisms) and rules identified from microbiome profiles processed using the Kraken and the SHOGUN. Collectively, our study identified new microbiome signatures and their interpretable classification rules for cancer discrimination and carried out reliable methodological comparison for robust cancer microbiome analyses, thereby promoting the development of tumor etiology at the microbiome level.
Collapse
Affiliation(s)
- Lei Chen
- School of Life Sciences, Shanghai University, Shanghai, China.,College of Information Engineering, Shanghai Maritime University, Shanghai, China
| | - Zhandong Li
- College of Food Engineering, Jilin Engineering Normal University, Changchun, China
| | - Tao Zeng
- Zhangjiang Laboratory, Institute of Brain-Intelligence Technology, Shanghai, China
| | - Yu-Hang Zhang
- Channing Division of Network Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, United States
| | - Dejing Liu
- Key Laboratory of Tissue Microenvironment and Tumor, Shanghai Institute of Nutrition and Health, Chinese Academy of Sciences, Shanghai, China
| | - Hao Li
- College of Food Engineering, Jilin Engineering Normal University, Changchun, China
| | - Tao Huang
- Key Laboratory of Tissue Microenvironment and Tumor, Shanghai Institute of Nutrition and Health, Chinese Academy of Sciences, Shanghai, China
| | - Yu-Dong Cai
- School of Life Sciences, Shanghai University, Shanghai, China
| |
Collapse
|
46
|
Identification of Latent Oncogenes with a Network Embedding Method and Random Forest. BIOMED RESEARCH INTERNATIONAL 2020; 2020:5160396. [PMID: 33029511 PMCID: PMC7530476 DOI: 10.1155/2020/5160396] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 07/20/2020] [Revised: 09/09/2020] [Accepted: 09/14/2020] [Indexed: 12/29/2022]
Abstract
Oncogene is a special type of genes, which can promote the tumor initiation. Good study on oncogenes is helpful for understanding the cause of cancers. Experimental techniques in early time are quite popular in detecting oncogenes. However, their defects become more and more evident in recent years, such as high cost and long time. The newly proposed computational methods provide an alternative way to study oncogenes, which can provide useful clues for further investigations on candidate genes. Considering the limitations of some previous computational methods, such as lack of learning procedures and terming genes as individual subjects, a novel computational method was proposed in this study. The method adopted the features derived from multiple protein networks, viewing proteins in a system level. A classic machine learning algorithm, random forest, was applied on these features to capture the essential characteristic of oncogenes, thereby building the prediction model. All genes except validated oncogenes were ranked with a measurement yielded by the prediction model. Top genes were quite different from potential oncogenes discovered by previous methods, and they can be confirmed to become novel oncogenes. It was indicated that the newly identified genes can be essential supplements for previous results.
Collapse
|
47
|
Alternative Polyadenylation Modification Patterns Reveal Essential Posttranscription Regulatory Mechanisms of Tumorigenesis in Multiple Tumor Types. BIOMED RESEARCH INTERNATIONAL 2020; 2020:6384120. [PMID: 32626751 PMCID: PMC7315320 DOI: 10.1155/2020/6384120] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 04/28/2020] [Revised: 05/26/2020] [Accepted: 05/30/2020] [Indexed: 12/11/2022]
Abstract
Among various risk factors for the initiation and progression of cancer, alternative polyadenylation (APA) is a remarkable endogenous contributor that directly triggers the malignant phenotype of cancer cells. APA affects biological processes at a transcriptional level in various ways. As such, APA can be involved in tumorigenesis through gene expression, protein subcellular localization, or transcription splicing pattern. The APA sites and status of different cancer types may have diverse modification patterns and regulatory mechanisms on transcripts. Potential APA sites were screened by applying several machine learning algorithms on a TCGA-APA dataset. First, a powerful feature selection method, minimum redundancy maximum relevancy, was applied on the dataset, resulting in a feature list. Then, the feature list was fed into the incremental feature selection, which incorporated the support vector machine as the classification algorithm, to extract key APA features and build a classifier. The classifier can classify cancer patients into cancer types with perfect performance. The key APA-modified genes had a potential prognosis ability because of their significant power in the survival analysis of TCGA pan-cancer data.
Collapse
|
48
|
Zhang S, Zeng T, Hu B, Zhang YH, Feng K, Chen L, Niu Z, Li J, Huang T, Cai YD. Discriminating Origin Tissues of Tumor Cell Lines by Methylation Signatures and Dys-Methylated Rules. Front Bioeng Biotechnol 2020; 8:507. [PMID: 32528944 PMCID: PMC7264161 DOI: 10.3389/fbioe.2020.00507] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/07/2020] [Accepted: 04/30/2020] [Indexed: 12/18/2022] Open
Abstract
DNA methylation is an essential epigenetic modification for multiple biological processes. DNA methylation in mammals acts as an epigenetic mark of transcriptional repression. Aberrant levels of DNA methylation can be observed in various types of tumor cells. Thus, DNA methylation has attracted considerable attention among researchers to provide new and feasible tumor therapies. Conventional studies considered single-gene methylation or specific loci as biomarkers for tumorigenesis. However, genome-scale methylated modification has not been completely investigated. Thus, we proposed and compared two novel computational approaches based on multiple machine learning algorithms for the qualitative and quantitative analyses of methylation-associated genes and their dys-methylated patterns. This study contributes to the identification of novel effective genes and the establishment of optimal quantitative rules for aberrant methylation distinguishing tumor cells with different origin tissues.
Collapse
Affiliation(s)
- Shiqi Zhang
- School of Life Sciences, Shanghai University, Shanghai, China.,Department of Biostatistics, University of Copenhagen, Copenhagen, Denmark
| | - Tao Zeng
- Shanghai Research Center for Brain Science and Brain-Inspired Intelligence, Shanghai, China
| | - Bin Hu
- State Key Laboratory of Livestock and Poultry Breeding, Guangdong Public Laboratory of Animal Breeding and Nutrition, Guangdong Key Laboratory of Animal Breeding and Nutrition, Institute of Animal Science, Guangdong Academy of Agricultural Sciences, Guangzhou, China
| | - Yu-Hang Zhang
- Shanghai Institute of Nutrition and Health, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, China
| | - Kaiyan Feng
- Department of Computer Science, Guangdong AIB Polytechnic, Guangzhou, China
| | - Lei Chen
- College of Information Engineering, Shanghai Maritime University, Shanghai, China
| | - Zhibin Niu
- College of Intelligence and Computing, Tianjin University, Tianjin, China
| | - Jianhao Li
- State Key Laboratory of Livestock and Poultry Breeding, Guangdong Public Laboratory of Animal Breeding and Nutrition, Guangdong Key Laboratory of Animal Breeding and Nutrition, Institute of Animal Science, Guangdong Academy of Agricultural Sciences, Guangzhou, China
| | - Tao Huang
- Shanghai Institute of Nutrition and Health, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, China
| | - Yu-Dong Cai
- School of Life Sciences, Shanghai University, Shanghai, China
| |
Collapse
|
49
|
Prediction of Drug Side Effects with a Refined Negative Sample Selection Strategy. COMPUTATIONAL AND MATHEMATICAL METHODS IN MEDICINE 2020; 2020:1573543. [PMID: 32454877 PMCID: PMC7232712 DOI: 10.1155/2020/1573543] [Citation(s) in RCA: 49] [Impact Index Per Article: 9.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 01/28/2020] [Revised: 04/14/2020] [Accepted: 04/23/2020] [Indexed: 01/07/2023]
Abstract
Drugs are an important way to treat various diseases. However, they inevitably produce side effects, bringing great risks to human bodies and pharmaceutical companies. How to predict the side effects of drugs has become one of the essential problems in drug research. Designing efficient computational methods is an alternative way. Some studies paired the drug and side effect as a sample, thereby modeling the problem as a binary classification problem. However, the selection of negative samples is a key problem in this case. In this study, a novel negative sample selection strategy was designed for accessing high-quality negative samples. Such strategy applied the random walk with restart (RWR) algorithm on a chemical-chemical interaction network to select pairs of drugs and side effects, such that drugs were less likely to have corresponding side effects, as negative samples. Through several tests with a fixed feature extraction scheme and different machine-learning algorithms, models with selected negative samples produced high performance. The best model even yielded nearly perfect performance. These models had much higher performance than those without such strategy or with another selection strategy. Furthermore, it is not necessary to consider the balance of positive and negative samples under such a strategy.
Collapse
|
50
|
Yuan F, Pan X, Zeng T, Zhang YH, Chen L, Gan Z, Huang T, Cai YD. Identifying Cell-Type Specific Genes and Expression Rules Based on Single-Cell Transcriptomic Atlas Data. Front Bioeng Biotechnol 2020; 8:350. [PMID: 32411685 PMCID: PMC7201067 DOI: 10.3389/fbioe.2020.00350] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/31/2020] [Accepted: 03/30/2020] [Indexed: 01/07/2023] Open
Abstract
Single-cell sequencing technologies have emerged to address new and longstanding biological and biomedical questions. Previous studies focused on the analysis of bulk tissue samples composed of millions of cells. However, the genomes within the cells of an individual multicellular organism are not always the same. In this study, we aimed to identify the crucial and characteristically expressed genes that may play functional roles in tissue development and organogenesis, by analyzing a single-cell transcriptomic atlas of mice. We identified the most relevant gene features and decision rules classifying 18 cell categories, providing a list of genes that may perform important functions in the process of tissue development because of their tissue-specific expression patterns. These genes may serve as biomarkers to identify the origin of unknown cell subgroups so as to recognize specific cell stages/states during the dynamic process, and also be applied as potential therapy targets for developmental disorders.
Collapse
Affiliation(s)
- Fei Yuan
- School of Life Sciences, Shanghai University, Shanghai, China.,Department of Science and Technology, Binzhou Medical University Hospital, Binzhou, China
| | - XiaoYong Pan
- Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, and Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai, China
| | - Tao Zeng
- Key Laboratory of Systems Biology, Institute of Biochemistry and Cell Biology, Chinese Academy of Sciences, Shanghai, China
| | - Yu-Hang Zhang
- Shanghai Institute of Nutrition and Health, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, China
| | - Lei Chen
- College of Information Engineering, Shanghai Maritime University, Shanghai, China.,Shanghai Key Laboratory of Pure Mathematics and Mathematical Practice, East China Normal University, Shanghai, China
| | - Zijun Gan
- Shanghai Institute of Nutrition and Health, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, China
| | - Tao Huang
- Shanghai Institute of Nutrition and Health, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, China
| | - Yu-Dong Cai
- School of Life Sciences, Shanghai University, Shanghai, China
| |
Collapse
|