1
Raymond WS, DeRoo J, Munsky B. Identification of potential riboswitch elements in Homo sapiens mRNA 5'UTR sequences using positive-unlabeled machine learning. PLoS One 2025; 20:e0320282. [PMID: 40273288 PMCID: PMC12021280 DOI: 10.1371/journal.pone.0320282] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/21/2025] [Accepted: 02/17/2025] [Indexed: 04/26/2025] Open
Abstract
Riboswitches are a class of noncoding RNA structures that interact with target ligands to cause a conformational change that can then execute some regulatory purpose within the cell. Riboswitches are ubiquitous and well characterized in bacteria and prokaryotes, with additional examples also being found in fungi, plants, and yeast. To date, no purely RNA-small molecule riboswitch has been discovered in Homo sapiens. Several analogous riboswitch-like mechanisms have been described within the H. sapiens translatome within the past decade, prompting the question: Is there an H. sapiens riboswitch dependent on only small molecule ligands? In this work, we set out to train positive-unlabeled machine learning classifiers on known riboswitch sequences and apply the classifiers to H. sapiens mRNA 5'UTR sequences found in the 5'UTR database, UTRdb, in the hope of identifying a set of mRNAs to investigate for riboswitch functionality. 67,683 riboswitch sequences were obtained from RNAcentral, sorted by ligand type, and used as positive examples, and 48,031 5'UTR sequences were used as unlabeled, unknown examples. Twenty positive-unlabeled classifiers were trained on sequence and secondary structure features while withholding one or two ligand classes. Cross validation was then performed on the withheld ligand sets to obtain a validation accuracy range of 75%-99%. The joint sets of 5'UTRs identified as potential riboswitches by the 20 classifiers were then analyzed. 1533 sequences were identified as a riboswitch by one or more classifier(s) and 436 of the H. sapiens 5'UTRs were labeled as harboring potential riboswitch elements by all 20 classifiers. These 436 sequences were mapped back to the most similar riboswitches within the positive data and examined.
An online database of identified and ranked 5'UTRs, their features, and their most similar matches to known riboswitches is provided to guide future experimental efforts to identify H. sapiens riboswitches.
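The consensus step described in this abstract (sequences flagged by one or more classifiers versus sequences flagged by all of them) can be sketched as below. This is a minimal illustration only: the function name `consensus`, the sequence IDs, and the three-classifier toy input are hypothetical and are not taken from the paper.

```python
from collections import Counter


def consensus(flagged_per_classifier):
    """Summarize ensemble agreement.

    flagged_per_classifier: list of sets, one per classifier, each holding
    the IDs of sequences that classifier labeled as potential riboswitches.
    Returns (flagged by any, flagged by all, IDs ranked by vote count).
    """
    n_classifiers = len(flagged_per_classifier)
    votes = Counter(seq for flagged in flagged_per_classifier for seq in flagged)
    flagged_by_any = set(votes)
    flagged_by_all = {seq for seq, count in votes.items() if count == n_classifiers}
    # Rank by descending vote count, breaking ties alphabetically.
    ranked = sorted(votes, key=lambda seq: (-votes[seq], seq))
    return flagged_by_any, flagged_by_all, ranked


# Hypothetical toy input: three classifiers instead of twenty.
toy_votes = [{"utr_a", "utr_b"}, {"utr_a", "utr_c"}, {"utr_a", "utr_b"}]
flagged_by_any, flagged_by_all, ranked = consensus(toy_votes)
```

Ranking by vote count gives a natural ordering for follow-up, with unanimously flagged sequences at the top.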
Affiliation(s)
- William S Raymond
  - School of Biomedical Engineering, Colorado State University, Fort Collins, Colorado, United States of America
- Jacob DeRoo
  - School of Biomedical Engineering, Colorado State University, Fort Collins, Colorado, United States of America
- Brian Munsky
  - School of Biomedical Engineering, Colorado State University, Fort Collins, Colorado, United States of America
  - Chemical and Biological Engineering, Colorado State University, Fort Collins, Colorado, United States of America
2
Molaei S, Jalili S. Disease candidate genes prediction using positive labeled and unlabeled instances. BMC Med Genomics 2025; 18:73. [PMID: 40241088 PMCID: PMC12004746 DOI: 10.1186/s12920-025-02109-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/23/2024] [Accepted: 02/18/2025] [Indexed: 04/18/2025] Open
Abstract
Identifying disease genes and understanding their performance is critical in producing drugs for genetic diseases. Disease genes are no longer identified by laboratory approaches alone; computational approaches such as machine learning are increasingly used for this purpose. In machine learning methods, researchers can use only two data types (disease genes and unknown genes) to predict disease candidate genes. Notably, there is no source for a negative data set. The proposed method is a two-step process. The first step extracts reliable negative genes from a set of unlabeled genes by one-class learning and a filter based on distance indicators from known disease genes; this step is performed separately for each disease. The second step learns a binary model using the causal genes of each disease as the positive learning set and the reliable negative genes extracted for that disease. In the prediction and ranking step, each unlabeled gene is assigned a normalized score using the two filters and the learned model. Consequently, disease genes are predicted and ranked. Evaluation of the proposed method on six diseases and the Cancer class indicates better results than other studies.
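The two-step idea in this abstract (first extract reliable negatives from the unlabeled set, then train a binary model on positives versus those negatives) can be sketched as follows. This is not the authors' method: a plain distance-to-centroid filter stands in for their one-class learner and distance indicators, a nearest-centroid scorer stands in for their binary model, and all names, feature vectors, and the `fraction` parameter are hypothetical.

```python
import math


def centroid(points):
    """Component-wise mean of a list of equal-length feature vectors."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]


def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def reliable_negatives(positives, unlabeled, fraction=0.5):
    """Step 1: keep the unlabeled points farthest from the positive centroid."""
    center = centroid(positives)
    by_distance = sorted(unlabeled, key=lambda u: distance(u, center), reverse=True)
    keep = max(1, int(len(by_distance) * fraction))
    return by_distance[:keep]


def train_binary(positives, negatives):
    """Step 2: nearest-centroid model; returns a score in [0, 1], higher = more positive-like."""
    pos_center, neg_center = centroid(positives), centroid(negatives)

    def score(x):
        d_pos, d_neg = distance(x, pos_center), distance(x, neg_center)
        total = d_pos + d_neg
        return d_neg / total if total > 0 else 0.5

    return score


# Hypothetical 2-D gene features.
positives = [[1.0, 1.0], [1.2, 0.9], [0.9, 1.1]]
unlabeled = [[1.1, 1.0], [5.0, 5.0], [4.8, 5.2], [0.8, 1.2]]
negatives = reliable_negatives(positives, unlabeled, fraction=0.5)
score = train_binary(positives, negatives)
ranking = sorted(unlabeled, key=score, reverse=True)
```

The `ranking` list orders unlabeled genes from most to least positive-like, which mirrors the abstract's "predicted and ranked" output.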
Affiliation(s)
- Sepideh Molaei
  - Computer Engineering Department, Tarbiat Modares University, Tehran, Iran
- Saeed Jalili
  - Computer Engineering Department, Tarbiat Modares University, Tehran, Iran
3
Xiao L, Wu J, Fan L, Wang L, Zhu X. CLMT: graph contrastive learning model for microbe-drug associations prediction with transformer. Front Genet 2025; 16:1535279. [PMID: 40144888 PMCID: PMC11936976 DOI: 10.3389/fgene.2025.1535279] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/27/2024] [Accepted: 02/21/2025] [Indexed: 03/28/2025] Open
Abstract
Accurate prediction of microbe-drug associations is essential for drug development and disease diagnosis. However, existing methods often struggle to capture complex nonlinear relationships, effectively model long-range dependencies, and distinguish subtle similarities between microbes and drugs. To address these challenges, this paper introduces a new model for microbe-drug association prediction, CLMT. The proposed model differs from previous approaches in three key ways. Firstly, unlike conventional GCN-based models, CLMT leverages a Graph Transformer network with an attention mechanism to model high-order dependencies in the microbe-drug interaction graph, enhancing its ability to capture long-range associations. Then, we introduce graph contrastive learning, generating multiple augmented views through node perturbation and edge dropout. By optimizing a contrastive loss, CLMT distinguishes subtle structural variations, making the learned embeddings more robust and generalizable. By integrating multi-view contrastive learning and Transformer-based encoding, CLMT effectively mitigates data sparsity issues, significantly outperforming existing methods. Experimental results on three publicly available datasets demonstrate that CLMT achieves state-of-the-art performance, particularly in handling sparse data and nonlinear microbe-drug interactions, confirming its effectiveness for real-world biomedical applications. On the MDAD, aBiofilm, and Drug Virus datasets, CLMT outperforms the previously best model in terms of Accuracy by 4.3%, 3.5%, and 2.8%, respectively.
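The abstract mentions optimizing a contrastive loss over augmented views but does not give its form. As a hedged illustration, the sketch below uses an InfoNCE-style loss, a common choice in graph contrastive learning; whether CLMT uses exactly this loss is an assumption, and the function names, the temperature value, and the toy embeddings are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def info_nce(view1, view2, temperature=0.5):
    """InfoNCE-style loss: node i in view1 should match node i in view2
    and differ from every other node in view2."""
    n = len(view1)
    total = 0.0
    for i in range(n):
        logits = [cosine(view1[i], view2[j]) / temperature for j in range(n)]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / n


# Hypothetical embeddings of two nodes under two augmented views.
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

The loss is low when the two views of the same node agree and high when they are mismatched, which is the property that pushes embeddings to be robust to node perturbation and edge dropout.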
Affiliation(s)
- Liqi Xiao
  - College of Computer Science and Technology, Hengyang Normal University, Hengyang, China
- Junlong Wu
  - College of Computer Science and Technology, Hengyang Normal University, Hengyang, China
- Liu Fan
  - College of Computer Science and Technology, Hengyang Normal University, Hengyang, China
- Lei Wang
  - Technology Innovation Center of Changsha, Changsha University, Changsha, China
- Xianyou Zhu
  - College of Computer Science and Technology, Hengyang Normal University, Hengyang, China
  - Hunan Engineering Research Center of Cyberspace Security Technology and Applications, Hengyang Normal University, Hengyang, China
4
Gong C, Zulfiqar MI, Zhang C, Mahmood S, Yang J. A recent survey on instance-dependent positive and unlabeled learning. Fundamental Research 2025; 5:796-803. [PMID: 40242552 PMCID: PMC11997483 DOI: 10.1016/j.fmre.2022.09.019] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/02/2022] [Revised: 06/22/2022] [Accepted: 09/07/2022] [Indexed: 11/07/2022] Open
Abstract
Training with confident positive-labeled instances has received a lot of attention in Positive and Unlabeled (PU) learning tasks, and this is formally termed "Instance-Dependent PU learning". In instance-dependent PU learning, whether a positive instance is labeled depends on its labeling confidence. In other words, it is assumed that not all positive instances have the same probability of being included in the positive set. Instead, instances that are far from the potential decision boundary are more likely to be labeled than those that are close to it. This setting has practical importance in many real-world applications such as medical diagnosis, outlier detection, and object detection. In this survey, we first present the preliminary knowledge of PU learning, and then review the representative instance-dependent PU learning settings and methods. After that, we thoroughly compare them with typical PU learning methods on various benchmark datasets and analyze their performance. Finally, we discuss the potential directions for future research.
Affiliation(s)
- Chen Gong
  - School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
  - PCA Lab, the Key Laboratory of Intelligent Perception and Systems for High-Dimensional Information, Ministry of Education, Nanjing 210094, China
- Muhammad Imran Zulfiqar
  - School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
  - Department of Computer Science and Information Technology, University of Jhang, Jhang 35200, Pakistan
- Chuang Zhang
  - School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
- Shahid Mahmood
  - Higher Education Department, Punjab, Faisalabad 38000, Pakistan
- Jian Yang
  - School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
5
Wang N, Dong J, Ouyang D. AI-directed formulation strategy design initiates rational drug development. J Control Release 2025; 378:619-636. [PMID: 39719215 DOI: 10.1016/j.jconrel.2024.12.043] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/11/2024] [Revised: 11/27/2024] [Accepted: 12/18/2024] [Indexed: 12/26/2024]
Abstract
Rational drug development would be impossible without selecting the appropriate formulation route. However, pharmaceutical scientists often rely on limited personal experiences to perform trial-and-error tests on diverse formulation strategies. Such an inefficient screening manner not only wastes research investments but also threatens the safety of clinical volunteers and patients. A design-oriented paradigm for formulation strategy determination is urgently needed to initiate rational drug development. Herein, we introduce FormulationDT, the first data-driven and knowledge-guided artificial intelligence (AI) platform for rational formulation strategy design. Learning from approved drug formulations, FormulationDT devised a comprehensive formulation strategy design system containing 12 decisions for both oral and injectable administration. Utilizing PU-Decide, our specialized partially supervised learning framework designed for positive-unlabeled (PU) scenarios, FormulationDT developed precise and interpretable classification models for each decision, achieving area under the receiver operating characteristic curve (ROC_AUC) scores ranging from 0.78 to 0.98, with an average above 0.90. Incorporating extensive domain knowledge, FormulationDT is now accessible through a user-friendly web platform (http://formulationdt.computpharm.org/). Moreover, FormulationDT demonstrates its value by showcasing its application in proteolysis targeting chimeras (PROTACs) and recent drug approvals. Overall, this study created the first approved drug formulation dataset and tailored the PU-Decide framework to develop a high-performance, interpretable, and user-friendly AI formulation strategy design platform, which holds promise for driving risk reduction and efficiency gains across the life cycle of drug discovery and development.
Affiliation(s)
- Nannan Wang
  - State Key Laboratory of Quality Research in Chinese Medicine, Institute of Chinese Medical Sciences (ICMS), University of Macau, Macau, China
- Jie Dong
  - Xiangya School of Pharmaceutical Sciences, Central South University, Changsha, China
- Defang Ouyang
  - State Key Laboratory of Quality Research in Chinese Medicine, Institute of Chinese Medical Sciences (ICMS), University of Macau, Macau, China
  - Department of Public Health and Medicinal Administration, Faculty of Health Sciences (FHS), University of Macau, Macau, China
6
Raymond WS, DeRoo J, Munsky B. Identification of potential riboswitch elements in Homo sapiens mRNA 5'UTR sequences using Positive-Unlabeled machine learning. bioRxiv 2024:2023.11.23.568398. [PMID: 39677788 PMCID: PMC11642740 DOI: 10.1101/2023.11.23.568398] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/17/2024]
Abstract
Riboswitches are a class of noncoding RNA structures that interact with target ligands to cause a conformational change that can then execute some regulatory purpose within the cell. Riboswitches are ubiquitous and well characterized in bacteria and prokaryotes, with additional examples also being found in fungi, plants, and yeast. To date, no purely RNA-small molecule riboswitch has been discovered in Homo sapiens. Several analogous riboswitch-like mechanisms have been described within the H. sapiens translatome within the past decade, prompting the question: Is there an H. sapiens riboswitch dependent on only small molecule ligands? In this work, we set out to train positive-unlabeled machine learning classifiers on known riboswitch sequences and apply the classifiers to H. sapiens mRNA 5'UTR sequences found in the 5'UTR database, UTRdb, in the hope of identifying a set of mRNAs to investigate for riboswitch functionality. 67,683 riboswitch sequences were obtained from RNAcentral, sorted by ligand type, and used as positive examples, and 48,031 5'UTR sequences were used as unlabeled, unknown examples. Twenty positive-unlabeled classifiers were trained on sequence and secondary structure features while withholding one or two ligand classes. Cross validation was then performed on the withheld ligand sets to obtain a validation accuracy range of 75%-99%. The joint sets of 5'UTRs identified as potential riboswitches by the 20 classifiers were then analyzed. 15333 sequences were identified as a riboswitch by one or more classifier(s) and 436 of the H. sapiens 5'UTRs were labeled as harboring potential riboswitch elements by all 20 classifiers. These 436 sequences were mapped back to the most similar riboswitches within the positive data and examined.
An online database of identified and ranked 5'UTRs, their features, and their most similar matches to known riboswitches is provided to guide future experimental efforts to identify H. sapiens riboswitches.
Affiliation(s)
- William S. Raymond
  - School of Biomedical Engineering, Colorado State University, Fort Collins, CO 80523, USA
- Jacob DeRoo
  - School of Biomedical Engineering, Colorado State University, Fort Collins, CO 80523, USA
- Brian Munsky
  - School of Biomedical Engineering, Colorado State University, Fort Collins, CO 80523, USA
  - Chemical and Biological Engineering, Colorado State University, Fort Collins, CO 80523, USA
7
Shi W, Zhang Y, Sun Y, Lin Z. Function-Genes and Disease-Genes Prediction Based on Network Embedding and One-Class Classification. Interdiscip Sci 2024; 16:781-801. [PMID: 39230798 DOI: 10.1007/s12539-024-00638-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/02/2023] [Revised: 05/14/2024] [Accepted: 05/21/2024] [Indexed: 09/05/2024]
Abstract
Genes that have been experimentally validated for diseases (or functions) can be used to develop machine learning methods that predict new disease/function-genes. However, the prediction of both function-genes and disease-genes faces the same problem: there are only certain positive examples, but no negative examples. To solve this problem, we proposed a function/disease-genes prediction algorithm, VGAEMCD, based on network embedding (Variational Graph Auto-Encoders, VGAE) and one-class classification (Fast Minimum Covariance Determinant, Fast-MCD). First, we constructed a protein-protein interaction (PPI) network centered on experimentally validated genes; then VGAE was used to obtain the embeddings of nodes (genes) in the network; finally, the embeddings were input into the improved deep learning one-class classifier based on Fast-MCD to predict function/disease-genes. VGAEMCD can predict function-genes and disease-genes in a unified way, and only the experimentally verified genes need to be provided (no expression profile is required). VGAEMCD outperforms classical one-class classification algorithms in Recall, Precision, F-measure, Specificity, and Accuracy. Further experiments show that seven metrics of VGAEMCD are higher than those of state-of-the-art function/disease-genes prediction algorithms. The above results indicate that VGAEMCD can learn the distribution characteristics of positive examples well and accurately identify function/disease-genes.
Affiliation(s)
- Weiyu Shi
  - College of Maritime Economics and Management, Dalian Maritime University, Dalian, 116026, China
- Yan Zhang
  - Institute of Environmental Systems Biology, College of Environmental Science and Engineering, Dalian Maritime University, Dalian, 116026, China
- Yeqing Sun
  - Institute of Environmental Systems Biology, College of Environmental Science and Engineering, Dalian Maritime University, Dalian, 116026, China
- Zhengkui Lin
  - College of Maritime Economics and Management, Dalian Maritime University, Dalian, 116026, China
8
Mandal S, Jammal AA, Malek D, Medeiros FA. Progression or Aging? A Deep Learning Approach for Distinguishing Glaucoma Progression From Age-Related Changes in OCT Scans. Am J Ophthalmol 2024; 266:46-55. [PMID: 38703802 DOI: 10.1016/j.ajo.2024.04.030] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/06/2023] [Revised: 04/16/2024] [Accepted: 04/29/2024] [Indexed: 05/06/2024]
Abstract
PURPOSE: To develop a deep learning (DL) algorithm to detect glaucoma progression using optical coherence tomography (OCT) images, in the absence of a reference standard.
DESIGN: Retrospective cohort study.
METHODS: Glaucomatous and healthy eyes with ≥5 reliable peripapillary OCT (Spectralis, Heidelberg Engineering) circle scans were included. A weakly supervised time-series learning model, called noise positive-unlabeled (Noise-PU) DL, was developed to classify whether sequences of OCT B-scans showed glaucoma progression. The model used 2 learning schemes, one to identify age-related changes by differentiating test sequences from glaucoma vs healthy eyes, and the other to identify test-retest variability based on scrambled OCTs of glaucoma eyes. Both models were based on convolutional neural networks (CNN) and long short-term memory (LSTM) networks, which were combined to form a CNN-LSTM model. Model features were combined and jointly trained to identify glaucoma progression, accounting for age-related loss. The DL model's outcomes were compared with ordinary least squares (OLS) regression of retinal nerve fiber layer (RNFL) thickness over time, matched for specificity. The hit ratio was used as a proxy for sensitivity.
RESULTS: Eight thousand seven hundred eighty-five follow-up sequences of 5 consecutive OCT tests from 3253 eyes (1859 subjects) were included in the study. The mean follow-up time was 3.5 ± 1.6 years. In the test sample, the hit ratios of the DL and OLS methods were 0.498 (95% CI: 0.470-0.526) and 0.284 (95% CI: 0.258-0.309), respectively (P < .001), when the specificities were equalized to 95%.
CONCLUSION: A DL model was able to identify longitudinal glaucomatous structural changes in OCT B-scans using a surrogate reference standard for progression.
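Comparing two methods "matched for specificity" means choosing each method's cutoff so that the same share of healthy eyes falls below it, then counting how many glaucoma sequences exceed that cutoff (the hit ratio). A minimal sketch of that evaluation is shown below. It assumes a generic progression score where higher means more progression; the function names and all numbers are hypothetical and do not reproduce the study's data.

```python
import math


def threshold_at_specificity(healthy_scores, specificity=0.95):
    """Return the smallest observed cutoff such that at least `specificity`
    of healthy scores are at or below it (and so are not flagged)."""
    ordered = sorted(healthy_scores)
    must_pass = math.ceil(specificity * len(ordered))
    return ordered[must_pass - 1]


def hit_ratio(glaucoma_scores, cutoff):
    """Fraction of glaucoma sequences flagged as progressing (score above cutoff)."""
    return sum(score > cutoff for score in glaucoma_scores) / len(glaucoma_scores)


# Hypothetical progression scores: 20 healthy sequences, 4 glaucoma sequences.
healthy = [float(i) for i in range(1, 21)]
glaucoma = [5.0, 25.0, 30.0, 10.0]

cutoff = threshold_at_specificity(healthy, specificity=0.95)
achieved_specificity = sum(s <= cutoff for s in healthy) / len(healthy)
ratio = hit_ratio(glaucoma, cutoff)
```

Running the same two functions on each method's scores, with the same target specificity, yields hit ratios that can be compared directly.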
Affiliation(s)
- Sayan Mandal
  - From the Department of Electrical and Computer Engineering, Pratt School of Engineering (S.M., F.A.M.), Duke University, Durham, North Carolina, USA
- Alessandro A Jammal
  - Duke Eye Center and Department of Ophthalmology (A.A.J., F.A.M.), Duke University, Durham, North Carolina, USA
  - Bascom Palmer Eye Institute (A.A.J., D.M., F.A.M.), University of Miami, Miami, Florida, USA
- Davina Malek
  - Bascom Palmer Eye Institute (A.A.J., D.M., F.A.M.), University of Miami, Miami, Florida, USA
- Felipe A Medeiros
  - From the Department of Electrical and Computer Engineering, Pratt School of Engineering (S.M., F.A.M.), Duke University, Durham, North Carolina, USA
  - Duke Eye Center and Department of Ophthalmology (A.A.J., F.A.M.), Duke University, Durham, North Carolina, USA
  - Bascom Palmer Eye Institute (A.A.J., D.M., F.A.M.), University of Miami, Miami, Florida, USA
9
Zhapa-Camacho F, Tang Z, Kulmanov M, Hoehndorf R. Predicting protein functions using positive-unlabeled ranking with ontology-based priors. Bioinformatics 2024; 40:i401-i409. [PMID: 38940168 PMCID: PMC11211813 DOI: 10.1093/bioinformatics/btae237] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/29/2024] Open
Abstract
Automated protein function prediction is a crucial and widely studied problem in bioinformatics. Computationally, protein function is a multilabel classification problem where only positive samples are defined and there is a large number of unlabeled annotations. Most existing methods rely on the assumption that the unlabeled set of protein function annotations are negatives, inducing the false negative issue, where potential positive samples are trained as negatives. We introduce a novel approach named PU-GO, wherein we address function prediction as a positive-unlabeled ranking problem. We apply empirical risk minimization, i.e. we minimize the classification risk of a classifier where class priors are obtained from the Gene Ontology hierarchical structure. We show that our approach is more robust than other state-of-the-art methods on similarity-based and time-based benchmark datasets.
AVAILABILITY AND IMPLEMENTATION: Data and code are available at https://github.com/bio-ontology-research-group/PU-GO.
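To make the risk-minimization idea concrete, the sketch below computes a non-negative PU risk with a sigmoid surrogate loss, a widely used estimator in the PU literature. It is an assumption that this resembles PU-GO's objective; the abstract does not give the exact loss, and PU-GO derives its class priors from the Gene Ontology hierarchy, whereas `prior` here is a plain hypothetical constant. The scores are made up for illustration.

```python
import math


def sigmoid_loss(score, label):
    """Sigmoid surrogate loss 1 / (1 + exp(label * score)); label is +1 or -1."""
    return 1.0 / (1.0 + math.exp(label * score))


def mean(values):
    return sum(values) / len(values)


def nn_pu_risk(pos_scores, unl_scores, prior):
    """Non-negative PU risk.

    pos_scores: classifier outputs on labeled positives
    unl_scores: classifier outputs on unlabeled samples
    prior:      assumed fraction of positives among unlabeled samples
    """
    risk_pos_as_pos = mean([sigmoid_loss(s, +1) for s in pos_scores])
    risk_pos_as_neg = mean([sigmoid_loss(s, -1) for s in pos_scores])
    risk_unl_as_neg = mean([sigmoid_loss(s, -1) for s in unl_scores])
    # The unlabeled set mixes positives and negatives, so the positives'
    # contribution is subtracted; clamping at zero keeps the estimate valid.
    negative_part = max(0.0, risk_unl_as_neg - prior * risk_pos_as_neg)
    return prior * risk_pos_as_pos + negative_part


# Hypothetical scores: one hidden positive among four unlabeled samples.
good = nn_pu_risk([3.0, 4.0], [3.0, -3.0, -4.0, -3.0], prior=0.25)
bad = nn_pu_risk([-3.0, -4.0], [-3.0, 3.0, 4.0, 3.0], prior=0.25)
uninformative = nn_pu_risk([0.0, 0.0], [0.0, 0.0], prior=0.5)
```

A classifier that scores positives high and most unlabeled samples low gets a much smaller risk than one with the signs flipped, without ever treating the unlabeled set as pure negatives.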
Affiliation(s)
- Fernando Zhapa-Camacho
  - Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology, Thuwal, 23955-6900, Saudi Arabia
  - Computer, Electrical and Mathematical Sciences & Engineering Division (CEMSE), King Abdullah University of Science and Technology, Thuwal, 23955-6900, Saudi Arabia
- Zhenwei Tang
  - Department of Computer Science, University of Toronto, Toronto, ON M5S 1A1, Canada
- Maxat Kulmanov
  - Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology, Thuwal, 23955-6900, Saudi Arabia
  - Computer, Electrical and Mathematical Sciences & Engineering Division (CEMSE), King Abdullah University of Science and Technology, Thuwal, 23955-6900, Saudi Arabia
  - SDAIA-KAUST Center of Excellence in Data Science and Artificial Intelligence, King Abdullah University of Science and Technology, Thuwal, 23955-6900, Saudi Arabia
- Robert Hoehndorf
  - Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology, Thuwal, 23955-6900, Saudi Arabia
  - Computer, Electrical and Mathematical Sciences & Engineering Division (CEMSE), King Abdullah University of Science and Technology, Thuwal, 23955-6900, Saudi Arabia
  - SDAIA-KAUST Center of Excellence in Data Science and Artificial Intelligence, King Abdullah University of Science and Technology, Thuwal, 23955-6900, Saudi Arabia
10
Yang Z, Wang L, Zhang X, Zeng B, Zhang Z, Liu X. LCASPMDA: a computational model for predicting potential microbe-drug associations based on learnable graph convolutional attention networks and self-paced iterative sampling ensemble. Front Microbiol 2024; 15:1366272. [PMID: 38846568 PMCID: PMC11153849 DOI: 10.3389/fmicb.2024.1366272] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/06/2024] [Accepted: 05/06/2024] [Indexed: 06/09/2024] Open
Abstract
Introduction: Numerous studies show that microbes in the human body are very closely linked to the human host and can affect the host by modulating the efficacy and toxicity of drugs. However, discovering potential microbe-drug associations through traditional wet-lab experiments is expensive and time-consuming; hence, it is important and necessary to develop effective computational models to detect possible microbe-drug associations.
Methods: In this manuscript, we proposed a new prediction model named LCASPMDA by combining the learnable graph convolutional attention network and the self-paced iterative sampling ensemble strategy to infer latent microbe-drug associations. In LCASPMDA, we first constructed a heterogeneous network based on newly downloaded known microbe-drug associations. Then, we adopted the learnable graph convolutional attention network to learn the hidden features of nodes in the heterogeneous network. After that, we utilized the self-paced iterative sampling ensemble strategy to select the most informative negative samples to train the Multi-Layer Perceptron (MLP) classifier and put the newly extracted hidden features into the trained MLP classifier to infer possible microbe-drug associations.
Results and discussion: Intensive experimental results on two different public databases, MDAD and aBiofilm, showed that LCASPMDA could achieve better performance than state-of-the-art baseline methods in microbe-drug association prediction.
Affiliation(s)
- Lei Wang
  - Big Data Innovation and Entrepreneurship Education Center of Hunan Province, Changsha University, Changsha, China
- Zhen Zhang
  - Big Data Innovation and Entrepreneurship Education Center of Hunan Province, Changsha University, Changsha, China
- Xin Liu
  - Big Data Innovation and Entrepreneurship Education Center of Hunan Province, Changsha University, Changsha, China
11
Ansari M, White AD. Learning peptide properties with positive examples only. Digital Discovery 2024; 3:977-986. [PMID: 38756224 PMCID: PMC11094695 DOI: 10.1039/d3dd00218g] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/05/2023] [Accepted: 03/30/2024] [Indexed: 05/18/2024]
Abstract
Deep learning can create accurate predictive models by exploiting existing large-scale experimental data, and guide the design of molecules. However, a major barrier is the requirement of both positive and negative examples in the classical supervised learning frameworks. Notably, most peptide databases come with missing information and low number of observations on negative examples, as such sequences are hard to obtain using high-throughput screening methods. To address this challenge, we solely exploit the limited known positive examples in a semi-supervised setting, and discover peptide sequences that are likely to map to certain antimicrobial properties via positive-unlabeled learning (PU). In particular, we use the two learning strategies of adapting base classifier and reliable negative identification to build deep learning models for inferring solubility, hemolysis, binding against SHP-2, and non-fouling activity of peptides, given their sequence. We evaluate the predictive performance of our PU learning method and show that by only using the positive data, it can achieve competitive performance when compared with the classical positive-negative (PN) classification approach, where there is access to both positive and negative examples.
Affiliation(s)
- Mehrad Ansari
  - Department of Chemical Engineering, University of Rochester, Rochester, NY 14627, USA
- Andrew D White
  - Department of Chemical Engineering, University of Rochester, Rochester, NY 14627, USA
12
Xu S, Kelkar NS, Ackerman ME. Positive-unlabeled learning to infer protection status and identify correlates in vaccine efficacy field trials. iScience 2024; 27:109086. [PMID: 39295637 PMCID: PMC11409573 DOI: 10.1016/j.isci.2024.109086] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 06/29/2023] [Revised: 11/29/2023] [Accepted: 01/29/2024] [Indexed: 09/21/2024] Open
Abstract
Correlates of protection (CoPs) are key guideposts that both support vaccine development and licensure as well as improve our understanding of the attributes of immune responses that may directly provide protection. Unfortunately, factors such as low rate of exposure and low efficacy can result in low power to discover correlates in field trials-making it difficult to identify these guideposts for the pathogens against which there is greatest need for further insights. To address this gap, we examine the ability of positive-unlabeled (PU) learning approaches to use immunogenicity data and infection status outcomes to accurately predict protection status. We report a combination of PU bagging and two-step reliable negative techniques that accurately classify the protection status of unlabeled (uninfected) samples from synthetic and real-world humoral immune response profiles in human trials and animal models and lead to the discovery of CoPs that are "missed" using conventional infection status case-control analysis.
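PU bagging, one of the two techniques named in this abstract, can be sketched as follows: repeatedly treat a random subset of unlabeled samples as provisional negatives, fit a base classifier against the labeled positives, score only the unlabeled samples left out of that subset, and average those out-of-bag scores. This is a minimal sketch rather than the authors' pipeline: the nearest-centroid base learner, the function names, the toy feature vectors, and the number of rounds are all hypothetical choices.

```python
import math
import random


def centroid(points):
    """Component-wise mean of a list of equal-length feature vectors."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]


def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pu_bagging(positives, unlabeled, n_rounds=50, seed=0):
    """Average out-of-bag positive-likeness score for each unlabeled sample.

    Requires at least two unlabeled samples. Returns None for a sample that
    was never out of bag.
    """
    rng = random.Random(seed)
    totals = [0.0] * len(unlabeled)
    counts = [0] * len(unlabeled)
    bag_size = min(len(positives), len(unlabeled) - 1)
    pos_center = centroid(positives)
    for _ in range(n_rounds):
        # Provisional negatives for this round.
        bag = set(rng.sample(range(len(unlabeled)), bag_size))
        neg_center = centroid([unlabeled[i] for i in bag])
        for i, x in enumerate(unlabeled):
            if i in bag:
                continue  # score only out-of-bag samples
            d_pos, d_neg = distance(x, pos_center), distance(x, neg_center)
            total = d_pos + d_neg
            totals[i] += d_neg / total if total > 0 else 0.5
            counts[i] += 1
    return [t / c if c else None for t, c in zip(totals, counts)]


# Hypothetical immune-response features: the first two unlabeled samples
# resemble the labeled positives, the remaining four do not.
positives = [[1.0, 1.0], [1.1, 0.9], [0.9, 1.1]]
unlabeled = [[1.0, 1.1], [0.9, 0.9], [5.0, 5.0], [5.1, 4.9], [4.9, 5.1], [5.0, 5.2]]
scores = pu_bagging(positives, unlabeled)
```

Averaging over many random bags reduces the damage done when hidden positives are occasionally drawn as provisional negatives.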
Affiliation(s)
- Shiwei Xu
  - Quantitative Biological Sciences Program, Dartmouth College, Hanover, NH 03755, USA
- Natasha S. Kelkar
  - Department of Microbiology and Immunology, Geisel School of Medicine at Dartmouth, Dartmouth College, Hanover, NH 03755, USA
- Margaret E. Ackerman
  - Quantitative Biological Sciences Program, Dartmouth College, Hanover, NH 03755, USA
  - Department of Microbiology and Immunology, Geisel School of Medicine at Dartmouth, Dartmouth College, Hanover, NH 03755, USA
  - Thayer School of Engineering, Dartmouth College, Hanover, NH 03755, USA
13
Xie J, Rao J, Xie J, Zhao H, Yang Y. Predicting disease-gene associations through self-supervised mutual infomax graph convolution network. Comput Biol Med 2024; 170:108048. [PMID: 38310804 DOI: 10.1016/j.compbiomed.2024.108048] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/04/2023] [Revised: 12/19/2023] [Accepted: 01/26/2024] [Indexed: 02/06/2024]
Abstract
Illuminating associations between diseases and genes can help reveal the pathogenesis of syndromes and contribute to treatments, but a large number of associations remain unexplored. To identify novel disease-gene associations, many computational methods have been developed using disease- and gene-related prior knowledge. However, these methods still perform relatively poorly because of the limited external data sources and the inevitable noise in the prior knowledge. In this study, we have developed a new method, Self-Supervised Mutual Infomax Graph Convolution Network (MiGCN), to predict disease-gene associations under the guidance of external disease-disease and gene-gene collaborative graphs. Noise within the collaborative graphs was eliminated by maximizing the mutual information between nodes and neighbors through a graphical mutual infomax layer. In parallel, the node interactions were strengthened by a novel informative message passing layer to improve the learning ability of the graph neural network. Extensive experiments showed that our model improved on the state-of-the-art method by more than 8% in AUC. The datasets, source code and trained models of MiGCN are available at https://github.com/biomed-AI/MiGCN.
Affiliation(s)
- Jiancong Xie
- School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, 510000, China
- Jiahua Rao
- School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, 510000, China
- Junjie Xie
- School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, 510000, China
- Huiying Zhao
- School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, 510000, China
- Yuedong Yang
- School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, 510000, China
|
14
|
Zhao Y, Yin J, Zhang L, Zhang Y, Chen X. Drug-drug interaction prediction: databases, web servers and computational models. Brief Bioinform 2023; 25:bbad445. [PMID: 38113076 PMCID: PMC10782925 DOI: 10.1093/bib/bbad445] [Citation(s) in RCA: 16] [Impact Index Per Article: 8.0] [Received: 07/21/2023] [Revised: 10/26/2023] [Accepted: 11/14/2023] [Indexed: 12/21/2023] Open
Abstract
In clinical treatment, two or more drugs (i.e. a drug combination) are used simultaneously or successively, primarily to enhance therapeutic efficacy or reduce drug side effects. However, an inappropriate drug combination may not only fail to improve efficacy, but even lead to adverse reactions. Therefore, following the basic principle of improving efficacy and/or reducing adverse reactions, we should study drug-drug interactions (DDIs) comprehensively and thoroughly so that drug combinations can be used rationally. In this review, we first introduce the basic concept and classification of DDIs. We then briefly describe some important publicly available databases and web servers covering experimentally verified or predicted DDIs. As an effective auxiliary tool, computational models for predicting DDIs can not only save the cost of biological experiments, but also provide relevant guidance for combination therapy to some extent. We therefore summarize three types of prediction models proposed in recent years (traditional machine learning-based models, deep learning-based models and score function-based models) and discuss their advantages and limitations. Finally, we point out the problems that need to be solved in future research on DDI prediction and provide corresponding suggestions.
Affiliation(s)
- Yan Zhao
- School of Information and Control Engineering, China University of Mining and Technology, Xuzhou 221116, China
- Jun Yin
- School of Information and Control Engineering, China University of Mining and Technology, Xuzhou 221116, China
- Li Zhang
- School of Information and Control Engineering, China University of Mining and Technology, Xuzhou 221116, China
- Yong Zhang
- School of Information and Control Engineering, China University of Mining and Technology, Xuzhou 221116, China
- Xing Chen
- School of Science, Jiangnan University, Wuxi 214122, China
|
15
|
Molotkov I, Artomov M. Detecting biased validation of predictive models in the positive-unlabeled setting: disease gene prioritization case study. Bioinformatics Advances 2023; 3:vbad128. [PMID: 37745001 PMCID: PMC10517638 DOI: 10.1093/bioadv/vbad128] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/14/2023] [Revised: 08/13/2023] [Accepted: 09/12/2023] [Indexed: 09/26/2023]
Abstract
Motivation: Positive-unlabeled data consists of points with either positive or unknown labels. It is widespread in medical, genetic, and biological settings, creating a high demand for predictive positive-unlabeled models. The performance of such models is usually estimated using validation sets, assumed to be selected completely at random (SCAR) from known positive examples. For certain metrics, this assumption enables unbiased performance estimation when treating positive-unlabeled data as positive/negative. However, the SCAR assumption is often adopted without proper justification, simply for the sake of convenience. Results: We provide an algorithm that, under the weak assumption of a lower bound on the number of positive examples, can test for violation of the SCAR assumption. Applying it to the problem of gene prioritization for complex genetic traits, we illustrate that the SCAR assumption is often violated there, causing the inflation of performance estimates, which we refer to as validation bias. We estimate the potential impact of validation bias on performance estimation. Our analysis reveals that validation bias is widespread in gene prioritization data and can lead to significant overestimation of model performance. This finding elucidates the discrepancy between the reported good performance of models and their limited practical applications. Availability and implementation: Python code with examples of application of the validation bias detection algorithm is available at github.com/ArtomovLab/ValidationBias.
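For readers unfamiliar with the SCAR assumption mentioned in this abstract, its standard textbook formulation can be written as follows. This is general background, not the paper's test: the symbols $y$ (true class), $s$ (whether an example is labeled) and $c$ (label frequency) are notation introduced here for illustration.

```latex
% SCAR: labeled positives are a uniform random sample of all positives.
P(s = 1 \mid x, y = 1) = P(s = 1 \mid y = 1) = c
% Only positives are ever labeled, so P(s = 1 \mid x, y = 0) = 0, which gives
P(s = 1 \mid x) = c \, P(y = 1 \mid x)
\quad\Longrightarrow\quad
P(y = 1 \mid x) = \frac{P(s = 1 \mid x)}{c}
```

When the chance of a positive being labeled depends on $x$, the factor $c$ is no longer constant and this rescaling no longer holds, which is the situation the abstract describes as a SCAR violation.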
Affiliation(s)
- Ivan Molotkov
- The Steve and Cindy Rasmussen Institute for Genomic Medicine, Nationwide Children’s Hospital, Columbus, OH, United States
- Department of Pediatrics, The Ohio State University, Columbus, OH, United States
- ITMO University, Saint Petersburg, Russia
- Mykyta Artomov
- The Steve and Cindy Rasmussen Institute for Genomic Medicine, Nationwide Children’s Hospital, Columbus, OH, United States
- Department of Pediatrics, The Ohio State University, Columbus, OH, United States
|
16
|
Mastropietro A, De Carlo G, Anagnostopoulos A. XGDAG: explainable gene-disease associations via graph neural networks. Bioinformatics 2023; 39:btad482. [PMID: 37531293 PMCID: PMC10421968 DOI: 10.1093/bioinformatics/btad482] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 04/11/2023] [Revised: 06/27/2023] [Accepted: 08/01/2023] [Indexed: 08/04/2023] Open
Abstract
MOTIVATION: Disease gene prioritization consists of identifying genes that are likely to be involved in the mechanisms of a given disease and providing a ranking of such genes. Recently, the research community has used computational methods to uncover unknown gene-disease associations; these methods range from combinatorial to machine learning-based approaches. In particular, in recent years, approaches based on deep learning have provided superior results compared to more traditional ones. Yet, the problem with these is their inherent black-box structure, which prevents interpretability. RESULTS: We propose a new methodology for disease gene discovery, which leverages graph-structured data using graph neural networks (GNNs) along with an explainability phase for determining the ranking of candidate genes and understanding the model's output. Our approach is based on a positive-unlabeled learning strategy, which outperforms existing gene discovery methods by exploiting GNNs in a non-black-box fashion. Our methodology is effective even in scenarios where a large number of associated genes need to be retrieved, in which gene prioritization methods often tend to lose their reliability. AVAILABILITY AND IMPLEMENTATION: The source code of XGDAG is available on GitHub at: https://github.com/GiDeCarlo/XGDAG. The data underlying this article are available at: https://www.disgenet.org/, https://thebiogrid.org/, https://doi.org/10.1371/journal.pcbi.1004120.s003, and https://doi.org/10.1371/journal.pcbi.1004120.s004.
Affiliation(s)
- Andrea Mastropietro
- Department of Computer, Control and Management Engineering “Antonio Ruberti”, Sapienza University of Rome, Rome 00185, Italy
- Gianluca De Carlo
- Department of Computer, Control and Management Engineering “Antonio Ruberti”, Sapienza University of Rome, Rome 00185, Italy
- Aris Anagnostopoulos
- Department of Computer, Control and Management Engineering “Antonio Ruberti”, Sapienza University of Rome, Rome 00185, Italy
|
17
|
Chandra O, Sharma M, Pandey N, Jha IP, Mishra S, Kong SL, Kumar V. Patterns of transcription factor binding and epigenome at promoters allow interpretable predictability of multiple functions of non-coding and coding genes. Comput Struct Biotechnol J 2023; 21:3590-3603. [PMID: 37520281 PMCID: PMC10371796 DOI: 10.1016/j.csbj.2023.07.014] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/25/2023] [Revised: 07/05/2023] [Accepted: 07/11/2023] [Indexed: 08/01/2023] Open
Abstract
Understanding the biological roles of all genes only through experimental methods is challenging. A computational approach with reliable interpretability is needed to infer the function of genes, particularly for non-coding RNAs. We have analyzed genomic features that are present across both coding and non-coding genes, such as transcription factor (TF) and cofactor ChIP-seq (n = 823), histone modification ChIP-seq (n = 621), cap analysis gene expression (CAGE) tags (n = 255), and DNase hypersensitivity profiles (n = 255), to predict ontology-based functions of genes. Our approach for gene function prediction was reliable (>90% balanced accuracy) for 486 gene-sets. PubMed abstract mining and CRISPR screens supported the inferred association of genes with biological functions, for which our method had high accuracy. Further analysis revealed that TF-binding patterns at promoters have high predictive strength for multiple functions. TF-binding patterns at the promoter add an unexplored dimension of explainable regulatory aspects of genes and their functions. Therefore, we performed a comprehensive analysis of the functional specificity of TF-binding patterns at promoters and used them for clustering functions to reveal many latent groups of gene-sets involved in common major cellular processes. We also showed how our approach could be used to infer the functions of non-coding genes using the CRISPR screens of coding genes, which were validated using a long non-coding RNA CRISPR screen. Thus, our results demonstrated the generality of our approach by using gene-sets from CRISPR screens. Overall, our approach opens an avenue for predicting the involvement of non-coding genes in various functions.
Affiliation(s)
- Omkar Chandra
- Department of Computational Biology, Indraprastha Institute of Information Technology, Okhla Ph-III, New Delhi, India
- Madhu Sharma
- Department of Computational Biology, Indraprastha Institute of Information Technology, Okhla Ph-III, New Delhi, India
- Neetesh Pandey
- Department of Computational Biology, Indraprastha Institute of Information Technology, Okhla Ph-III, New Delhi, India
- Indra Prakash Jha
- Department of Computational Biology, Indraprastha Institute of Information Technology, Okhla Ph-III, New Delhi, India
- Shreya Mishra
- Department of Computational Biology, Indraprastha Institute of Information Technology, Okhla Ph-III, New Delhi, India
- Say Li Kong
- Genome Institute of Singapore, Agency for Science Technology and Research, Singapore, Singapore
- Vibhor Kumar
- Department of Computational Biology, Indraprastha Institute of Information Technology, Okhla Ph-III, New Delhi, India
|
18
|
Ansari M, White AD. Learning Peptide Properties with Positive Examples Only. bioRxiv 2023:2023.06.01.543289. [PMID: 37333233 PMCID: PMC10274696 DOI: 10.1101/2023.06.01.543289] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/20/2023]
Abstract
Deep learning can create accurate predictive models by exploiting existing large-scale experimental data, and guide the design of molecules. However, a major barrier is the requirement of both positive and negative examples in the classical supervised learning frameworks. Notably, most peptide databases come with missing information and few observations of negative examples, as such sequences are hard to obtain using high-throughput screening methods. To address this challenge, we exploit only the limited known positive examples in a semi-supervised setting, and discover peptide sequences that are likely to map to certain antimicrobial properties via positive-unlabeled (PU) learning. In particular, we use the two learning strategies of adapting the base classifier and reliable negative identification to build deep learning models for inferring solubility, hemolysis, binding against SHP-2, and non-fouling activity of peptides, given their sequence. We evaluate the predictive performance of our PU learning method and show that, using only the positive data, it can achieve competitive performance compared with the classical positive-negative (PN) classification approach, where there is access to both positive and negative examples.
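The reliable negative identification strategy named in this abstract can be sketched in two steps. This is a minimal illustration under stated assumptions, not the authors' deep learning models: the least-squares linear scorer, the function names, and the `neg_fraction` parameter are hypothetical. Step 1 trains on positives against all unlabeled samples and keeps the lowest-scoring unlabeled samples as reliable negatives; step 2 retrains on positives against those reliable negatives only.

```python
import numpy as np


def _fit(X, y):
    # Hypothetical base learner: least-squares linear scorer with a bias term.
    A = np.hstack([X, np.ones((X.shape[0], 1))])
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return w


def _score(w, X):
    return np.hstack([X, np.ones((X.shape[0], 1))]) @ w


def two_step_pu_scores(X_pos, X_unl, neg_fraction=0.3):
    """Return (final scores for unlabeled samples, indices of reliable negatives)."""
    n_pos, n_unl = X_pos.shape[0], X_unl.shape[0]

    # Step 1: treat every unlabeled sample as negative and score the unlabeled set.
    X1 = np.vstack([X_pos, X_unl])
    y1 = np.concatenate([np.ones(n_pos), np.zeros(n_unl)])
    s1 = _score(_fit(X1, y1), X_unl)

    # Keep the lowest-scoring fraction as "reliable negatives".
    n_rn = max(1, int(neg_fraction * n_unl))
    rn_idx = np.argsort(s1)[:n_rn]

    # Step 2: retrain on positives versus reliable negatives only.
    X2 = np.vstack([X_pos, X_unl[rn_idx]])
    y2 = np.concatenate([np.ones(n_pos), np.zeros(n_rn)])
    return _score(_fit(X2, y2), X_unl), rn_idx
```

The choice of `neg_fraction` trades off the purity of the reliable negative set against its size; in practice it would need tuning for the data at hand.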
Affiliation(s)
- Mehrad Ansari
- Department of Chemical Engineering, University of Rochester, Rochester, NY, 14627, USA
- Andrew D. White
- Department of Chemical Engineering, University of Rochester, Rochester, NY, 14627, USA
|
19
|
Wang R, Liang Y, Miao Z, Liu T. Bayesian analysis for imbalanced positive-unlabelled diagnosis codes in electronic health records. Ann Appl Stat 2023; 17:1220-1238. [PMID: 37152904 PMCID: PMC10156089 DOI: 10.1214/22-aoas1666] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/09/2023]
Abstract
With the increasing availability of electronic health records (EHR), health data analysts and researchers have made significant progress in developing predictive inference and algorithms. However, EHR data are notoriously noisy because of missing and inaccurate inputs, even though the information is abundant. One serious problem is that only a small portion of patients in the database have confirmatory diagnoses, while many other patients remain undiagnosed because they did not comply with the recommended examinations. This phenomenon leads to a so-called positive-unlabelled situation in which the labels are extremely imbalanced. In this paper, we propose a model-based approach to classify the unlabelled patients by using a Bayesian finite mixture model. We also discuss the label switching issue for the imbalanced data and propose a consensus Monte Carlo approach to address the imbalance issue and improve computational efficiency simultaneously. Simulation studies show that our proposed model-based approach outperforms existing positive-unlabelled learning algorithms. The proposed method is applied to the Cerner EHR for detecting diabetic retinopathy (DR) patients using laboratory measurements. With only 3% confirmatory diagnoses in the EHR database, we estimate the actual DR prevalence to be 25%, which coincides with reported findings in the medical literature.
Affiliation(s)
- Ru Wang
- Department of Statistics, Oklahoma State University
- Ye Liang
- Department of Statistics, Oklahoma State University
- Zhuqi Miao
- School of Business, State University of New York at New Paltz
- Tieming Liu
- School of Industrial Engineering and Management, Oklahoma State University
|
20
|
Wu X, Deng H, Wang Q, Lei L, Gao Y, Hao G. Meta-learning shows great potential in plant disease recognition under few available samples. The Plant Journal: For Cell and Molecular Biology 2023; 114:767-782. [PMID: 36883481 DOI: 10.1111/tpj.16176] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Received: 12/27/2022] [Revised: 02/15/2023] [Accepted: 02/23/2023] [Indexed: 05/27/2023]
Abstract
Plant diseases worsen the threat of food shortage as the global population grows, and disease recognition is the basis for the effective prevention and control of plant diseases. Deep learning has made significant breakthroughs in the field of plant disease recognition. Compared with traditional deep learning, meta-learning can still maintain more than 90% accuracy in disease recognition with small samples. However, there is no comprehensive review on the application of meta-learning in plant disease recognition. Here, we summarize the functions, advantages, and limitations of meta-learning research methods and their applications for plant disease recognition in few-data scenarios. Finally, we outline several research avenues for utilizing current and future meta-learning in plant science. This review may help plant science researchers obtain faster, more accurate, and more credible solutions through deep learning with fewer labeled samples.
Affiliation(s)
- Xue Wu
- National Key Laboratory of Green Pesticide, Key Laboratory of Green Pesticide and Agricultural Bioengineering, Ministry of Education, Center for Research and Development of Fine Chemicals, State Key Laboratory of Public Big Data, Guizhou University, Guiyang, 550025, Guizhou, China
- Hongyu Deng
- National Key Laboratory of Green Pesticide, Key Laboratory of Green Pesticide and Agricultural Bioengineering, Ministry of Education, Center for Research and Development of Fine Chemicals, State Key Laboratory of Public Big Data, Guizhou University, Guiyang, 550025, Guizhou, China
- Qi Wang
- National Key Laboratory of Green Pesticide, Key Laboratory of Green Pesticide and Agricultural Bioengineering, Ministry of Education, Center for Research and Development of Fine Chemicals, State Key Laboratory of Public Big Data, Guizhou University, Guiyang, 550025, Guizhou, China
- Liang Lei
- School of Physics & Optoelectronic Engineering, Guangdong University of Technology, Guangzhou, 550000, Guangzhou, China
- Yangyang Gao
- National Key Laboratory of Green Pesticide, Key Laboratory of Green Pesticide and Agricultural Bioengineering, Ministry of Education, Center for Research and Development of Fine Chemicals, State Key Laboratory of Public Big Data, Guizhou University, Guiyang, 550025, Guizhou, China
- Gefei Hao
- National Key Laboratory of Green Pesticide, Key Laboratory of Green Pesticide and Agricultural Bioengineering, Ministry of Education, Center for Research and Development of Fine Chemicals, State Key Laboratory of Public Big Data, Guizhou University, Guiyang, 550025, Guizhou, China
Collapse
|
21
|
Tian Z, Yu Y, Fang H, Xie W, Guo M. Predicting microbe-drug associations with structure-enhanced contrastive learning and self-paced negative sampling strategy. Brief Bioinform 2023; 24:7009077. [PMID: 36715986 DOI: 10.1093/bib/bbac634] [Citation(s) in RCA: 18] [Impact Index Per Article: 9.0] [Received: 11/08/2022] [Revised: 12/19/2022] [Accepted: 12/29/2022] [Indexed: 01/31/2023] Open
Abstract
MOTIVATION: Predicting the associations between human microbes and drugs (MDAs) is a critical step in drug development and precision medicine. Since discovering these associations through wet-lab experiments is time-consuming and labor-intensive, computational methods have become an effective way to tackle this problem. Recently, graph contrastive learning (GCL) approaches have shown great advantages in learning the embeddings of nodes from heterogeneous biological graphs (HBGs). However, most GCL-based approaches do not fully capture the rich structural information in HBGs. In addition, few MDA prediction methods can screen out the most informative negative samples for effectively training the classifier. Therefore, the accuracy of MDA predictions still needs to be improved. RESULTS: In this study, we propose a novel approach that employs Structure-enhanced Contrastive learning and a Self-paced negative sampling strategy for Microbe-Drug Association prediction (SCSMDA). Firstly, SCSMDA constructs the similarity networks of microbes and drugs, as well as their different meta-path-induced networks. Then SCSMDA uses the representations of microbes and drugs learned from the meta-path-induced networks to enhance their embeddings learned from the similarity networks through the contrastive learning strategy, which yields more discriminative representations. After that, we adopt the self-paced negative sampling strategy to select the most informative negative samples to train the MLP classifier. Lastly, SCSMDA predicts the potential microbe-drug associations with the trained MLP classifier. Extensive results on three public datasets indicate that SCSMDA significantly outperforms other baseline methods on the MDA prediction task. Case studies for two common drugs further demonstrate the effectiveness of SCSMDA in finding novel MDAs. AVAILABILITY: The source code is publicly available on GitHub at https://github.com/Yue-Yuu/SCSMDA-master.
Affiliation(s)
- Zhen Tian
- School of Computer and Artificial Intelligence, Zhengzhou University, Zhengzhou 450000, China
- Yue Yu
- School of Computer and Artificial Intelligence, Zhengzhou University, Zhengzhou 450000, China
- Haichuan Fang
- School of Computer and Artificial Intelligence, Zhengzhou University, Zhengzhou 450000, China
- Weixin Xie
- Institute of Intelligent System and Bioinformatics, College of Intelligent Systems Science and Engineering, Harbin Engineering University, Harbin, 150000, China
- Maozu Guo
- School of Electrical and Information Engineering, Beijing University of Civil Engineering and Architecture, 100044, Beijing, China
|
22
|
Wang H, Han J, Li H, Duan L, Liu Z, Cheng H. CDA-SKAG: Predicting circRNA-disease associations using similarity kernel fusion and an attention-enhancing graph autoencoder. Mathematical Biosciences and Engineering: MBE 2023; 20:7957-7980. [PMID: 37161181 DOI: 10.3934/mbe.2023345] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/11/2023]
Abstract
Circular RNAs (circRNAs) constitute a category of circular non-coding RNA molecules whose abnormal expression is closely associated with the development of diseases. As biological data become abundant, many computational prediction models have been used for circRNA-disease association prediction. However, existing prediction models ignore the non-linear information of circRNAs and diseases when fusing multi-source similarities. In addition, these models fail to take full advantage of the vital feature information of high-similarity neighbor nodes when extracting features of circRNAs or diseases. In this paper, we propose a deep learning model, CDA-SKAG, which introduces a similarity kernel fusion algorithm to integrate multi-source similarity matrices, capturing the non-linear information of circRNAs and diseases, and constructs a circRNA information space and a disease information space. The model embeds an attention-enhancing layer in the graph autoencoder to enhance the associations between nodes with higher similarity. A cost-sensitive neural network is introduced to address the imbalance between positive and negative samples, thereby improving the model's generalization capability. The experimental results show that CDA-SKAG outperformed existing circRNA-disease association prediction models. The results of the case studies on lung and cervical cancer suggest that CDA-SKAG can be used as an effective tool to assist in predicting circRNA-disease associations.
Affiliation(s)
- Huiqing Wang
- College of Information and Computer, Taiyuan University of Technology, Taiyuan 030024, China
- Jiale Han
- College of Information and Computer, Taiyuan University of Technology, Taiyuan 030024, China
- Haolin Li
- College of Information and Computer, Taiyuan University of Technology, Taiyuan 030024, China
- Liguo Duan
- College of Information and Computer, Taiyuan University of Technology, Taiyuan 030024, China
- Zhihao Liu
- College of Information and Computer, Taiyuan University of Technology, Taiyuan 030024, China
- Hao Cheng
- College of Information and Computer, Taiyuan University of Technology, Taiyuan 030024, China
Collapse
|
23
|
Stolfi P, Mastropietro A, Pasculli G, Tieri P, Vergni D. NIAPU: network-informed adaptive positive-unlabeled learning for disease gene identification. Bioinformatics 2023; 39:7023926. [PMID: 36727493 PMCID: PMC9933847 DOI: 10.1093/bioinformatics/btac848] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 06/01/2022] [Revised: 12/23/2022] [Indexed: 02/03/2023] Open
Abstract
MOTIVATION: Gene-disease associations are fundamental for understanding disease etiology and developing effective interventions and treatments. Identifying genes not yet associated with a disease due to a lack of studies is a challenging task in which prioritization based on prior knowledge is an important element. The computational search for new candidate disease genes may be eased by positive-unlabeled learning, the machine learning (ML) setting in which only a subset of instances are labeled as positive while the rest of the dataset is unlabeled. In this work, we propose a set of effective network-based features to be used in a novel Markov diffusion-based multi-class labeling strategy for putative disease gene discovery. RESULTS: The performance of the new labeling algorithm and the effectiveness of the proposed features were tested on 10 different disease datasets using three ML algorithms. The new features were compared against classical topological and functional/ontological features and a set of network- and biological-derived features already used in gene discovery tasks. The predictive power of the integrated methodology in searching for new disease genes was found to be competitive with state-of-the-art algorithms. AVAILABILITY AND IMPLEMENTATION: The source code of NIAPU can be accessed at https://github.com/AndMastro/NIAPU. The source data used in this study are available online on the respective websites. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Paola Stolfi
- Institute for Applied Computing (IAC) 'Mauro Picone', National Research Council of Italy (CNR), Rome 00185, Italy
- Andrea Mastropietro
- Department of Computer, Control and Management Engineering (DIAG) 'Antonio Ruberti', Sapienza University of Rome, Rome 00185, Italy
- Giuseppe Pasculli
- Department of Computer, Control and Management Engineering (DIAG) 'Antonio Ruberti', Sapienza University of Rome, Rome 00185, Italy
- Paolo Tieri
- Institute for Applied Computing (IAC) 'Mauro Picone', National Research Council of Italy (CNR), Rome 00185, Italy
- Davide Vergni
- Institute for Applied Computing (IAC) 'Mauro Picone', National Research Council of Italy (CNR), Rome 00185, Italy
|
24
|
He Y, Li X, Zhang M, Fournier‐Viger P, Huang JZ, Salloum S. A novel observation points‐based positive‐unlabeled learning algorithm. CAAI Transactions on Intelligence Technology 2023. [DOI: 10.1049/cit2.12152] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/05/2023] Open
Affiliation(s)
- Yulin He
- Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen, China
- College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, China
- Xu Li
- College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, China
- Manjing Zhang
- Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen, China
- Joshua Zhexue Huang
- Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen, China
- College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, China
- Salman Salloum
- School of Computing, National University of Singapore, Singapore, Singapore
|
25
|
Sidorczuk K, Gagat P, Pietluch F, Kała J, Rafacz D, Bąkała L, Słowik J, Kolenda R, Rödiger S, Fingerhut LCHW, Cooke IR, Mackiewicz P, Burdukiewicz M. Benchmarks in antimicrobial peptide prediction are biased due to the selection of negative data. Brief Bioinform 2022; 23:6672903. [PMID: 35988923 PMCID: PMC9487607 DOI: 10.1093/bib/bbac343] [Citation(s) in RCA: 31] [Impact Index Per Article: 10.3] [Received: 05/30/2022] [Revised: 07/07/2022] [Accepted: 07/25/2022] [Indexed: 12/29/2022] Open
Abstract
Antimicrobial peptides (AMPs) are a heterogeneous group of short polypeptides that target not only microorganisms but also viruses and cancer cells. Due to their lower selection for resistance compared with traditional antibiotics, AMPs have been attracting ever-growing attention from researchers, including bioinformaticians. Machine learning represents the most cost-effective method for novel AMP discovery and consequently many computational tools for AMP prediction have been recently developed. In this article, we investigate the impact of negative data sampling on model performance and benchmarking. We generated 660 predictive models using 12 machine learning architectures, a single positive data set and 11 negative data sampling methods; the architectures and methods were defined on the basis of published AMP prediction software. Our results clearly indicate that similar training and benchmark data sets, i.e. those produced by the same or a similar negative data sampling method, positively affect model performance. Consequently, all the benchmark analyses that have been performed for AMP prediction models are significantly biased and, moreover, we do not know which model is the most accurate. To provide researchers with reliable information about the performance of AMP predictors, we also created a web server, AMPBenchmark, for fair model benchmarking. AMPBenchmark is available at http://BioGenies.info/AMPBenchmark.
Affiliation(s)
- Jakub Kała
- Warsaw University of Technology, Faculty of Mathematics and Information Science, Poland
- Dominik Rafacz
- Warsaw University of Technology, Faculty of Mathematics and Information Science, Poland
- Laura Bąkała
- Warsaw University of Technology, Faculty of Mathematics and Information Science, Poland
- Jadwiga Słowik
- Warsaw University of Technology, Faculty of Mathematics and Information Science, Poland
- Rafał Kolenda
- Quadram Institute Biosciences, Norwich Research Park, Norwich, United Kingdom, Wrocław University of Environmental and Life Sciences, Faculty of Veterinary Medicine, Poland
- Stefan Rödiger
- Brandenburg University of Technology Cottbus-Senftenberg, Faculty of Natural Sciences, Germany
- Legana C H W Fingerhut
- Department of Molecular and Cell Biology, Centre for Tropical Bioinformatics and Molecular Biology, James Cook University, Australia
- Ira R Cooke
- Department of Molecular and Cell Biology, Centre for Tropical Bioinformatics and Molecular Biology, James Cook University, Australia
26
Arpi MNT, Simpson TI. SFARI genes and where to find them; modelling Autism Spectrum Disorder specific gene expression dysregulation with RNA-seq data. Sci Rep 2022; 12:10158. [PMID: 35710789 PMCID: PMC9203566 DOI: 10.1038/s41598-022-14077-1] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 12/12/2021] [Accepted: 06/01/2022] [Indexed: 11/09/2022] Open
Abstract
Autism Spectrum Disorders (ASD) have a strong, yet heterogeneous, genetic component. Among the various methods that are being developed to help reveal the underlying molecular aetiology of the disease, one approach that is gaining popularity is the combination of gene expression and clinical genetic data, often using the SFARI-gene database, which comprises lists of curated genes considered to have causative roles in ASD when mutated in patients. We build a gene co-expression network to study the relationship between ASD-specific transcriptomic data and SFARI genes and then analyse it at different levels of granularity. No significant evidence is found of association between SFARI genes and differential gene expression patterns when comparing ASD samples to a control group, nor of statistical enrichment of SFARI genes in gene co-expression network modules that have a strong correlation with ASD diagnosis. However, classification models that incorporate topological information from the whole ASD-specific gene co-expression network can predict novel SFARI candidate genes that share features of existing SFARI genes and have support for roles in ASD in the literature. A statistically significant association is also found between the absolute level of gene expression and SFARI genes and their scores, which can confound the analysis if uncorrected. We propose a novel approach to correct for this that is general enough to be applied to other problems affected by continuous sources of bias. It was found that only co-expression network analyses that integrate information from the whole network are able to reveal signatures linked to ASD diagnosis and novel candidate genes for the study of ASD, which individual gene or module analyses fail to do. It was also found that the influence of SFARI genes permeates not only other ASD scoring systems, but also lists of genes believed to be involved in other neurodevelopmental disorders.
Affiliation(s)
- T Ian Simpson
- School of Informatics, University of Edinburgh, 10 Crichton Street, Edinburgh, EH8 9AB, UK; Simons Initiative for the Developing Brain (SIDB), Centre for Brain Discovery Sciences, University of Edinburgh, Edinburgh, UK.
27
Weakly Supervised Anomaly Detection Based on Two-Step Cyclic Iterative PU Learning Strategy. Neural Process Lett 2022. [DOI: 10.1007/s11063-022-10815-8] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/26/2022]
28
Xiang Y, Luettich K, Martin F, Battey JND, Trivedi K, Neau L, Wong ET, Guedj E, Dulize R, Peric D, Bornand D, Ouadi S, Sierro N, Büttner A, Ivanov NV, Vanscheeuwijck P, Hoeng J, Peitsch MC. Discriminating Spontaneous From Cigarette Smoke and THS 2.2 Aerosol Exposure-Related Proliferative Lung Lesions in A/J Mice by Using Gene Expression and Mutation Spectrum Data. Frontiers in Toxicology 2022; 3:634035. [PMID: 35295134 PMCID: PMC8915865 DOI: 10.3389/ftox.2021.634035] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/26/2020] [Accepted: 02/19/2021] [Indexed: 11/25/2022] Open
Abstract
Mice, especially A/J mice, have been widely employed to elucidate the underlying mechanisms of lung tumor formation and progression and to derive human-relevant modes of action. Cigarette smoke (CS) exposure induces tumors in the lungs, but non-exposed A/J mice will also develop lung tumors spontaneously with age, which raises the question of how to discriminate CS-related lung tumors from spontaneous ones. The challenge is that spontaneous tumors are histologically indistinguishable from the tumors occurring in CS-exposed mice. We conducted an 18-month inhalation study in A/J mice to assess the impact of lifetime exposure to Tobacco Heating System (THS) 2.2 aerosol relative to exposure to 3R4F cigarette smoke (CS) on toxicity and carcinogenicity endpoints. To tackle the above challenge, a 13-gene signature was developed based on an independent A/J mouse CS exposure study, followed by the development of a one-class classifier based on the current study. Identifying the gene signature in one data set and building the classifier in another addresses feature/gene selection bias, which is a well-known problem in the literature. Applied to data from this study, this gene signature classifier distinguished tumors in CS-exposed animals from spontaneous tumors. Lung tumors from THS 2.2 aerosol-exposed mice were significantly different from those of CS-exposed mice but not from spontaneous tumors. The signature was also applied to human lung adenocarcinoma gene expression data (from The Cancer Genome Atlas) and discriminated cancers in never-smokers from those in ever-smokers, suggesting translatability of our signature genes from mice to humans. A possible application of this gene signature is to discriminate lung cancer patients who may benefit from specific treatments (i.e., EGFR tyrosine kinase inhibitors). Mutational spectra from a subset of samples were also utilized for tumor classification, yielding similar results.
“Landscaping” the molecular features of A/J mouse lung tumors highlighted, for the first time, a number of events that are also known to play a role in human lung tumorigenesis, such as Lrp1b mutation and Ros1 overexpression. This study shows that omics and computational tools provide useful means of tumor classification where histopathological evaluation alone may be insufficient to distinguish between age- and exposure-related lung tumors.
Affiliation(s)
- Yang Xiang
- Philip Morris International R&D, Philip Morris Products S.A., Neuchâtel, Switzerland
- Karsta Luettich
- Philip Morris International R&D, Philip Morris Products S.A., Neuchâtel, Switzerland
- Florian Martin
- Philip Morris International R&D, Philip Morris Products S.A., Neuchâtel, Switzerland
- James N D Battey
- Philip Morris International R&D, Philip Morris Products S.A., Neuchâtel, Switzerland
- Keyur Trivedi
- Philip Morris International R&D, Philip Morris Products S.A., Neuchâtel, Switzerland
- Laurent Neau
- Philip Morris International R&D, Philip Morris Products S.A., Neuchâtel, Switzerland
- Ee Tsin Wong
- Philip Morris International R&D, Philip Morris International Research Laboratories Pte. Ltd., Singapore, Singapore
- Emmanuel Guedj
- Philip Morris International R&D, Philip Morris Products S.A., Neuchâtel, Switzerland
- Remi Dulize
- Philip Morris International R&D, Philip Morris Products S.A., Neuchâtel, Switzerland
- Dariusz Peric
- Philip Morris International R&D, Philip Morris Products S.A., Neuchâtel, Switzerland
- David Bornand
- Philip Morris International R&D, Philip Morris Products S.A., Neuchâtel, Switzerland
- Sonia Ouadi
- Philip Morris International R&D, Philip Morris Products S.A., Neuchâtel, Switzerland
- Nicolas Sierro
- Philip Morris International R&D, Philip Morris Products S.A., Neuchâtel, Switzerland
- Nikolai V Ivanov
- Philip Morris International R&D, Philip Morris Products S.A., Neuchâtel, Switzerland
- Julia Hoeng
- Philip Morris International R&D, Philip Morris Products S.A., Neuchâtel, Switzerland
- Manuel C Peitsch
- Philip Morris International R&D, Philip Morris Products S.A., Neuchâtel, Switzerland
29
Ali SD, Tayara H, Chong KT. Identification of piRNA disease associations using deep learning. Comput Struct Biotechnol J 2022; 20:1208-1217. [PMID: 35317234 PMCID: PMC8908038 DOI: 10.1016/j.csbj.2022.02.026] [Citation(s) in RCA: 19] [Impact Index Per Article: 6.3] [Received: 12/02/2021] [Revised: 02/24/2022] [Accepted: 02/26/2022] [Indexed: 01/09/2023] Open
Abstract
Piwi-interacting RNAs (piRNAs) play a pivotal role in maintaining genome integrity by repression of transposable elements, gene stability, and association with various disease progressions. Cost-efficient computational methods for the identification of piRNA disease associations promote the efficacy of disease-specific drug development. In this regard, we developed a simple, robust, and efficient deep learning method, known as piRDA, for identifying piRNA disease associations. The proposed architecture extracts the most significant and abstract information from raw sequences represented in a simplified piRNA disease pair without any feature engineering. Two-step positive unlabeled learning and a bootstrapping technique are used to avoid the false-negative and biased predictions that arise when dealing with positive unlabeled data. The performance of piRDA is evaluated using k-fold cross-validation. piRDA improves significantly on the state-of-the-art method in all performance evaluation measures for the identification of piRNA disease associations. The proposed computational method could therefore play a significant role as a supportive and practical tool for primitive disease mechanisms and pharmaceutical research, such as in academia and drug design. The proposed model can be accessed through a publicly available and user-friendly web tool at http://nsclbio.jbnu.ac.kr/tools/piRDA/.
Affiliation(s)
- Syed Danish Ali
- Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju 54896, South Korea
- The University of Azad Jammu and Kashmir, Muzaffarabad 13100, Pakistan
- Hilal Tayara
- School of International Engineering and Science, Jeonbuk National University, Jeonju 54896, South Korea
- Kil To Chong
- Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju 54896, South Korea
- Advanced Electronics and Information Research Center, Jeonbuk National University, Jeonju 54896, South Korea
30
Park H, Kang Y, Choe W, Kim J. Mining Insights on Metal-Organic Framework Synthesis from Scientific Literature Texts. J Chem Inf Model 2022; 62:1190-1198. [PMID: 35195419 DOI: 10.1021/acs.jcim.1c01297] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Indexed: 11/29/2022]
Abstract
Identifying optimal synthesis conditions for metal-organic frameworks (MOFs) is a major challenge that can serve as a bottleneck for new materials discovery and development. A trial-and-error approach that relies on a chemist's intuition and knowledge has limitations in efficiency due to the large MOF synthesis space. To this end, 46,701 MOFs were data mined using our in-house developed code to extract their synthesis information from 28,565 MOF papers. The joint machine-learning/rule-based algorithm yields an average F1 score of 90.3% across different synthesis parameters (i.e., metal precursors, organic precursors, solvents, temperature, time, and composition). From this data set, a positive-unlabeled learning algorithm was developed to predict the synthesis of a given MOF material using synthesis conditions as inputs, and this algorithm correctly predicted successful synthesis for 83.1% of the synthesized data in the test set. Finally, our model correctly predicted three amorphous MOFs (with their representative experimental synthesis conditions) as having low synthesizability scores, while the counterpart crystalline MOFs showed high synthesizability scores. Our results show that big data extracted from the texts of MOF papers can be used to rationally predict synthesis conditions for these materials, which can increase the speed at which new MOFs are synthesized.
Affiliation(s)
- Hyunsoo Park
- Department of Chemical and Biomolecular Engineering, Korea Advanced Institute of Science and Technology (KAIST), 291, Daehak-ro, Yuseong-gu, Daejeon 34141, Republic of Korea
- Yeonghun Kang
- Department of Chemical and Biomolecular Engineering, Korea Advanced Institute of Science and Technology (KAIST), 291, Daehak-ro, Yuseong-gu, Daejeon 34141, Republic of Korea
- Wonyoung Choe
- Department of Chemistry, Ulsan National Institute of Science and Technology (UNIST), 50, UNIST-gil, Eonyang-eup, Ulju-gun, Ulsan 44919, Republic of Korea
- Jihan Kim
- Department of Chemical and Biomolecular Engineering, Korea Advanced Institute of Science and Technology (KAIST), 291, Daehak-ro, Yuseong-gu, Daejeon 34141, Republic of Korea
31
Machine learning prediction and tau-based screening identifies potential Alzheimer's disease genes relevant to immunity. Commun Biol 2022; 5:125. [PMID: 35149761 PMCID: PMC8837797 DOI: 10.1038/s42003-022-03068-7] [Citation(s) in RCA: 20] [Impact Index Per Article: 6.7] [Received: 11/06/2020] [Accepted: 01/21/2022] [Indexed: 12/19/2022] Open
Abstract
With increased research funding for Alzheimer's disease (AD) and related disorders across the globe, large amounts of data are being generated. Several studies employed machine learning methods to understand the ever-growing omics data to enhance early diagnosis, map complex disease networks, or uncover potential drug targets. We describe results based on a Target Central Resource Database protein knowledge graph and evidence paths transformed into vectors by metapath matching. We extracted features between specific genes and diseases, then trained and optimized our model using XGBoost, termed MPxgb(AD). To determine our MPxgb(AD) prediction performance, we examined the top twenty predicted genes through an experimental screening pipeline. Our analysis identified potential AD risk genes: FRRS1, CTRAM, SCGB3A1, FAM92B/CIBAR2, and TMEFF2. FRRS1 and FAM92B are considered dark genes, while CTRAM, SCGB3A1, and TMEFF2 are connected to TREM2-TYROBP, IL-1β-TNFα, and MTOR-APP AD-risk nodes, suggesting relevance to the pathogenesis of AD.
32
Li F, Dong S, Leier A, Han M, Guo X, Xu J, Wang X, Pan S, Jia C, Zhang Y, Webb GI, Coin LJM, Li C, Song J. Positive-unlabeled learning in bioinformatics and computational biology: a brief review. Brief Bioinform 2021; 23:6415313. [PMID: 34729589 DOI: 10.1093/bib/bbab461] [Citation(s) in RCA: 33] [Impact Index Per Article: 8.3] [Received: 08/30/2021] [Revised: 09/27/2021] [Accepted: 10/07/2021] [Indexed: 12/14/2022] Open
Abstract
Conventional supervised binary classification algorithms have been widely applied to address significant research questions using biological and biomedical data. This classification scheme requires two fully labeled classes of data (e.g. positive and negative samples) to train a classification model. However, in many bioinformatics applications, labeling data is laborious, and the negative samples might be potentially mislabeled due to the limited sensitivity of the experimental equipment. The positive unlabeled (PU) learning scheme was therefore proposed to enable the classifier to learn directly from limited positive samples and a large number of unlabeled samples (i.e. a mixture of positive or negative samples). To date, several PU learning algorithms have been developed to address various biological questions, such as sequence identification, functional site characterization and interaction prediction. In this paper, we revisit a collection of 29 state-of-the-art PU learning bioinformatic applications to address various biological questions. Various important aspects are extensively discussed, including PU learning methodology, biological application, classifier design and evaluation strategy. We also comment on the existing issues of PU learning and offer our perspectives for the future development of PU learning applications. We anticipate that our work serves as an instrumental guideline for a better understanding of the PU learning framework in bioinformatics and further developing next-generation PU learning frameworks for critical biological applications.
Affiliation(s)
- Fuyi Li
- Monash University, Australia
- André Leier
- Department of Genetics, UAB School of Medicine, USA
- Meiya Han
- Department of Biochemistry and Molecular Biology, Monash University, Australia
- Jing Xu
- Computer Science and Technology from Nankai University, China
- Xiaoyu Wang
- Department of Biochemistry and Molecular Biology and Biomedicine Discovery Institute, Monash University, Australia
- Shirui Pan
- University of Technology Sydney (UTS), Ultimo, NSW, Australia
- Cangzhi Jia
- College of Science, Dalian Maritime University, Australia
- Yang Zhang
- Northwestern Polytechnical University, China
- Geoffrey I Webb
- Faculty of Information Technology at Monash University, Australia
- Lachlan J M Coin
- Department of Clinical Pathology, University of Melbourne, Australia
- Chen Li
- Biomedicine Discovery Institute and Department of Biochemistry of Molecular Biology, Monash University, Australia
- Jiangning Song
- Monash Biomedicine Discovery Institute, Monash University, Melbourne, Australia
33
Yang H, Ding Y, Tang J, Guo F. Identifying potential association on gene-disease network via dual hypergraph regularized least squares. BMC Genomics 2021; 22:605. [PMID: 34372777 PMCID: PMC8351363 DOI: 10.1186/s12864-021-07864-z] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 04/09/2021] [Accepted: 06/29/2021] [Indexed: 12/27/2022] Open
Abstract
BACKGROUND: Identifying potential associations between genes and diseases via biomedical experiments is time-consuming and expensive. Computational technologies based on machine learning models have been widely used to explore genetic information related to complex diseases. Importantly, gene-disease association detection can be defined as a link prediction problem in a bipartite network. However, many existing methods do not utilize multiple sources of biological information; additionally, they do not extract higher-order relationships among genes and diseases. RESULTS: In this study, we propose a novel method called Dual Hypergraph Regularized Least Squares (DHRLS) with Centered Kernel Alignment-based Multiple Kernel Learning (CKA-MKL), in order to detect all potential gene-disease associations. First, we construct multiple kernels based on various biological data sources in the gene and disease spaces respectively. After that, we use CKA-MKL to obtain the optimal kernels in the two spaces respectively. Specifically, hypergraphs are employed to establish higher-order relationships. Finally, our DHRLS model is solved by the Alternating Least Squares algorithm (ALSA) to predict gene-disease associations. CONCLUSION: Compared with many outstanding prediction tools, DHRLS achieves the best performance on the gene-disease association network under two types of cross validation. To verify robustness, we show that our proposed approach has excellent prediction performance on six real-world networks. Our research work can effectively discover potential disease-associated genes and provide guidance for the follow-up verification methods of complex diseases.
Affiliation(s)
- Hongpeng Yang
- School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin, China
- Yijie Ding
- Yangtze Delta Region Institute, University of Electronic Science and Technology of China, Quzhou, China
- Jijun Tang
- Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
- Fei Guo
- School of Computer Science and Engineering, Central South University, Changsha, China
34
A Two-Step Classification Method Based on Collaborative Representation for Positive and Unlabeled Learning. Neural Process Lett 2021. [DOI: 10.1007/s11063-021-10590-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/20/2022]
35
Mu H, Sun R, Yuan G, Shi G. Positive unlabeled learning-based anomaly detection in videos. Int J Intell Syst 2021. [DOI: 10.1002/int.22437] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Indexed: 11/10/2022]
Affiliation(s)
- Huiyu Mu
- College of Information and Electrical Engineering, China Agricultural University, Beijing, China
- Ruizhi Sun
- College of Information and Electrical Engineering, China Agricultural University, Beijing, China
- Scientific Research Base for Integrated Technologies of Precision Agriculture (Animal Husbandry), Ministry of Agriculture, Beijing, China
- Gang Yuan
- College of Information and Electrical Engineering, China Agricultural University, Beijing, China
- Guoqing Shi
- College of Information and Electrical Engineering, China Agricultural University, Beijing, China
36
Abstract
Parkinson’s disease (PD) gene identification plays an important role in improving the diagnosis and treatment of the disease. A number of machine learning methods have been proposed to identify disease-related genes, but only a few of these methods have been adopted for PD. This work puts forth a novel neural network-based ensemble (n-semble) method to identify Parkinson’s disease genes. The artificial neural network is trained in a unique way to ensemble the multiple model predictions. The proposed n-semble method is composed of four parts: (1) protein sequences are used to construct feature vectors using physicochemical properties of amino acids; (2) dimensionality reduction is achieved using the t-Distributed Stochastic Neighbor Embedding (t-SNE) method; (3) the Jaccard method is applied to find likely negative samples from unknown (candidate) genes; and (4) gene prediction is performed with the n-semble method. The proposed n-semble method has been compared with Smalter’s, ProDiGe, PUDI and EPU methods using various evaluation metrics. The proposed n-semble method outperforms these existing gene identification methods and achieves significantly higher precision, recall and F score of 88.9%, 90.9% and 89.8%, respectively. The obtained results confirm the effectiveness and validity of the proposed framework.
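As a rough illustration of step (3), the sketch below selects likely negatives from unlabeled genes by their Jaccard similarity to known positives. The set-valued feature representation, the gene and feature names, the `likely_negatives` helper and the 0.2 threshold are all assumptions made for this example; the abstract does not say how the authors apply the Jaccard method.

```python
def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| between two feature sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def likely_negatives(positives, unlabeled, threshold=0.2):
    """Return unlabeled names whose best similarity to any positive is below threshold."""
    selected = []
    for name, features in unlabeled.items():
        best = max(jaccard(features, p) for p in positives.values())
        if best < threshold:
            selected.append(name)
    return selected


# Hypothetical toy data: each gene is described by a set of discrete features.
positives = {"geneA": {"f1", "f2", "f3"}, "geneB": {"f2", "f3", "f4"}}
unlabeled = {
    "geneX": {"f1", "f2", "f3", "f4"},  # overlaps heavily with the positives
    "geneY": {"f8", "f9"},              # shares nothing with the positives
    "geneZ": {"f3", "f7", "f8", "f9"},  # small overlap only
}

print(likely_negatives(positives, unlabeled))
```

A stricter (lower) threshold keeps fewer candidates but makes it less likely that a hidden positive ends up in the negative set.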
37
Li Z, Hu L, Tang Z, Zhao C. Predicting HIV-1 Protease Cleavage Sites With Positive-Unlabeled Learning. Front Genet 2021; 12:658078. [PMID: 33868387 PMCID: PMC8044780 DOI: 10.3389/fgene.2021.658078] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 01/25/2021] [Accepted: 03/08/2021] [Indexed: 11/13/2022] Open
Abstract
Understanding the substrate specificity of HIV-1 protease plays an essential role in the prevention of HIV infection. A variety of computational models have thus been developed to predict substrate sites that are cleaved by HIV-1 protease, but most of them follow a supervised learning scheme, building classifiers by treating experimentally verified cleavable sites as positive samples and unknown sites as negative samples. However, the negative set can contain noise, as it may include false negative samples. Hence, the classifiers are not as accurate as they could be, owing to biased prediction results. In this work, unknown substrate sites are regarded as unlabeled samples instead of negative ones. We propose a novel positive-unlabeled learning algorithm, namely PU-HIV, for an effective prediction of HIV-1 protease cleavage sites. Features used by PU-HIV are encoded from different perspectives of substrate sequences, including amino acid identities, coevolutionary patterns and chemical properties. By adjusting the weights of errors generated by positive and unlabeled samples, a biased support vector machine classifier can be built to complete the prediction task. In comparison with state-of-the-art prediction models, benchmarking experiments using cross-validation and independent tests demonstrated the superior performance of PU-HIV in terms of AUC, PR-AUC, and F-measure. Thus, with PU-HIV, it is possible to identify previously unknown but physiologically existing substrate sites that can be cleaved by HIV-1 protease, providing valuable insights into designing novel HIV-1 protease inhibitors for HIV treatment.
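The error-weighting idea can be sketched as a biased-SVM-style objective in which unlabeled samples are treated as negatives but their hinge errors are penalized less than errors on labeled positives. This is a minimal sketch, not the PU-HIV implementation: the function name, the weights `c_pos` and `c_unl`, the linear model and the one-dimensional toy data are all assumptions made for illustration.

```python
def biased_hinge_objective(w, b, X, s, c_pos=10.0, c_unl=1.0):
    """0.5 * ||w||^2 plus hinge errors weighted by sample type.

    s[i] is 1 for a labeled positive and 0 for an unlabeled sample.
    Unlabeled samples get target -1 but a smaller error weight (c_unl),
    because some of them may be undiscovered positives.
    """
    regularizer = 0.5 * sum(wj * wj for wj in w)
    weighted_errors = 0.0
    for x, si in zip(X, s):
        y = 1.0 if si == 1 else -1.0
        score = sum(wj * xj for wj, xj in zip(w, x)) + b
        weight = c_pos if si == 1 else c_unl
        weighted_errors += weight * max(0.0, 1.0 - y * score)
    return regularizer + weighted_errors


# Toy data: the last point is unlabeled but sits among the positives.
X = [[2.0], [1.5], [-1.0], [1.8]]
s = [1, 1, 0, 0]
w, b = [1.0], 0.0

lenient = biased_hinge_objective(w, b, X, s)              # unlabeled errors are cheap
strict = biased_hinge_objective(w, b, X, s, c_unl=10.0)   # unlabeled treated like true negatives
print(lenient, strict)
```

With the smaller `c_unl`, the model pays little for scoring the suspicious unlabeled point as positive, so a minimizer of this objective is not forced to distort the decision boundary around probable hidden positives.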
Affiliation(s)
- Zhenfeng Li
- School of Computer Science and Technology, Wuhan University of Technology, Wuhan, China
- Lun Hu
- Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Ürümqi, China
- Zehai Tang
- School of Computer Science and Technology, Wuhan University of Technology, Wuhan, China
- Cheng Zhao
- School of Computer Science and Technology, Wuhan University of Technology, Wuhan, China
38
Luo P, Chen B, Liao B, Wu F. Predicting disease-associated genes: Computational methods, databases, and evaluations. WIREs Data Mining and Knowledge Discovery 2021; 11. [DOI: 10.1002/widm.1383] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 07/28/2019] [Accepted: 06/13/2020] [Indexed: 09/09/2024]
Abstract
Complex diseases are associated with a set of genes (called disease genes), the identification of which can help scientists uncover the mechanisms of diseases and develop new drugs and treatment strategies. Due to the huge cost and time of experimental identification techniques, many computational algorithms have been proposed to predict disease genes. Although several review publications in recent years have discussed many computational methods, some of them focus on cancer driver genes while others focus on biomolecular networks, which only cover a specific aspect of existing methods. In this review, we summarize existing methods and classify them into three categories based on their rationales. Then, the algorithms, biological data, and evaluation methods used in the computational prediction are discussed. Finally, we highlight the limitations of existing methods and point out some future directions for improving these algorithms. This review could help investigators understand the principles of existing methods, and thus develop new methods to advance the computational prediction of disease genes. This article is categorized under: Technologies > Machine Learning; Technologies > Prediction; Algorithmic Development > Biological Data Mining
Affiliation(s)
- Ping Luo
- Division of Biomedical Engineering, University of Saskatchewan, Saskatoon, Canada
- Princess Margaret Cancer Centre, University Health Network, Toronto, Canada
- Bolin Chen
- School of Computer Science and Technology, Northwestern Polytechnical University, China
- Bo Liao
- School of Mathematics and Statistics, Hainan Normal University, Haikou, China
- Fang-Xiang Wu
- Department of Mechanical Engineering and Department of Computer Science, University of Saskatchewan, Saskatoon, Canada
39
Gong C, Shi H, Liu T, Zhang C, Yang J, Tao D. Loss Decomposition and Centroid Estimation for Positive and Unlabeled Learning. IEEE Transactions on Pattern Analysis and Machine Intelligence 2021; 43:918-932. [PMID: 31535983 DOI: 10.1109/tpami.2019.2941684] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Indexed: 06/10/2023]
Abstract
This paper studies Positive and Unlabeled learning (PU learning), of which the target is to build a binary classifier where only positive data and unlabeled data are available for classifier training. To deal with the absence of negative training data, we first regard all unlabeled data as negative examples with false negative labels, and then convert PU learning into the risk minimization problem in the presence of such one-sided label noise. Specifically, we propose a novel PU learning algorithm dubbed "Loss Decomposition and Centroid Estimation" (LDCE). By decomposing the loss function of corrupted negative examples into two parts, we show that only the second part is affected by the noisy labels. Thereby, we may estimate the centroid of the corrupted negative set in an unbiased way to reduce the adverse impact of such label noise. Furthermore, we propose the "Kernelized LDCE" (KLDCE) by introducing the kernel trick, and show that KLDCE can be easily solved by combining Alternative Convex Search (ACS) and Sequential Minimal Optimization (SMO). Theoretically, we derive the generalization error bound which suggests that the generalization risk of our model converges to the empirical risk with the order of O(1/√k + 1/√(n-k) + 1/√n), where n and k are the amounts of training data and positive data, respectively. Experimentally, we conduct intensive experiments on synthetic datasets, UCI benchmark datasets and real-world datasets, and the results demonstrate that our approaches (LDCE and KLDCE) achieve top-level performance when compared with both classic and state-of-the-art PU learning methods.
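To get a feel for the stated convergence order, the sketch below evaluates the rate term 1/√k + 1/√(n-k) + 1/√n for given n and k. The helper name `ldce_rate` is hypothetical, and because the bound is an order-of-growth statement with unspecified constants, the values are only meaningful relative to each other.

```python
import math


def ldce_rate(n, k):
    """Rate term 1/sqrt(k) + 1/sqrt(n - k) + 1/sqrt(n) for n samples, k of them positive."""
    if not 0 < k < n:
        raise ValueError("need 0 < k < n so that both sample groups are non-empty")
    return 1 / math.sqrt(k) + 1 / math.sqrt(n - k) + 1 / math.sqrt(n)


# Growing n and k by the same factor of 100 shrinks the rate term by a factor of 10.
small = ldce_rate(100, 25)
large = ldce_rate(10_000, 2_500)
print(small, large, small / large)
```

The term is dominated by whichever group is smallest, so a very small positive set (small k) keeps the rate large even when n is huge.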
40
Gong C, Wang Q, Liu T, Han B, You JJ, Yang J, Tao D. Instance-Dependent Positive and Unlabeled Learning with Labeling Bias Estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence 2021; PP:1-1. [PMID: 33621169 DOI: 10.1109/tpami.2021.3061456] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 06/12/2023]
Abstract
This paper studies instance-dependent Positive and Unlabeled (PU) classification, in which whether a positive example is labeled (indicated by s) depends not only on the class label y but also on the observation x. Therefore, the labeling probability on positive examples is not uniform, as previous works assumed, but is biased toward some simple or critical data points. To depict this dependency, a graphical model is built in this paper, which leads to a maximization problem on the induced likelihood function regarding P(s,y|x). By utilizing the well-known EM and Adam optimization techniques, the labeling probability of any positive example P(s=1|y=1,x) as well as the classifier induced by P(y|x) can be acquired. Theoretically, we prove that the critical solution always exists and is locally unique for a linear model if some sufficient conditions are met. Moreover, we upper bound the generalization error for both linear logistic and non-linear network instantiations of our algorithm. Empirically, we compare our method with state-of-the-art instance-independent and instance-dependent PU algorithms on a wide range of synthetic, benchmark and real-world datasets, and the experimental results firmly demonstrate the advantage of the proposed method over the existing PU approaches.
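The dependency described in this abstract can be written out explicitly. The following is a short sketch, assuming the usual PU convention that negative examples are never labeled, i.e. P(s=1|y=0,x) = 0. The symbols e(x) and c are introduced here for illustration and do not come from the abstract.

```latex
\[
P(s=1 \mid x)
  = \underbrace{P(s=1 \mid y=1, x)}_{e(x)} \, P(y=1 \mid x)
  + \underbrace{P(s=1 \mid y=0, x)}_{=\,0} \, P(y=0 \mid x)
  = e(x)\, P(y=1 \mid x).
\]
```

Under the instance-independent assumption, e(x) reduces to a constant c, so P(s=1|x) is simply proportional to P(y=1|x). In the instance-dependent setting both factors vary with x, which is why the labeling probability and the classifier have to be estimated together.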
|
41
|
Ding Y, Lei X, Liao B, Wu FX. Machine learning approaches for predicting biomolecule-disease associations. Brief Funct Genomics 2021; 20:273-287. [PMID: 33554238 DOI: 10.1093/bfgp/elab002] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Indexed: 12/12/2022] Open
Abstract
Biomolecules, such as microRNAs, circRNAs, lncRNAs and genes, are functionally interdependent in human cells, and all play critical roles in diverse fundamental and vital biological processes. The dysregulations of such biomolecules can cause diseases. Identifying the associations between biomolecules and diseases can uncover the mechanisms of complex diseases, which is conducive to their diagnosis, treatment, prognosis and prevention. Due to the time consumption and cost of biologically experimental methods, many computational association prediction methods have been proposed in the past few years. In this study, we provide a comprehensive review of machine learning-based approaches for predicting disease-biomolecule associations with multi-view data sources. Firstly, we introduce some databases and general strategies for integrating multi-view data sources in the prediction models. Then we discuss several feature representation methods for machine learning-based prediction models. Thirdly, we comprehensively review machine learning-based prediction approaches in three categories: basic machine learning methods, matrix completion-based methods and deep learning-based methods, while discussing their advantages and disadvantages. Finally, we provide some perspectives for further improving biomolecule-disease prediction methods.
Affiliation(s)
- Yulian Ding: Division of Biomedical Engineering at the University of Saskatchewan
- Xiujuan Lei: School of Computer Science at Shaanxi Normal University
- Bo Liao: School of Mathematics and Statistics at Hainan Normal University, Haikou, China
- Fang-Xiang Wu: College of Engineering and the Department of Computer Science at University of Saskatchewan
|
42
|
Ata SK, Wu M, Fang Y, Ou-Yang L, Kwoh CK, Li XL. Recent advances in network-based methods for disease gene prediction. Brief Bioinform 2020; 22:6023077. [PMID: 33276376 DOI: 10.1093/bib/bbaa303] [Citation(s) in RCA: 41] [Impact Index Per Article: 8.2] [Received: 07/20/2020] [Revised: 09/29/2020] [Accepted: 10/10/2020] [Indexed: 01/28/2023] Open
Abstract
Disease-gene association through genome-wide association study (GWAS) is an arduous task for researchers. Investigating single nucleotide polymorphisms that correlate with specific diseases needs statistical analysis of associations. Considering the huge number of possible mutations, in addition to its high cost, another important drawback of GWAS analysis is the large number of false positives. Thus, researchers search for more evidence to cross-check their results through different sources. To provide the researchers with alternative and complementary low-cost disease-gene association evidence, computational approaches come into play. Since molecular networks are able to capture complex interplay among molecules in diseases, they become one of the most extensively used data for disease-gene association prediction. In this survey, we aim to provide a comprehensive and up-to-date review of network-based methods for disease gene prediction. We also conduct an empirical analysis on 14 state-of-the-art methods. To summarize, we first elucidate the task definition for disease gene prediction. Secondly, we categorize existing network-based efforts into network diffusion methods, traditional machine learning methods with handcrafted graph features and graph representation learning methods. Thirdly, an empirical analysis is conducted to evaluate the performance of the selected methods across seven diseases. We also provide distinguishing findings about the discussed methods based on our empirical analysis. Finally, we highlight potential research directions for future studies on disease gene prediction.
Affiliation(s)
- Sezin Kircali Ata: School of Computer Science and Engineering, Nanyang Technological University (NTU)
- Min Wu: Institute for Infocomm Research (I2R), A*STAR, Singapore
- Yuan Fang: School of Information Systems, Singapore Management University, Singapore
- Le Ou-Yang: College of Electronics and Information Engineering, Shenzhen University, Shenzhen, China
- Xiao-Li Li: Department head and principal scientist at I2R, A*STAR, Singapore
|
43
|
Makrodimitris S, van Ham RCHJ, Reinders MJT. Automatic Gene Function Prediction in the 2020's. Genes (Basel) 2020; 11:E1264. [PMID: 33120976 PMCID: PMC7692357 DOI: 10.3390/genes11111264] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.8] [Received: 09/28/2020] [Revised: 10/19/2020] [Accepted: 10/21/2020] [Indexed: 02/06/2023] Open
Abstract
The current rate at which new DNA and protein sequences are being generated is too fast to experimentally discover the functions of those sequences, emphasizing the need for accurate Automatic Function Prediction (AFP) methods. AFP has been an active and growing research field for decades and has made considerable progress in that time. However, it is certainly not solved. In this paper, we describe challenges that the AFP field still has to overcome in the future to increase its applicability. The challenges we consider are how to: (1) include condition-specific functional annotation, (2) predict functions for non-model species, (3) include new informative data sources, (4) deal with the biases of Gene Ontology (GO) annotations, and (5) maximally exploit the GO to obtain performance gains. We also provide recommendations for addressing those challenges, by adapting (1) the way we represent proteins and genes, (2) the way we represent gene functions, and (3) the algorithms that perform the prediction from gene to function. Together, we show that AFP is still a vibrant research area that can benefit from continuing advances in machine learning with which AFP in the 2020s can again take a large step forward reinforcing the power of computational biology.
Affiliation(s)
- Stavros Makrodimitris: Delft Bioinformatics Lab, Delft University of Technology, 2628XE Delft, The Netherlands; Keygene N.V., 6708PW Wageningen, The Netherlands
- Roeland C. H. J. van Ham: Delft Bioinformatics Lab, Delft University of Technology, 2628XE Delft, The Netherlands; Keygene N.V., 6708PW Wageningen, The Netherlands
- Marcel J. T. Reinders: Delft Bioinformatics Lab, Delft University of Technology, 2628XE Delft, The Netherlands; Leiden Computational Biology Center, Leiden University Medical Center, 2333ZC Leiden, The Netherlands
|
44
|
Ju Z, Wang SY. Computational Identification of Lysine Glutarylation Sites Using Positive-Unlabeled Learning. Curr Genomics 2020; 21:204-211. [PMID: 33071614 PMCID: PMC7521029 DOI: 10.2174/1389202921666200511072327] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Received: 12/25/2019] [Revised: 04/12/2020] [Accepted: 04/13/2020] [Indexed: 12/27/2022] Open
Abstract
Background
As a new type of protein acylation modification, lysine glutarylation has been found to play a crucial role in metabolic processes and mitochondrial functions. To further explore the biological mechanisms and functions of glutarylation, it is significant to predict the potential glutarylation sites. In the existing glutarylation site predictors, experimentally verified glutarylation sites are treated as positive samples and non-verified lysine sites as the negative samples to train predictors. However, the non-verified lysine sites may contain some glutarylation sites which have not been experimentally identified yet.
Methods
In this study, experimentally verified glutarylation sites are treated as the positive samples, whereas the remaining non-verified lysine sites are treated as unlabeled samples. A bioinformatics tool named PUL-GLU was developed to identify glutarylation sites using a positive-unlabeled learning algorithm.
Results
Experimental results show that PUL-GLU significantly outperforms the current glutarylation site predictors. Therefore, PUL-GLU can be a powerful tool for accurate identification of protein glutarylation sites.
Conclusion
A user-friendly web-server for PUL-GLU is available at http://bioinform.cn/pul_glu/.
Affiliation(s)
- Zhe Ju: College of Science, Shenyang Aerospace University, Shenyang 110136, P.R. China
- Shi-Yun Wang: College of Science, Shenyang Aerospace University, Shenyang 110136, P.R. China
|
45
|
Zhang Y, Qiu Y, Cui Y, Liu S, Zhang W. Predicting drug-drug interactions using multi-modal deep auto-encoders based network embedding and positive-unlabeled learning. Methods 2020; 179:37-46. [DOI: 10.1016/j.ymeth.2020.05.007] [Citation(s) in RCA: 21] [Impact Index Per Article: 4.2] [Received: 02/09/2020] [Revised: 05/06/2020] [Accepted: 05/13/2020] [Indexed: 12/21/2022] Open
|
46
|
Lan C, Chandrasekaran SN, Huan J. On the Unreported-Profile-is-Negative Assumption for Predictive Cheminformatics. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2020; 17:1352-1363. [PMID: 31056508 DOI: 10.1109/tcbb.2019.2913855] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 06/09/2023]
Abstract
In cheminformatics, compound-target binding profiles have been a main source of data for research. For data repositories that only provide positive profiles, a popular assumption is that unreported profiles are all negative. In this paper, we caution the audience not to take this assumption for granted, and present empirical evidence of its ineffectiveness from a machine learning perspective. Our examination is based on a setting where binding profiles are used as features to train predictive models; we show (1) prediction performance degrades when the assumption fails and (2) explicit recovery of unreported profiles improves prediction performance. In particular, we propose a framework that jointly recovers profiles and learns a predictive model, and show that it achieves further performance improvement. The presented study not only suggests applying matrix recovery methods to recover unreported profiles, but also initiates a new missing feature problem which we call Learning with Positive and Unknown Features.
|
47
|
Le DH. Machine learning-based approaches for disease gene prediction. Brief Funct Genomics 2020; 19:350-363. [PMID: 32567652 DOI: 10.1093/bfgp/elaa013] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.0] [Received: 02/25/2020] [Revised: 04/30/2020] [Accepted: 05/09/2020] [Indexed: 12/20/2022] Open
Abstract
Disease gene prediction is an essential issue in biomedical research. In the early days, annotation-based approaches were proposed for this problem. With the development of high-throughput technologies, interaction data between genes/proteins have grown quickly and now cover almost the entire genome and proteome; thus, network-based methods for the problem have become prominent. In parallel, machine learning techniques, which formulate the problem as a classification task, have also been proposed. Here, we first present a roadmap of the machine learning-based methods for disease gene prediction. In the beginning, the problem was usually approached using binary classification, where the positive and negative training sample sets comprise disease genes and non-disease genes, respectively. The disease genes are those known to be associated with diseases, whereas non-disease genes were randomly selected from those not yet known to be associated with diseases. However, the latter may contain unknown disease genes. To overcome this uncertainty in defining the non-disease genes, more realistic approaches have been proposed for the problem, such as unary and semi-supervised classification. Recently, more advanced methods, including ensemble learning, matrix factorization and deep learning, have been proposed for the problem. Second, 12 representative machine learning-based methods for disease gene prediction were examined and compared in terms of prediction performance and running time. Finally, their advantages, disadvantages, interpretability and trust were also analyzed and discussed.
Affiliation(s)
- Duc-Hau Le: Department of Computational Biomedicine, Vingroup Big Data Institute, Hanoi, Vietnam
|
48
|
Wang CC, Zhao Y, Chen X. Drug-pathway association prediction: from experimental results to computational models. Brief Bioinform 2020; 22:5835554. [PMID: 32393976 DOI: 10.1093/bib/bbaa061] [Citation(s) in RCA: 33] [Impact Index Per Article: 6.6] [Received: 02/13/2020] [Revised: 03/16/2020] [Accepted: 03/26/2020] [Indexed: 12/14/2022] Open
Abstract
Effective drugs are urgently needed to overcome complex human diseases. However, the research and development of a novel drug takes a long time and is costly. Traditional drug discovery follows the rule of one drug-one target, while some studies have demonstrated that drugs generally perform their task by affecting a related pathway rather than a single target. Thus, a new strategy of drug discovery, namely pathway-based drug discovery, has been proposed. Identifying associations between drugs and pathways plays a key role in the development of pathway-based drug discovery. Revealing drug-pathway associations by experimental methods is time-consuming and costly. Therefore, some computational models were established to predict potential drug-pathway associations. In this review, we first introduce the background of drugs and the concept of drug-pathway associations. Then, some publicly accessible databases and web servers about drug-pathway associations are listed. Next, we summarize some state-of-the-art computational methods from recent years for inferring drug-pathway associations and divide these methods into three classes, namely Bayesian sparse factor-based, matrix decomposition-based and other machine learning methods. In addition, we introduce several evaluation strategies to estimate the predictive performance of various computational models. In the end, we discuss the advantages and limitations of existing computational methods and provide some suggestions about future directions for data collection and the development of computational models.
|
49
|
Tran VD, Sperduti A, Backofen R, Costa F. Heterogeneous networks integration for disease-gene prioritization with node kernels. Bioinformatics 2020; 36:2649-2656. [PMID: 31990289 DOI: 10.1093/bioinformatics/btaa008] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Received: 07/22/2019] [Revised: 12/19/2019] [Accepted: 01/23/2020] [Indexed: 01/03/2025] Open
Abstract
MOTIVATION The identification of disease-gene associations is a task of fundamental importance in human health research. A typical approach consists in first encoding large gene/protein relational datasets as networks due to the natural and intuitive property of graphs for representing objects' relationships and then utilizing graph-based techniques to prioritize genes for successive low-throughput validation assays. Since different types of interactions between genes yield distinct gene networks, there is the need to integrate different heterogeneous sources to improve the reliability of prioritization systems. RESULTS We propose an approach based on three phases: first, we merge all sources in a single network, then we partition the integrated network according to edge density introducing a notion of edge type to distinguish the parts and finally, we employ a novel node kernel suitable for graphs with typed edges. We show how the node kernel can generate a large number of discriminative features that can be efficiently processed by linear regularized machine learning classifiers. We report state-of-the-art results on 12 disease-gene associations and on a time-stamped benchmark containing 42 newly discovered associations. AVAILABILITY AND IMPLEMENTATION Source code: https://github.com/dinhinfotech/DiGI.git. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Van Dinh Tran: Bioinformatics Group, Department of Computer Science, University of Freiburg, Freiburg im Breisgau, Germany
- Rolf Backofen: Bioinformatics Group, Department of Computer Science, University of Freiburg, Freiburg im Breisgau, Germany; Signalling Research Centres BIOSS and CIBSS, University of Freiburg, Germany
- Fabrizio Costa: Department of Computer Science, University of Exeter, Exeter, UK
|
50
|
|