1
Zhao K, Ji Z, Zhang L, Quan N, Li Y, Yu G, Bi X. HPOseq: a deep ensemble model for predicting the protein-phenotype relationships based on protein sequences. BMC Bioinformatics 2025; 26:110. [PMID: 40263997 PMCID: PMC12013097 DOI: 10.1186/s12859-025-06122-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/14/2024] [Accepted: 03/27/2025] [Indexed: 04/24/2025] Open
Abstract
BACKGROUND: Understanding the relationships between proteins and specific disease phenotypes contributes to the early detection of diseases and advances the development of personalized medicine. The acquisition of large amounts of proteomics data has facilitated this process. To improve discovery efficiency and reduce the time and financial costs of biological experiments, various computational methods have been developed and have yielded promising results. However, the lack of rich and reliable protein-related information still presents a challenge. RESULTS: In this paper, we propose an ensemble prediction model, named HPOseq, which predicts human protein-phenotype relationships based only on sequence information. HPOseq builds two base models. One directly extracts internal information from amino acid sequences as protein features to predict the associated phenotypes. The other builds a protein-protein network based on sequence similarity and extracts information between proteins for phenotype prediction. Finally, an ensemble module integrates the predictions from both base models to produce the final prediction. CONCLUSION: The results of 5-fold cross-validation show that HPOseq outperforms seven baseline methods for predicting protein-phenotype relationships. Moreover, we conduct case studies from the perspectives of phenotype annotation and protein analysis to verify the practical significance of HPOseq.
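As a rough illustration of the ensemble step described in this abstract, the sketch below combines per-phenotype scores from two base models with a weighted average. The abstract does not say how HPOseq's ensemble module integrates the predictions, so the averaging rule, the function name `ensemble_scores`, and the weight `alpha` are all assumptions made for illustration.

```python
import numpy as np

def ensemble_scores(intra_scores, inter_scores, alpha=0.5):
    """Weighted average of two score matrices (proteins x phenotypes).

    `intra_scores` stands in for the sequence-feature base model and
    `inter_scores` for the similarity-network base model. Both names and
    the averaging rule are hypothetical, not taken from the paper.
    """
    intra = np.asarray(intra_scores, dtype=float)
    inter = np.asarray(inter_scores, dtype=float)
    if intra.shape != inter.shape:
        raise ValueError("score matrices must have the same shape")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * intra + (1.0 - alpha) * inter

# Toy scores for 2 proteins and 2 phenotype terms.
intra = np.array([[0.9, 0.2], [0.1, 0.6]])
inter = np.array([[0.7, 0.4], [0.3, 0.8]])
combined = ensemble_scores(intra, inter, alpha=0.5)
print(combined)
```

A learned combiner (for example a small classifier stacked on both score matrices) would be a natural alternative; the fixed weight is used here only to keep the sketch short.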
Affiliation(s)
- Kai Zhao
  - School of Computer Science and Technology, Xinjiang University, Urumqi, 830011, China
- Zhuocheng Ji
  - School of Computer Science and Technology, Xinjiang University, Urumqi, 830011, China
- Linlin Zhang
  - School of Software, Xinjiang University, Urumqi, 830011, China
- Na Quan
  - School of Computer Science and Technology, Xinjiang University, Urumqi, 830011, China
- Yuheng Li
  - School of Computer Science and Technology, Xinjiang University, Urumqi, 830011, China
- Guanglei Yu
  - College of Medical Engineering and Technology, Xinjiang Medical University, Urumqi, 830011, China
  - School of Computer Science and Engineering, Central South University, Changsha, 410083, China
- Xuehua Bi
  - College of Medical Engineering and Technology, Xinjiang Medical University, Urumqi, 830011, China
  - School of Computer Science and Engineering, Central South University, Changsha, 410083, China
2
Jay KL, Gogate N, Ezell K, Andrews JC, Jangam SV, Hall PI, Pan H, Pham K, German R, Gomez V, Jellinek-Russo E, Storch E, Brain Gene Registry Consortium, Undiagnosed Diseases Network, Yamamoto S, Kanca O, Bellen HJ, Dierick H, Cogan JD, Phillips JA, Hamid R, Cassini T, Rives L, Posey JE, Wangler MF. Resolution of SLC6A1 variable expressivity in a multi-generational family using deep clinical phenotyping and Drosophila models. MEDRXIV : THE PREPRINT SERVER FOR HEALTH SCIENCES 2024:2024.09.27.24314092. [PMID: 39399018 PMCID: PMC11469343 DOI: 10.1101/2024.09.27.24314092] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/15/2024]
Abstract
Purpose: Variants in SLC6A1 result in a rare neurodevelopmental disorder characterized by a variable clinical presentation that includes developmental delay, epilepsy, motor dysfunction, and autism spectrum disorder. SLC6A1 haploinsufficiency has been confirmed as the predominant pathway of SLC6A1-related neurodevelopmental disorders (NDDs); however, the molecular mechanism underlying the variable clinical presentation remains unclear. Methods: Here, through the work of the Undiagnosed Diseases Network, we identify an undiagnosed individual with an inherited p.(A334S) variant of uncertain significance. To resolve this case and better understand the variable expressivity associated with SLC6A1, we assess the phenotypes of the proband alongside a cohort of cases diagnosed with SLC6A1-related NDDs. We then create an allelic series in Drosophila melanogaster to functionally characterize case variants. Results: We identify significant clinical overlap between the unsolved case and confirmed cases of SLC6A1-related NDDs and find a mild to severe clinical presentation associated with missense variants. We confirm phenotypes in flies expressing SLC6A1 variants consistent with a partial loss-of-function mechanism. Conclusion: We conclude that the p.(A334S) variant is a hypomorphic allele and begin to elucidate the underlying variability in SLC6A1-related NDDs. These insights will inform clinical diagnosis, prognosis, treatment, and therapeutic design for those living with SLC6A1-related NDDs.
Affiliation(s)
- Kristy L. Jay
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, USA
  - Jan and Dan Duncan Neurological Research Institute, Texas Children’s Hospital, Houston TX 77030
- Nikhita Gogate
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, USA
- Kim Ezell
  - Department of Pediatrics, Division of Medical Genetics and Genomic Medicine, Vanderbilt University Medical Center, Nashville, Tennessee, USA
- Jonathan C. Andrews
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, USA
  - Jan and Dan Duncan Neurological Research Institute, Texas Children’s Hospital, Houston TX 77030
- Sharayu V. Jangam
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, USA
  - Jan and Dan Duncan Neurological Research Institute, Texas Children’s Hospital, Houston TX 77030
- Paige I. Hall
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, USA
  - Jan and Dan Duncan Neurological Research Institute, Texas Children’s Hospital, Houston TX 77030
- Hongling Pan
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, USA
  - Jan and Dan Duncan Neurological Research Institute, Texas Children’s Hospital, Houston TX 77030
- Kelvin Pham
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, USA
- Ryan German
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, USA
  - Jan and Dan Duncan Neurological Research Institute, Texas Children’s Hospital, Houston TX 77030
- Vanessa Gomez
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, USA
  - Jan and Dan Duncan Neurological Research Institute, Texas Children’s Hospital, Houston TX 77030
- Eric Storch
  - Department of Psychiatry, Baylor College of Medicine, Houston, TX, USA
- Shinya Yamamoto
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, USA
  - Jan and Dan Duncan Neurological Research Institute, Texas Children’s Hospital, Houston TX 77030
- Oguz Kanca
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, USA
  - Jan and Dan Duncan Neurological Research Institute, Texas Children’s Hospital, Houston TX 77030
- Hugo J. Bellen
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, USA
  - Jan and Dan Duncan Neurological Research Institute, Texas Children’s Hospital, Houston TX 77030
- Herman Dierick
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, USA
- Joy D. Cogan
  - Department of Pediatrics, Division of Medical Genetics and Genomic Medicine, Vanderbilt University Medical Center, Nashville, Tennessee, USA
- John A. Phillips
  - Department of Pediatrics, Division of Medical Genetics and Genomic Medicine, Vanderbilt University Medical Center, Nashville, Tennessee, USA
- Rizwan Hamid
  - Department of Pediatrics, Division of Medical Genetics and Genomic Medicine, Vanderbilt University Medical Center, Nashville, Tennessee, USA
- Thomas Cassini
  - Department of Pediatrics, Division of Medical Genetics and Genomic Medicine, Vanderbilt University Medical Center, Nashville, Tennessee, USA
- Lynette Rives
  - Department of Pediatrics, Division of Medical Genetics and Genomic Medicine, Vanderbilt University Medical Center, Nashville, Tennessee, USA
- Jennifer E. Posey
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, USA
- Michael F. Wangler
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, USA
  - Jan and Dan Duncan Neurological Research Institute, Texas Children’s Hospital, Houston TX 77030
3
Liu W, Wang Z, You R, Xie C, Wei H, Xiong Y, Yang J, Zhu S. PLMSearch: Protein language model powers accurate and fast sequence search for remote homology. Nat Commun 2024; 15:2775. [PMID: 38555371 PMCID: PMC10981738 DOI: 10.1038/s41467-024-46808-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/28/2023] [Accepted: 03/08/2024] [Indexed: 04/02/2024] Open
Abstract
Homologous protein search is one of the most commonly used methods for protein annotation and analysis. Compared to structure search, detecting distant evolutionary relationships from sequences alone remains challenging. Here we propose PLMSearch (Protein Language Model), a homologous protein search method that takes only sequences as input. PLMSearch uses deep representations from a pre-trained protein language model and trains its similarity prediction model on a large number of real structure similarities. This enables PLMSearch to capture the remote homology information concealed behind the sequences. Extensive experimental results show that PLMSearch can search millions of query-target protein pairs in seconds, like MMseqs2, while increasing the sensitivity by more than threefold, and is comparable to state-of-the-art structure search methods. In particular, unlike traditional sequence search methods, PLMSearch can recall most remote homology pairs with dissimilar sequences but similar structures. PLMSearch is freely available at https://dmiip.sjtu.edu.cn/PLMSearch .
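To make the idea of searching over learned sequence representations concrete, the following sketch ranks target proteins by cosine similarity between fixed-length embedding vectors. This is a deliberately simplified stand-in: PLMSearch trains a dedicated similarity prediction model, which is not reproduced here. The function name `rank_targets` and the toy two-dimensional embeddings are hypothetical, and all vectors are assumed to be non-zero.

```python
import numpy as np

def rank_targets(query_emb, target_embs, top_k=3):
    """Return (target index, cosine similarity) pairs, best match first."""
    q = np.asarray(query_emb, dtype=float)
    t = np.asarray(target_embs, dtype=float)
    # Normalise to unit length so the dot product equals cosine similarity.
    q_unit = q / np.linalg.norm(q)
    t_unit = t / np.linalg.norm(t, axis=1, keepdims=True)
    sims = t_unit @ q_unit
    order = np.argsort(-sims)[:top_k]
    return [(int(i), float(sims[i])) for i in order]

# Toy embeddings; real protein language model embeddings are much larger.
targets = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
hits = rank_targets([1.0, 0.1], targets, top_k=2)
print(hits)
```

Because the comparison is a single matrix-vector product, it scales to large target sets, which is one reason embedding-based search can be fast.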
Affiliation(s)
- Wei Liu
  - Institute of Science and Technology for Brain-Inspired Intelligence and MOE Frontiers Center for Brain Science, Fudan University, 200433, Shanghai, China
- Ziye Wang
  - Institute of Science and Technology for Brain-Inspired Intelligence and MOE Frontiers Center for Brain Science, Fudan University, 200433, Shanghai, China
- Ronghui You
  - Institute of Science and Technology for Brain-Inspired Intelligence and MOE Frontiers Center for Brain Science, Fudan University, 200433, Shanghai, China
- Chenghan Xie
  - School of Mathematical Sciences, Fudan University, 200433, Shanghai, China
- Hong Wei
  - School of Mathematical Sciences, Nankai University, 300071, Tianjin, China
- Yi Xiong
  - Department of Bioinformatics and Biostatistics, Shanghai Jiao Tong University, 200240, Shanghai, China
- Jianyi Yang
  - Ministry of Education Frontiers Science Center for Nonlinear Expectations, Research Center for Mathematics and Interdisciplinary Science, Shandong University, 266237, Qingdao, China
- Shanfeng Zhu
  - Institute of Science and Technology for Brain-Inspired Intelligence and MOE Frontiers Center for Brain Science, Fudan University, 200433, Shanghai, China
  - Shanghai Qi Zhi Institute, Shanghai, China
  - Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence (Fudan University), Ministry of Education, Shanghai, China
  - Shanghai Key Lab of Intelligent Information Processing and Shanghai Institute of Artificial Intelligence Algorithm, Fudan University, Shanghai, China
  - Zhangjiang Fudan International Innovation Center, Shanghai, China
4
Brechtmann F, Bechtler T, Londhe S, Mertes C, Gagneur J. Evaluation of input data modality choices on functional gene embeddings. NAR Genom Bioinform 2023; 5:lqad095. [PMID: 37942285 PMCID: PMC10629286 DOI: 10.1093/nargab/lqad095] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/16/2023] [Revised: 09/07/2023] [Accepted: 09/28/2023] [Indexed: 11/10/2023] Open
Abstract
Functional gene embeddings, numerical vectors capturing gene function, provide a promising way to integrate functional gene information into machine learning models. These embeddings are learnt by applying self-supervised machine-learning algorithms on various data types including quantitative omics measurements, protein-protein interaction networks and literature. However, downstream evaluations comparing alternative data modalities used to construct functional gene embeddings have been lacking. Here we benchmarked functional gene embeddings obtained from various data modalities for predicting disease-gene lists, cancer drivers, phenotype-gene associations and scores from genome-wide association studies. Off-the-shelf predictors trained on precomputed embeddings matched or outperformed dedicated state-of-the-art predictors, demonstrating their high utility. Embeddings based on literature and protein-protein interactions inferred from low-throughput experiments outperformed embeddings derived from genome-wide experimental data (transcriptomics, deletion screens and protein sequence) when predicting curated gene lists. In contrast, they did not perform better when predicting genome-wide association signals and were biased towards highly-studied genes. These results indicate that embeddings derived from literature and low-throughput experiments appear favourable in many existing benchmarks because they are biased towards well-studied genes and should therefore be considered with caution. Altogether, our study and precomputed embeddings will facilitate the development of machine-learning models in genetics and related fields.
Affiliation(s)
- Felix Brechtmann
  - TUM School of Computation, Information and Technology, Technical University of Munich, Garching, Germany
  - Munich Center for Machine Learning, Munich, Germany
- Thibault Bechtler
  - TUM School of Computation, Information and Technology, Technical University of Munich, Garching, Germany
- Shubhankar Londhe
  - TUM School of Computation, Information and Technology, Technical University of Munich, Garching, Germany
- Christian Mertes
  - TUM School of Computation, Information and Technology, Technical University of Munich, Garching, Germany
  - Munich Data Science Institute, Technical University of Munich, Garching, Germany
  - Institute of Human Genetics, School of Medicine, Technical University of Munich, Munich, Germany
- Julien Gagneur
  - TUM School of Computation, Information and Technology, Technical University of Munich, Garching, Germany
  - Institute of Human Genetics, School of Medicine, Technical University of Munich, Munich, Germany
  - Computational Health Center, Helmholtz Center Munich, Neuherberg, Germany
5
Bi X, Liang W, Zhao Q, Wang J. SSLpheno: a self-supervised learning approach for gene-phenotype association prediction using protein-protein interactions and gene ontology data. Bioinformatics 2023; 39:btad662. [PMID: 37941450 PMCID: PMC10666204 DOI: 10.1093/bioinformatics/btad662] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/15/2023] [Revised: 10/17/2023] [Accepted: 11/03/2023] [Indexed: 11/10/2023] Open
Abstract
MOTIVATION: Medical genomics faces significant challenges in interpreting disease phenotype and genetic heterogeneity. Despite the establishment of standardized disease phenotype databases, computational methods for predicting gene-phenotype associations still suffer from imbalanced category distribution and a lack of labeled data in small categories. RESULTS: To address the problem of labeled-data scarcity, we propose a self-supervised learning strategy for gene-phenotype association prediction, called SSLpheno. Our approach utilizes an attributed network that integrates protein-protein interactions and gene ontology data. We apply a Laplacian-based filter to ensure feature smoothness and use self-supervised training to optimize node feature representation. Specifically, we calculate the cosine similarity of feature vectors and select positive and negative sample nodes to build the reconstruction training labels. We employ a deep neural network for multi-label classification of phenotypes in the downstream task. Our experimental results demonstrate that SSLpheno outperforms state-of-the-art methods, especially in categories with fewer annotations. Moreover, our case studies illustrate the potential of SSLpheno as an effective prescreening tool for gene-phenotype association identification. AVAILABILITY AND IMPLEMENTATION: https://github.com/bixuehua/SSLpheno.
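The two preprocessing ideas mentioned in this abstract, a Laplacian-based smoothing filter and cosine-similarity selection of positive and negative node pairs, can be sketched as follows. The filter form `I - k * L_sym`, the parameter `k`, the pair counts, and both function names are assumptions for illustration; the abstract does not give SSLpheno's exact filter or selection thresholds. Feature rows are assumed to be non-zero.

```python
import numpy as np

def laplacian_smooth(adj, feats, k=0.5, steps=1):
    """Apply the low-pass filter (I - k * L_sym) to node features `steps` times."""
    adj = np.asarray(adj, dtype=float)
    feats = np.asarray(feats, dtype=float)
    deg = adj.sum(axis=1)
    # Isolated nodes get a zero scaling factor instead of a division by zero.
    inv_sqrt = np.where(deg > 0, 1.0 / np.sqrt(np.where(deg > 0, deg, 1.0)), 0.0)
    norm_adj = inv_sqrt[:, None] * adj * inv_sqrt[None, :]
    lap = np.eye(adj.shape[0]) - norm_adj      # symmetric normalised Laplacian
    filt = np.eye(adj.shape[0]) - k * lap
    for _ in range(steps):
        feats = filt @ feats
    return feats

def select_pairs(feats, n_pos, n_neg):
    """Pick the most similar node pairs as positives, the least similar as negatives."""
    feats = np.asarray(feats, dtype=float)
    unit = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = unit @ unit.T
    iu, ju = np.triu_indices(sim.shape[0], k=1)   # each unordered pair once
    order = np.argsort(-sim[iu, ju], kind="stable")
    pairs = list(zip(iu[order].tolist(), ju[order].tolist()))
    return pairs[:n_pos], pairs[len(pairs) - n_neg:]

# Toy graph: a 3-node path with 2-dimensional node features.
adj = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])
feats = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
smoothed = laplacian_smooth(adj, feats, k=0.5)
positives, negatives = select_pairs(feats, n_pos=1, n_neg=1)
print(positives, negatives)
```

In a full pipeline the selected pairs would serve as pseudo-labels for a reconstruction loss; that training loop is omitted here.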
Affiliation(s)
- Xuehua Bi
  - Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha 410083, China
  - Medical Engineering and Technology College, Xinjiang Medical University, Urumqi 830017, China
- Weiyang Liang
  - College of Information Science and Engineering, Xinjiang University, Urumqi 830046, China
- Qichang Zhao
  - Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha 410083, China
- Jianxin Wang
  - Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha 410083, China
6
Zhang W, Liu B. iSnoDi-MDRF: Identifying snoRNA-Disease Associations Based on Multiple Biological Data by Ranking Framework. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2023; 20:3013-3019. [PMID: 37030816 DOI: 10.1109/tcbb.2023.3258448] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/19/2023]
Abstract
Accumulating evidence indicates that the dysregulation of small nucleolar RNAs (snoRNAs) is associated with diseases. Identifying snoRNA-disease associations by computational methods is desirable for biologists, because it can save considerable cost and time compared with biological experiments. However, it still faces the following challenges: (i) many snoRNAs have been detected in recent years, but only a few have been proven to be associated with diseases; (ii) computational predictors trained with only a few known snoRNA-disease associations fail to identify snoRNA-disease associations accurately. In this study, we propose a ranking framework, called iSnoDi-MDRF, to identify potential snoRNA-disease associations based on multiple biological data. It has the following highlights: (i) iSnoDi-MDRF integrates a ranking framework, which can identify not only potential associations between known snoRNAs and diseases but also diseases associated with new snoRNAs; (ii) known gene-disease associations are employed to help train a mature model for predicting snoRNA-disease associations. Experimental results illustrate that iSnoDi-MDRF is well suited to identifying potential snoRNA-disease associations. The web server of the iSnoDi-MDRF predictor is freely available at http://bliulab.net/iSnoDi-MDRF/.
7
Liang Q, Zhang W, Wu H, Liu B. LncRNA-disease association identification using graph auto-encoder and learning to rank. Brief Bioinform 2023; 24:6955271. [PMID: 36545805 DOI: 10.1093/bib/bbac539] [Citation(s) in RCA: 12] [Impact Index Per Article: 6.0] [Received: 08/20/2022] [Revised: 10/18/2022] [Accepted: 11/08/2022] [Indexed: 12/24/2022] Open
Abstract
Discovering the relationships between long non-coding RNAs (lncRNAs) and diseases is significant for the treatment, diagnosis and prevention of diseases. However, the currently identified lncRNA-disease associations are insufficient because wet laboratory experiments are expensive and laborious. Therefore, it is important to develop an efficient computational method for predicting potential lncRNA-disease associations. Previous methods showed that combining the lncRNA-disease associations predicted by different classification methods via a Learning to Rank (LTR) algorithm can be effective for predicting potential lncRNA-disease associations. However, when the classification results are incorrect, the ranking results will inevitably be affected. We propose the GraLTR-LDA predictor, based on biological knowledge graphs and a ranking framework, for predicting potential lncRNA-disease associations. First, a homogeneous graph and a heterogeneous graph are constructed by integrating multi-source biological information. Then, GraLTR-LDA integrates a graph auto-encoder and an attention mechanism to extract embedded features from the constructed graphs. Finally, GraLTR-LDA incorporates the embedded features into the LTR via feature crossing statistical strategies to predict the priority order of diseases associated with query lncRNAs. Experimental results demonstrate that GraLTR-LDA outperforms the other state-of-the-art predictors and can effectively detect potential lncRNA-disease associations. Availability and implementation: Datasets and source codes are available at http://bliulab.net/GraLTR-LDA.
Affiliation(s)
- Qi Liang
  - School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China
- Wenxiang Zhang
  - School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China
- Hao Wu
  - School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China
- Bin Liu
  - School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China
  - Advanced Research Institute of Multidisciplinary Science, Beijing Institute of Technology, Beijing, China
8
Integration of Human Protein Sequence and Protein-Protein Interaction Data by Graph Autoencoder to Identify Novel Protein-Abnormal Phenotype Associations. Cells 2022; 11:cells11162485. [PMID: 36010562 PMCID: PMC9406402 DOI: 10.3390/cells11162485] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/03/2022] [Revised: 07/31/2022] [Accepted: 08/05/2022] [Indexed: 11/18/2022] Open
Abstract
Understanding gene functions and their associated abnormal phenotypes is crucial in the prevention, diagnosis and treatment of diseases. The Human Phenotype Ontology (HPO) is a standardized vocabulary for describing the phenotype abnormalities associated with human diseases. However, the current HPO annotations are far from complete, and only a small fraction of human protein-coding genes have HPO annotations. Thus, it is necessary to predict protein-phenotype associations using computational methods. Protein sequences can indicate the structure and function of proteins, and interacting proteins are more likely to have the same function. It is promising to integrate these features for predicting HPO annotations of human proteins. We developed GraphPheno, a semi-supervised method based on graph autoencoders, which does not require feature engineering to capture deep features from protein sequences, while also taking into account the topological properties of the protein–protein interaction network to predict the relationships between human genes/proteins and abnormal phenotypes. Cross-validation and independent dataset tests show that GraphPheno has satisfactory prediction performance. The algorithm is further confirmed on automatic HPO annotation for no-knowledge proteins under the benchmark of the second Critical Assessment of Functional Annotation, 2013–2014 (CAFA2), where GraphPheno surpasses most existing methods. Further bioinformatics analysis shows that certain phenotype-associated genes predicted by GraphPheno share similar biological properties with known ones. In a case study on the phenotype of abnormality of mitochondrial respiratory chain, the top prioritized genes are validated by recent papers. We believe that GraphPheno will help to reveal more associations between genes and phenotypes, and contribute to the discovery of drug targets.
9
Zhang W, Hou J, Liu B. iPiDA-LTR: Identifying piwi-interacting RNA-disease associations based on Learning to Rank. PLoS Comput Biol 2022; 18:e1010404. [PMID: 35969645 PMCID: PMC9410559 DOI: 10.1371/journal.pcbi.1010404] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 03/27/2022] [Revised: 08/25/2022] [Accepted: 07/18/2022] [Indexed: 12/01/2022] Open
Abstract
Piwi-interacting RNAs (piRNAs) are regarded as drug targets and biomarkers for the diagnosis and therapy of diseases. However, biological experiments cost substantial time and resources, and the existing computational methods focus only on identifying missing associations between known piRNAs and diseases. With the fast development of biological experiments, more and more piRNAs are being detected. Therefore, identifying the disease associations of newly detected piRNAs has significant theoretical value and practical significance for understanding the pathogenesis of diseases. In this study, the iPiDA-LTR predictor is proposed to identify associations between piRNAs and diseases based on Learning to Rank. The iPiDA-LTR predictor not only identifies the missing associations between known piRNAs and diseases, but also detects diseases associated with newly detected piRNAs. Experimental results demonstrate that iPiDA-LTR effectively predicts piRNA-disease associations, outperforming the other related methods.
Affiliation(s)
- Wenxiang Zhang
  - School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China
- Jialu Hou
  - School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China
- Bin Liu
  - School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China
  - Advanced Research Institute of Multidisciplinary Science, Beijing Institute of Technology, Beijing, China
10
Zhang W, Wei H, Liu B. idenMD-NRF: a ranking framework for miRNA-disease association identification. Brief Bioinform 2022; 23:6604995. [PMID: 35679537 DOI: 10.1093/bib/bbac224] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 03/15/2022] [Revised: 04/18/2022] [Accepted: 05/11/2022] [Indexed: 11/12/2022] Open
Abstract
Identifying miRNA-disease associations is an important task for revealing the pathogenic mechanisms of complex diseases. Various computational methods have been proposed. Although these methods have achieved encouraging performance in detecting missing associations between known miRNAs and diseases, accurately predicting the diseases associated with new miRNAs remains difficult. In this regard, a ranking framework named idenMD-NRF is proposed for miRNA-disease association identification. idenMD-NRF treats miRNA-disease association identification as an information retrieval task. Given a novel query miRNA, idenMD-NRF employs a Learning to Rank algorithm to rank associated diseases based on high-level association features and various predictors. The experimental results on two independent test datasets indicate that idenMD-NRF is superior to the other compared predictors. A user-friendly web server of the idenMD-NRF predictor is freely available at http://bliulab.net/idenMD-NRF/.
Affiliation(s)
- Wenxiang Zhang
  - School of Computer Science and Technology, Beijing Institute of Technology, Beijing, 100081, China
- Hang Wei
  - School of Computer Science and Technology, Xidian University, Xi'an, Shaanxi 710071, China
- Bin Liu
  - School of Computer Science and Technology, Beijing Institute of Technology, Beijing, 100081, China
  - Advanced Research Institute of Multidisciplinary Science, Beijing Institute of Technology, Beijing, 100081, China
11
Wu H, Liang Q, Zhang W, Zou Q, El-Latif Hesham A, Liu B. iLncDA-LTR: Identification of lncRNA-disease associations by learning to rank. Comput Biol Med 2022; 146:105605. [PMID: 35594681 DOI: 10.1016/j.compbiomed.2022.105605] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 04/01/2022] [Revised: 04/27/2022] [Accepted: 05/09/2022] [Indexed: 12/12/2022]
Abstract
Identifying the associations between lncRNAs and diseases is helpful for the treatment and diagnosis of complex diseases. The existing computational methods mainly focus on the identification of associations between known lncRNAs and known diseases. However, with the application of high-throughput sequencing in lncRNA research, more and more lncRNAs have been detected. Predicting diseases related to newly detected lncRNAs has not been fully explored. Therefore, there is an urgent need to develop powerful computational methods to predict diseases related to newly detected lncRNAs. In this paper, we propose a Learning to Rank (LTR)-based method called iLncDA-LTR to predict diseases related to newly detected lncRNAs. iLncDA-LTR treats this task as an information retrieval task. The newly detected lncRNAs and diseases are considered as queries and documents, respectively. For a given newly detected lncRNA (query), iLncDA-LTR integrates multiple sources of relevant information into LTR to predict candidate diseases associated with the query lncRNA. Experimental results show that iLncDA-LTR outperforms the other existing state-of-the-art predictors on an independent dataset. The corresponding web server of iLncDA-LTR has been constructed as well (http://bliulab.net/iLncDA-LTR/).
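The query/document framing used by the Learning to Rank entries in this list can be illustrated with a minimal pairwise ranker. The sketch below fits a linear scoring function with a pairwise logistic loss (in the spirit of RankNet) and then orders candidate diseases for a new query. It is not the algorithm used by iLncDA-LTR or the related predictors; the function names, toy features, labels, and hyperparameters are all hypothetical.

```python
import numpy as np

def train_pairwise_ranker(features, labels, query_ids, lr=0.1, epochs=200):
    """Fit linear weights so relevant documents outscore irrelevant ones per query."""
    X = np.asarray(features, dtype=float)
    y = np.asarray(labels)
    q = np.asarray(query_ids)
    diffs = []
    for qid in np.unique(q):
        idx = np.flatnonzero(q == qid)
        for i in idx:
            for j in idx:
                if y[i] > y[j]:                  # i should rank above j
                    diffs.append(X[i] - X[j])
    if not diffs:
        raise ValueError("need at least one pair with different labels in a query")
    D = np.vstack(diffs)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        margins = D @ w
        # Gradient of mean(log(1 + exp(-margin))) with respect to w.
        grad = -(D * (1.0 / (1.0 + np.exp(margins)))[:, None]).mean(axis=0)
        w -= lr * grad
    return w

def rank_documents(w, doc_features, doc_names):
    """Return document names sorted by descending score."""
    scores = np.asarray(doc_features, dtype=float) @ w
    return [doc_names[i] for i in np.argsort(-scores)]

# Toy data: two query lncRNAs, two candidate diseases each (1 = associated).
features = [[0.9, 0.1], [0.2, 0.8], [0.8, 0.3], [0.1, 0.5]]
labels = [1, 0, 1, 0]
query_ids = ["lnc_a", "lnc_a", "lnc_b", "lnc_b"]
w = train_pairwise_ranker(features, labels, query_ids)

# Rank candidate diseases for a hypothetical newly detected lncRNA.
ranking = rank_documents(w, [[0.3, 0.9], [0.7, 0.2]], ["disease_y", "disease_x"])
print(ranking)
```

Pairs are formed only within a query, which is what distinguishes this setup from plain binary classification: the model learns a relative order of diseases per lncRNA rather than an absolute label.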
Affiliation(s)
- Hao Wu
  - School of Computer Science and Technology, Beijing Institute of Technology, Beijing, 100081, China
- Qi Liang
  - School of Computer Science and Technology, Beijing Institute of Technology, Beijing, 100081, China
- Wenxiang Zhang
  - School of Computer Science and Technology, Beijing Institute of Technology, Beijing, 100081, China
- Quan Zou
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
- Abd El-Latif Hesham
  - Genetics Department, Faculty of Agriculture, Beni-Suef University, Beni-Suef, 62511, Egypt
- Bin Liu
  - School of Computer Science and Technology, Beijing Institute of Technology, Beijing, 100081, China
  - Advanced Research Institute of Multidisciplinary Science, Beijing Institute of Technology, Beijing, China
12
Ru X, Ye X, Sakurai T, Zou Q. NerLTR-DTA: drug-target binding affinity prediction based on neighbor relationship and learning to rank. Bioinformatics 2022; 38:1964-1971. [PMID: 35134828 DOI: 10.1093/bioinformatics/btac048] [Citation(s) in RCA: 22] [Impact Index Per Article: 7.3] [Received: 10/29/2021] [Revised: 12/20/2021] [Accepted: 01/28/2022] [Indexed: 02/03/2023] Open
Abstract
MOTIVATION Drug-target interaction prediction plays an important role in new drug discovery and drug repurposing. Binding affinity indicates the strength of drug-target interactions. Predicting drug-target binding affinity is expected to provide promising candidates for biologists, which can effectively reduce the workload of wet laboratory experiments and speed up the entire process of drug research. As numerous new proteins are sequenced and new compounds are synthesized, several improved computational methods have been proposed for such predictions, but some challenges remain. (i) Many methods discuss and implement only one application scenario: they focus on drug repurposing and ignore the discovery of new drugs and targets. (ii) Many methods do not consider the priority order of proteins (or drugs) related to each target drug (or protein). Therefore, it is necessary to develop a comprehensive method that can be used in multiple scenarios and focuses on candidate order. RESULTS In this study, we propose a method called NerLTR-DTA that uses the neighbor relationship of similarity and sharing to extract features, and applies a ranking framework with regression attributes to predict affinity values and the priority order of a query drug (or query target) and its related proteins (or compounds). Notably, because learning to rank allows different queries to be set, the method can be applied in multiple scenarios, including the discovery of new drugs and new targets. Experimental results on two commonly used datasets show that NerLTR-DTA outperforms some state-of-the-art competing methods. NerLTR-DTA achieves excellent performance in all application scenarios mentioned in this study, and the r_m^2(test) values indicate that this performance is not obtained by chance. Moreover, statistics on the association relationships of each query drug (or query protein) show that NerLTR-DTA provides accurate ranking lists for the relevant results of most queries. In general, NerLTR-DTA is a powerful tool for predicting drug-target associations and can contribute to new drug discovery and drug repurposing. AVAILABILITY AND IMPLEMENTATION The proposed method is implemented in Python and Java. Source codes and datasets are available at https://github.com/RUXIAOQING964914140/NerLTR-DTA.
Affiliation(s)
- Xiaoqing Ru
- Department of Computer Science, University of Tsukuba, Tsukuba 3058577, Japan; Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, Zhejiang 324000, China
| | - Xiucai Ye
- Department of Computer Science, University of Tsukuba, Tsukuba 3058577, Japan
| | - Tetsuya Sakurai
- Department of Computer Science, University of Tsukuba, Tsukuba 3058577, Japan
| | - Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 610054, China; Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, Zhejiang 324000, China
|
13
|
He Q, Shen H, Shao X, Chen W, Wu Y, Liu R, Li S, Zhou Z. Cardiovascular Phenotypes Profiling for L-Transposition of the Great Arteries and Prognosis Analysis. Front Cardiovasc Med 2022; 8:781041. [PMID: 35127856 PMCID: PMC8814104 DOI: 10.3389/fcvm.2021.781041] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/22/2021] [Accepted: 12/23/2021] [Indexed: 11/24/2022] Open
Abstract
Objectives Congenitally corrected transposition of the great arteries (ccTGA) is a rare and complex congenital heart disease characterized by double discordance. Its numerous co-existing anomalies complicate prognosis evaluation and clinical decision-making. We aim to delineate a novel ccTGA clustering modality guided by the Human Phenotype Ontology (HPO) and to elucidate the relationship between phenotypes and prognosis in patients with ccTGA. Methods We performed a retrospective review of 270 patients diagnosed with ccTGA at Fuwai Hospital from 2009 to 2020, together with cross-sectional follow-up. An HPO-instructed clustering method was applied to ccTGA risk stratification. Kaplan-Meier survival analysis, landmark analysis, and Cox regression analysis were used to investigate differences in outcomes among clusters. Results The median follow-up time was 4.29 (2.07–7.37) years. Three distinct phenotypic clusters were obtained after HPO-instructed clustering, with 21 patients in cluster 1, 136 in cluster 2, and 113 in cluster 3. Landmark analysis revealed significantly worse mid-term outcomes in all-cause mortality (p = 0.021) and composite endpoints (p = 0.004) for cluster 3 compared with clusters 1 and 2. Multivariate analysis indicated that pulmonary arterial hypertension (PAH), atrioventricular septal defect (AVSD), and arrhythmia were risk factors for composite endpoints. Moreover, surgical treatment differed significantly among the three groups (p < 0.001), and surgical strategies had different effects on the prognosis of the different phenotypic clusters. Conclusions Human Phenotype Ontology-instructed clustering can be a potentially powerful tool for phenotypic risk stratification in patients with complex congenital heart diseases, which may improve prognosis prediction and clinical decision-making.
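The Kaplan-Meier analysis named in the Methods rests on the product-limit estimator, S(t) = Π (1 − d_i / n_i) over event times t_i ≤ t, where d_i is the number of events at t_i and n_i the number of patients still at risk. The sketch below is a minimal illustration of that formula only; the function name and the toy follow-up data are hypothetical and are not taken from the study.

```python
def kaplan_meier(times, events):
    """Product-limit survival estimate at each distinct event time.

    times:  follow-up durations (any consistent unit)
    events: 1 if the endpoint was observed, 0 if the patient was censored
    Returns a list of (time, survival probability) pairs.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        observed = 0   # endpoints observed exactly at time t
        leaving = 0    # everyone leaving the risk set at time t
        while i < len(data) and data[i][0] == t:
            observed += data[i][1]
            leaving += 1
            i += 1
        if observed:
            survival *= 1 - observed / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve


# Hypothetical toy cohort (years of follow-up, event indicator).
toy_times = [1, 2, 2, 3, 4]
toy_events = [1, 1, 0, 0, 1]
toy_curve = kaplan_meier(toy_times, toy_events)
```

Censored patients shrink the risk set without lowering the curve, which is why the estimator can use incomplete follow-up.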
Affiliation(s)
- Qiyu He
- Pediatric Cardiac Center, Fuwai Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China
| | - Huayan Shen
- Department of Laboratory Medicine, National Center for Cardiovascular Diseases, Fuwai Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China
| | - Xinyang Shao
- Department of Laboratory Medicine, National Center for Cardiovascular Diseases, Fuwai Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China
| | - Wen Chen
- Department of Laboratory Medicine, National Center for Cardiovascular Diseases, Fuwai Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China
| | - Yafeng Wu
- Center for Applied Statistics, School of Statistics, Renmin University of China, Beijing, China
| | - Rui Liu
- Pediatric Cardiac Center, Fuwai Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China
| | - Shoujun Li
- Pediatric Cardiac Center, Fuwai Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China
- *Correspondence: Shoujun Li
| | - Zhou Zhou
- Department of Laboratory Medicine, National Center for Cardiovascular Diseases, Fuwai Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China
|
14
|
Liu L, Mamitsuka H, Zhu S. HPODNets: deep graph convolutional networks for predicting human protein-phenotype associations. Bioinformatics 2022; 38:799-808. [PMID: 34672333 DOI: 10.1093/bioinformatics/btab729] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 04/21/2021] [Revised: 09/18/2021] [Accepted: 10/18/2021] [Indexed: 02/03/2023] Open
Abstract
MOTIVATION Deciphering the relationship between human genes/proteins and abnormal phenotypes is of great importance in the prevention, diagnosis and treatment of diseases. The Human Phenotype Ontology (HPO) is a standardized vocabulary that describes the phenotype abnormalities encountered in human disorders. However, the current HPO annotations are still incomplete. Thus, it is necessary to computationally predict human protein-phenotype associations. Current cutting-edge computational methods for annotating proteins (such as functional annotation) share three important features: (i) multiple network input, (ii) semi-supervised learning and (iii) a deep graph convolutional network (GCN). However, no existing method for predicting HPO annotations of human proteins has all of these features. RESULTS We develop HPODNets, which has all three of the above features, for predicting human protein-phenotype associations. HPODNets adopts a deep GCN with eight layers, which allows it to capture high-order topological information from multiple interaction networks. Empirical results with both cross-validation and temporal validation demonstrate that HPODNets outperforms seven competing state-of-the-art methods for protein function prediction. HPODNets, with its deep GCN architecture, is confirmed to be effective for predicting HPO annotations of human proteins and, more generally, for node label ranking problems with multiple biomolecular network inputs in bioinformatics. AVAILABILITY AND IMPLEMENTATION https://github.com/liulizhi1996/HPODNets. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Lizhi Liu
- School of Computer Science, Fudan University, Shanghai 200433, China
| | - Hiroshi Mamitsuka
- Bioinformatics Center, Institute for Chemical Research, Kyoto University, Uji, Kyoto Prefecture 611-0011, Japan; Department of Computer Science, Aalto University, Espoo 02150, Finland
| | - Shanfeng Zhu
- Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai 200433, China; Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence (Fudan University), Ministry of Education, Shanghai 200433, China; MOE Frontiers Center for Brain Science, Fudan University, Shanghai 200433, China; Zhangjiang Fudan International Innovation Center, Shanghai 200433, China; Shanghai Key Lab of Intelligent Information Processing, Fudan University, Shanghai 200433, China; Institute of Artificial Intelligence Biomedicine, Nanjing University, Nanjing 210032, China
|
15
|
Pourreza Shahri M, Kahanda I. Deep semi-supervised learning ensemble framework for classifying co-mentions of human proteins and phenotypes. BMC Bioinformatics 2021; 22:500. [PMID: 34656098 PMCID: PMC8520253 DOI: 10.1186/s12859-021-04421-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/14/2021] [Accepted: 10/04/2021] [Indexed: 11/13/2022] Open
Abstract
Background Identifying human protein-phenotype relationships has attracted researchers in bioinformatics and biomedical natural language processing due to its importance in uncovering rare and complex diseases. Since experimental validation of protein-phenotype associations is prohibitively expensive, automated tools capable of accurately extracting these associations from the biomedical text are in high demand. However, while the manual annotation of protein-phenotype co-mentions required for training such models is highly resource-consuming, extracting millions of unlabeled co-mentions is straightforward. Results In this study, we propose a novel deep semi-supervised ensemble framework that combines deep neural networks, semi-supervised learning, and ensemble learning for classifying human protein-phenotype co-mentions with the help of unlabeled data. This framework can incorporate an extensive collection of unlabeled sentence-level co-mentions of human proteins and phenotypes alongside a small labeled dataset to enhance overall performance. We develop PPPredSS, a prototype of our proposed semi-supervised framework that combines sophisticated language models, convolutional networks, and recurrent networks. Our experimental results demonstrate that the proposed approach achieves a new state-of-the-art performance in classifying human protein-phenotype co-mentions by outperforming other supervised and semi-supervised counterparts. Furthermore, we highlight the utility of PPPredSS in powering a curation assistant system through case studies involving a group of biologists. Conclusions This article presents a novel approach for human protein-phenotype co-mention classification based on deep, semi-supervised, and ensemble learning. The insights and findings from this work have implications for biomedical researchers, biocurators, and the text mining community working on biomedical relationship extraction.
Affiliation(s)
| | - Indika Kahanda
- School of Computing, University of North Florida, Jacksonville, USA.
|
16
|
An automated and combinative method for the predictive ranking of candidate effector proteins of fungal plant pathogens. Sci Rep 2021; 11:19731. [PMID: 34611252 PMCID: PMC8492765 DOI: 10.1038/s41598-021-99363-0] [Citation(s) in RCA: 32] [Impact Index Per Article: 8.0] [Received: 03/31/2021] [Accepted: 09/16/2021] [Indexed: 01/29/2023] Open
Abstract
Fungal plant pathogens promote infection of their hosts through the release of 'effectors', a broad class of cytotoxic or virulence-promoting molecules. Effectors may be recognised by resistance or sensitivity receptors in the host, which can determine disease outcomes. Accurate prediction of effectors remains a major challenge in plant pathology, but if achieved will facilitate rapid improvements to host disease resistance. This study presents Predector, a novel tool and pipeline for the ranking of predicted effector candidates, which interfaces with multiple software tools and methods, aggregates disparate features that are relevant to fungal effector proteins, and applies a pairwise learning to rank approach. Predector outperformed a typical combination of secretion and effector prediction methods in terms of ranking performance when applied to a curated set of confirmed effectors derived from multiple species. We present Predector (https://github.com/ccdmb/predector) as a useful tool for the ranking of predicted effector candidates, which also aggregates and reports additional supporting information relevant to effector and secretome prediction in a simple, efficient, and reproducible manner.
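The pairwise learning to rank approach mentioned in this abstract can be illustrated with a small sketch: candidates are compared two at a time, and a scoring function is fitted so that the higher-labelled member of each pair receives the higher score. The code below is a generic illustration under stated assumptions, not Predector's implementation; the perceptron-style linear model, the function names, and the toy features are all hypothetical.

```python
from itertools import combinations


def pairwise_examples(features, labels):
    """Turn one set of candidates into (difference vector, +1/-1) pairs."""
    pairs = []
    for i, j in combinations(range(len(labels)), 2):
        if labels[i] == labels[j]:
            continue  # ties carry no ordering information
        diff = [a - b for a, b in zip(features[i], features[j])]
        pairs.append((diff, 1 if labels[i] > labels[j] else -1))
    return pairs


def train_linear_ranker(pairs, epochs=50, lr=0.1):
    """Perceptron-style updates on pairwise differences (illustrative only)."""
    w = [0.0] * len(pairs[0][0])
    for _ in range(epochs):
        for diff, sign in pairs:
            margin = sign * sum(wi * di for wi, di in zip(w, diff))
            if margin <= 0:  # pair is mis-ordered or undecided: nudge weights
                w = [wi + lr * sign * di for wi, di in zip(w, diff)]
    return w


def rank(features, w):
    """Return candidate indices sorted from highest to lowest score."""
    scores = [sum(wi * xi for wi, xi in zip(w, x)) for x in features]
    return sorted(range(len(features)), key=lambda k: scores[k], reverse=True)


# Hypothetical toy data: 4 candidate proteins with 2 made-up features each;
# label 1 = confirmed effector, 0 = not an effector.
X = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3], [0.1, 0.9]]
y = [1, 0, 1, 0]
pairs = pairwise_examples(X, y)
w = train_linear_ranker(pairs)
order = rank(X, w)
```

Training on pairs optimises the relative order of candidates directly, which matches the goal of producing a ranked shortlist better than a per-candidate yes/no classifier does.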
|
17
|
Liu L, Zhu S. Computational Methods for Prediction of Human Protein-Phenotype Associations: A Review. Phenomics (Cham, Switzerland) 2021; 1:171-185. [PMID: 36939789 PMCID: PMC9590544 DOI: 10.1007/s43657-021-00019-w] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 02/23/2021] [Revised: 06/05/2021] [Accepted: 06/16/2021] [Indexed: 12/01/2022]
Abstract
Deciphering the relationship between human proteins (genes) and phenotypes is one of the fundamental tasks in phenomics research. The Human Phenotype Ontology (HPO) builds upon a standardized logical vocabulary to describe the abnormal phenotypes encountered in human diseases and paves the way towards the computational analysis of their genetic causes. To date, many computational methods have been proposed to predict the HPO annotations of proteins. In this paper, we conduct a comprehensive review of the existing approaches to predicting HPO annotations of novel proteins, identifying missing HPO annotations, and prioritizing candidate proteins with respect to a certain HPO term. For each topic, we first give a formal description of the problem and then systematically revisit the published literature, highlighting the advantages and disadvantages of each approach, followed by a discussion of the challenges and promising future directions. In addition, we point out several topics worthy of exploration, including the selection of negative HPO annotations and the detection of HPO misannotations. We believe that this review will help researchers in the field of computational phenotype analysis to understand existing prediction algorithms and to develop novel ones.
Affiliation(s)
- Lizhi Liu
- School of Computer Science, Fudan University, Shanghai, 200433 China
| | - Shanfeng Zhu
- Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai, 200433 China
- Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence (Fudan University), Ministry of Education, Shanghai, 200433 China
- MOE Frontiers Center for Brain Science, Fudan University, Shanghai, 200433 China
- Zhangjiang Fudan International Innovation Center, Shanghai, 200433 China
- Shanghai Key Lab of Intelligent Information Processing, Fudan University, Shanghai, 200433 China
|
18
|
Notaro M, Frasca M, Petrini A, Gliozzo J, Casiraghi E, Robinson PN, Valentini G. HEMDAG: a family of modular and scalable hierarchical ensemble methods to improve Gene Ontology term prediction. Bioinformatics 2021; 37:4526-4533. [PMID: 34240108 DOI: 10.1093/bioinformatics/btab485] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/29/2021] [Revised: 06/15/2021] [Accepted: 07/04/2021] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION Automated protein function prediction is a complex multi-class, multi-label, structured classification problem in which protein functions are organized in a controlled vocabulary, according to the Gene Ontology (GO). "Hierarchy-unaware" classifiers, also known as "flat" methods, predict GO terms without exploiting the inherent structure of the ontology, potentially violating the True-Path-Rule (TPR) that governs the GO, while "hierarchy-aware" approaches, even if they obey the TPR, do not always show clear improvements with respect to flat methods, or do not scale well when applied to the full GO. RESULTS To overcome these limitations, we propose Hierarchical Ensemble Methods for Directed Acyclic Graphs (HEMDAG), a family of highly modular hierarchical ensembles of classifiers, able to build upon any flat method and to provide "TPR-safe" predictions, by leveraging a combination of isotonic regression and TPR learning strategies. Extensive experiments on synthetic and real data across several organisms firstly show that HEMDAG can be used as a general tool to improve the predictions of flat classifiers, and secondly that HEMDAG is competitive versus state-of-the-art hierarchy-aware learning methods proposed in the last CAFA international challenges. AVAILABILITY Fully-tested R code freely available at https://anaconda.org/bioconda/r-hemdag. Tutorial and documentation at https://hemdag.readthedocs.io. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
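To make the True-Path-Rule concrete: a prediction set is TPR-consistent when no term scores higher than any of its ancestors in the ontology DAG. The sketch below shows one simple bottom-up correction that raises each ancestor to the maximum score among its descendants. It is a deliberately simplified heuristic for illustration and is not the isotonic regression or TPR learning strategy that HEMDAG itself uses; the function names and the mini-ontology are hypothetical.

```python
def enforce_true_path_rule(scores, parents):
    """Raise each ancestor's score to at least the max of its descendants.

    scores:  dict term -> flat classifier score in [0, 1]
    parents: dict term -> list of parent terms (the ontology DAG)
    """
    children = {term: [] for term in scores}
    for term, term_parents in parents.items():
        for parent in term_parents:
            children[parent].append(term)

    corrected = {}

    def visit(term):
        # Memoised post-order traversal: children are resolved before parents.
        if term not in corrected:
            below = [visit(child) for child in children[term]]
            corrected[term] = max([scores[term]] + below)
        return corrected[term]

    for term in scores:
        visit(term)
    return corrected


def violations(scores, parents):
    """List (term, parent) edges where the term outscores its parent."""
    return [(t, p) for t, ps in parents.items() for p in ps if scores[t] > scores[p]]


# Hypothetical mini-ontology: root -> a -> c and root -> b -> c.
toy_parents = {"root": [], "a": ["root"], "b": ["root"], "c": ["a", "b"]}
flat_scores = {"root": 0.6, "a": 0.3, "b": 0.7, "c": 0.9}
tpr_scores = enforce_true_path_rule(flat_scores, toy_parents)
```

In this toy case the flat scores violate the rule on three edges, while the corrected scores violate it on none; the price of this naive fix is that one confident leaf prediction inflates every ancestor.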
Affiliation(s)
- Marco Notaro
- AnacletoLab-Dipartimento di Informatica, Università degli Studi di Milano, Via Celoria 18, Milano, 20133, Italy
| | - Marco Frasca
- AnacletoLab-Dipartimento di Informatica, Università degli Studi di Milano, Via Celoria 18, Milano, 20133, Italy
| | - Alessandro Petrini
- AnacletoLab-Dipartimento di Informatica, Università degli Studi di Milano, Via Celoria 18, Milano, 20133, Italy
| | - Jessica Gliozzo
- AnacletoLab-Dipartimento di Informatica, Università degli Studi di Milano, Via Celoria 18, Milano, 20133, Italy
| | - Elena Casiraghi
- AnacletoLab-Dipartimento di Informatica, Università degli Studi di Milano, Via Celoria 18, Milano, 20133, Italy
| | - Peter N Robinson
- The Jackson Laboratory for Genomic Medicine, Farmington, CT, 06032, US
| | - Giorgio Valentini
- AnacletoLab-Dipartimento di Informatica, Università degli Studi di Milano, Via Celoria 18, Milano, 20133, Italy; CINI, National Laboratory in Artificial Intelligence and Intelligent Systems-AIIS, Roma, Italy; Data Science Research Center, Università degli Studi di Milano, Milano, 20133, Italy
|
19
|
Jin X, Liao Q, Liu B. S2L-PSIBLAST: a supervised two-layer search framework based on PSI-BLAST for protein remote homology detection. Bioinformatics 2021; 37:4321-4327. [PMID: 34170287 DOI: 10.1093/bioinformatics/btab472] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 12/21/2020] [Revised: 05/29/2021] [Accepted: 06/24/2021] [Indexed: 01/26/2023] Open
Abstract
MOTIVATION Protein remote homology detection is a challenging task in the study of protein evolutionary relationships. PSI-BLAST is an important and fundamental search method for detecting homologous proteins. Although many improved versions of PSI-BLAST have been proposed, their performance is limited by the search processes of PSI-BLAST. RESULTS To further improve the performance of PSI-BLAST for protein remote homology detection, a supervised two-layer search framework based on PSI-BLAST (S2L-PSIBLAST) is proposed. S2L-PSIBLAST consists of a two-level search: the first-level search provides high-quality search results by using the SMI-BLAST framework and a double-link strategy to filter out non-homologous protein sequences; the second-level search detects more homologous proteins by profile-link similarity, and a learning to rank strategy then produces more accurate ranking lists for the detected protein sequences. Experimental results on the updated version of the Structural Classification of Proteins-extended benchmark dataset show that S2L-PSIBLAST not only clearly improves on the performance of PSI-BLAST but also outperforms two improved versions of PSI-BLAST: DELTA-BLAST and PSI-BLASTexB. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Xiaopeng Jin
- School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Guangdong 518055, China
| | - Qing Liao
- School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Guangdong 518055, China
| | - Bin Liu
- School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Guangdong 518055, China; School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China; Advanced Research Institute of Multidisciplinary Science, Beijing Institute of Technology, Beijing, China
|
20
|
Wei H, Xu Y, Liu B. iCircDA-LTR: identification of circRNA-disease associations based on Learning to Rank. Bioinformatics 2021; 37:3302-3310. [PMID: 33963827 DOI: 10.1093/bioinformatics/btab334] [Citation(s) in RCA: 16] [Impact Index Per Article: 4.0] [Received: 01/18/2021] [Revised: 03/23/2021] [Accepted: 05/04/2021] [Indexed: 12/18/2022] Open
Abstract
MOTIVATION Due to their inherent stability and close relationship with the progression of diseases, circRNAs serve as important biomarkers and drug targets. Efficient predictors for identifying circRNA-disease associations are in high demand. Existing predictors treat circRNA-disease association prediction as a classification task or a recommendation problem, and so fail to capture the ranking information among the associations and to detect the diseases associated with new circRNAs. However, more and more circRNAs are being discovered, and identifying the diseases associated with these new circRNAs remains a challenging task. RESULTS In this study, we proposed a new predictor called iCircDA-LTR for circRNA-disease association prediction. Unlike existing predictors, iCircDA-LTR employed a ranking framework to model the global ranking associations among the query circRNAs and the diseases. The Learning to Rank (LTR) algorithm was employed to rank the associations based on various predictors and features in a supervised manner. The experimental results on two independent test datasets showed that iCircDA-LTR outperformed the other competing methods, especially for predicting the diseases associated with new circRNAs. As a result, iCircDA-LTR is more suitable for real-world applications. AVAILABILITY To help researchers detect new circRNA-disease associations, the web server of iCircDA-LTR was established and is freely available at http://bliulab.net/iCircDA-LTR/.
Affiliation(s)
- Hang Wei
- School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Guangdong 518055, China
| | - Yong Xu
- School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Guangdong 518055, China
| | - Bin Liu
- School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Guangdong 518055, China; School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China; Advanced Research Institute of Multidisciplinary Science, Beijing Institute of Technology, Beijing, China
|
21
|
Liu L, Mamitsuka H, Zhu S. HPOFiller: identifying missing protein-phenotype associations by graph convolutional network. Bioinformatics 2021; 37:3328-3336. [PMID: 33822886 DOI: 10.1093/bioinformatics/btab224] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 10/14/2020] [Revised: 02/20/2021] [Accepted: 04/05/2021] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION Exploring the relationship between human proteins and abnormal phenotypes is of great importance in the prevention, diagnosis and treatment of diseases. The Human Phenotype Ontology (HPO) is a standardized vocabulary that describes the phenotype abnormalities encountered in human diseases. However, the current HPO annotations of proteins are not complete. Thus, it is important to identify missing protein-phenotype associations. RESULTS We propose HPOFiller, a graph convolutional network (GCN)-based approach, for predicting missing HPO annotations. HPOFiller has two key GCN components for capturing embeddings from complex network structures: 1) S-GCN, applied to both the protein-protein interaction (PPI) network and the HPO semantic similarity network to utilize network weights; 2) Bi-GCN, applied to the protein-phenotype bipartite graph to conduct message passing between proteins and phenotypes. The core idea of HPOFiller is to run these two GCN modules repeatedly and consecutively over the three networks to refine the embeddings. Empirical results from an extremely stringent evaluation that avoids potential information leakage, including cross-validation and temporal validation, demonstrate that HPOFiller significantly outperforms all other state-of-the-art methods. In particular, the ablation study shows that batch normalization contributes the most to the performance. Further examination offers literature evidence for highly ranked predictions. Finally, using known disease-HPO term associations, HPOFiller could suggest promising, unknown disease-gene associations, presenting possible genetic causes of human disorders. AVAILABILITY https://github.com/liulizhi1996/HPOFiller. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
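As a rough illustration of how a GCN layer can use edge weights, the sketch below implements the widely used propagation rule H' = ReLU(D^(-1/2) (A + I) D^(-1/2) H W) on a small weighted graph, where A holds edge weights and D the resulting degrees. This is a generic single layer under stated assumptions, not HPOFiller's S-GCN or Bi-GCN; the function names, the toy graph, and the weight values are hypothetical.

```python
import math


def normalize_adjacency(adj):
    """Symmetrically normalise a weighted adjacency matrix with self-loops."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    degree = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(degree[i] * degree[j]) for j in range(n)]
            for i in range(n)]


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def gcn_layer(adj, node_features, layer_weights):
    """One propagation step: aggregate neighbours, transform, apply ReLU."""
    aggregated = matmul(normalize_adjacency(adj), node_features)
    transformed = matmul(aggregated, layer_weights)
    return [[max(0.0, value) for value in row] for row in transformed]


# Hypothetical toy graph: 3 proteins linked by weighted interaction edges.
toy_adj = [[0.0, 0.8, 0.0],
           [0.8, 0.0, 0.5],
           [0.0, 0.5, 0.0]]
toy_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # 2 made-up features per node
toy_weights = [[0.5, -0.2], [0.1, 0.3]]               # untrained, illustrative values
toy_embeddings = gcn_layer(toy_adj, toy_features, toy_weights)
```

Stacking several such layers lets each node's embedding draw on increasingly distant neighbours, and heavier edges contribute more to the aggregation than lighter ones.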
Affiliation(s)
- Lizhi Liu
- School of Computer Science, Fudan University, Shanghai, 200433, China
| | - Hiroshi Mamitsuka
- Bioinformatics Center, Institute for Chemical Research, Kyoto University, Uji, Kyoto Prefecture, Japan; Department of Computer Science, Aalto University, Espoo, Finland
| | - Shanfeng Zhu
- Institute of Science and Technology for Brain-Inspired Intelligence and Shanghai Institute of Artificial Intelligence Algorithms, Fudan University, Shanghai, 200433, China; Ministry of Education, Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence (Fudan University), China; Shanghai Key Lab of Intelligent Information Processing, Fudan University, Shanghai, 200433, China
|