1
|
Wang L, You ZH, Huang DS, Li JQ. MGRCDA: Metagraph Recommendation Method for Predicting CircRNA-Disease Association. IEEE Transactions on Cybernetics 2023; 53:67-75. [PMID: 34236991 DOI: 10.1109/tcyb.2021.3090756] [Citation(s) in RCA: 10] [Impact Index Per Article: 10.0] [Indexed: 06/13/2023]
Abstract
Accumulating clinical evidence suggests that circRNAs can be novel therapeutic targets for various diseases and play a critical role in human health. However, because the mechanisms of circRNA are complex, it is difficult to explore the relationship between disease and circRNA quickly and on a large scale through wet-lab experiments. In this work, we design a new computational model, MGRCDA, based on metagraph recommendation theory to predict potential circRNA-disease associations. Specifically, we first treat circRNA-disease association prediction as a recommender-system problem and design a series of metagraphs according to the heterogeneous biological networks; we then extract the semantic information of diseases and the Gaussian interaction profile kernel (GIPK) similarity of circRNAs and diseases as network attributes; finally, the iterative search of the metagraph recommendation algorithm is used to calculate the score of each circRNA-disease pair. On the gold-standard dataset CircR2Disease, MGRCDA achieved a prediction accuracy of 92.49% with an area under the ROC curve of 0.9298, which is significantly higher than other state-of-the-art models. Furthermore, among the top 30 disease-related circRNAs recommended by the model, 25 have been verified by recently published literature. The experimental results indicate that MGRCDA is feasible and efficient, and that it can recommend reliable candidates for further wet-lab experiments and narrow their scope.
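The abstract names Gaussian interaction profile kernel (GIPK) similarity but does not give its formula. The following is a minimal sketch of how such a similarity is commonly computed from a binary association matrix, not the authors' implementation. The function name `gip_kernel`, the parameter `gamma_prime`, and the bandwidth normalisation (dividing by the mean squared norm of the interaction profiles) are assumptions drawn from common usage of this kernel.

```python
import numpy as np


def gip_kernel(assoc, gamma_prime=1.0):
    """Gaussian interaction profile kernel between the rows of `assoc`.

    assoc: 2-D array, rows = entities (e.g. circRNAs), columns = partners
           (e.g. diseases); entry 1 marks a known association, 0 otherwise.
    gamma_prime: hypothetical bandwidth setting (assumed default of 1.0).
    """
    assoc = np.asarray(assoc, dtype=float)
    sq_norms = np.sum(assoc ** 2, axis=1)
    mean_sq_norm = sq_norms.mean()
    if mean_sq_norm == 0:
        raise ValueError("association matrix contains no known associations")
    # Assumed normalisation: scale the bandwidth by the mean squared profile norm.
    gamma = gamma_prime / mean_sq_norm
    # Pairwise squared Euclidean distances between interaction profiles.
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * assoc @ assoc.T
    sq_dists = np.maximum(sq_dists, 0.0)  # guard against tiny negative round-off
    return np.exp(-gamma * sq_dists)


# Toy association matrix (made-up values): 3 circRNAs x 3 diseases.
toy = np.array([[1, 0, 1],
                [1, 0, 1],
                [0, 1, 0]])
circ_similarity = gip_kernel(toy)       # circRNA-circRNA similarity
disease_similarity = gip_kernel(toy.T)  # disease-disease similarity
```

Calling the function on the transposed matrix gives the disease-side kernel, so one routine covers both similarity matrices mentioned in the abstract.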
|
2
|
Nguyen QH, Ngo HH, Nguyen-Vo TH, Do TT, Rahardja S, Nguyen BP. eMIC-AntiKP: Estimating minimum inhibitory concentrations of antibiotics towards Klebsiella pneumoniae using deep learning. Comput Struct Biotechnol J 2022; 21:751-757. [PMID: 36659924 PMCID: PMC9827358 DOI: 10.1016/j.csbj.2022.12.041] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 04/06/2022] [Revised: 12/22/2022] [Accepted: 12/23/2022] [Indexed: 12/27/2022] Open
Abstract
Antibiotic resistance has become one of the most concerning problems directly affecting patient recovery. For years, numerous efforts have been made to use antimicrobial drugs efficiently, at appropriate doses, not only to exterminate microbes but also to stringently limit any chance of bacterial evolution. However, choosing proper antibiotics is neither straightforward nor time-effective, because well-defined drugs can only be given to patients after determining the microbial taxonomy and evaluating minimum inhibitory concentrations (MICs). Besides conventional methods, numerous computer-aided frameworks have recently been developed using computational advances and public data sources of clinical antimicrobial resistance. In this study, we introduce eMIC-AntiKP, a computational framework specifically designed to predict the MIC values of 20 antibiotics towards Klebsiella pneumoniae. Our prediction models were constructed using convolutional neural networks and k-mer counting-based features. The model for cefepime has the most limited performance, with a test 1-tier accuracy of 0.49, while the model for ampicillin has the highest performance, with a test 1-tier accuracy of 1.00. Most models have satisfactory performance, with test accuracies ranging from about 0.70 to 0.90. The significance of eMIC-AntiKP lies in its effective use of computing resources, which makes it a compact and portable tool for most moderately configured computers. We provide users with two options: an online web server for basic analysis and an offline package for deeper analysis and technical modification.
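The abstract mentions k-mer counting-based features without describing how they are built. As a rough illustration only, a k-mer count vector for a nucleotide sequence can be sketched as below. The function name `kmer_count_vector`, the choice of k, the ACGT alphabet, and the decision to skip windows containing other characters are all assumptions, not details taken from the paper.

```python
from collections import Counter
from itertools import product


def kmer_count_vector(sequence, k=3, alphabet="ACGT"):
    """Return counts of every possible k-mer over `alphabet`, in lexicographic order.

    Windows containing characters outside `alphabet` (e.g. 'N') are ignored,
    because only k-mers built from `alphabet` are looked up.
    """
    sequence = sequence.upper()
    # Slide a window of width k across the sequence and tally each substring.
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    return [counts.get("".join(kmer), 0) for kmer in product(alphabet, repeat=k)]


# Toy sequence (made up); a real input would be an assembled genome or contigs.
features = kmer_count_vector("ACGTACGTTAGC", k=2)
```

The resulting fixed-length vector (4^k entries for DNA) is the kind of input that could then be passed to a downstream model such as a convolutional neural network.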
Affiliation(s)
- Quang H. Nguyen
  - School of Information and Communication Technology, Hanoi University of Science and Technology, Hanoi 100000, Viet Nam
- Hoang H. Ngo
  - School of Information and Communication Technology, Hanoi University of Science and Technology, Hanoi 100000, Viet Nam
- Thanh-Hoang Nguyen-Vo
  - School of Mathematics and Statistics, Victoria University of Wellington, Wellington 6140, New Zealand
- Trang T.T. Do
  - School of Innovation, Design and Technology, Wellington Institute of Technology, Lower Hutt 5012, New Zealand
- Susanto Rahardja
  - School of Marine Science and Technology, Northwestern Polytechnical University, Xi’an 710072, China
  - Infocomm Technology Cluster, Singapore Institute of Technology, Singapore 138683, Singapore
  - Corresponding author at: School of Marine Science and Technology, Northwestern Polytechnical University, Xi’an 710072, China
- Binh P. Nguyen
  - School of Mathematics and Statistics, Victoria University of Wellington, Wellington 6140, New Zealand
  - Corresponding author
|
3
|
Wang L, You ZH, Li JQ, Huang YA. IMS-CDA: Prediction of CircRNA-Disease Associations From the Integration of Multisource Similarity Information With Deep Stacked Autoencoder Model. IEEE Transactions on Cybernetics 2021; 51:5522-5531. [PMID: 33027025 DOI: 10.1109/tcyb.2020.3022852] [Citation(s) in RCA: 15] [Impact Index Per Article: 5.0] [Indexed: 06/11/2023]
Abstract
Emerging evidence indicates that circular RNA (circRNA) plays an indispensable role in the pathogenesis of complex human diseases and in many critical biological processes. Using circRNA as a molecular marker or therapeutic target opens up a new avenue for the treatment and detection of complex human diseases. Traditional biological experiments, however, are usually small in scale and time-consuming, so effective and feasible computational approaches for predicting circRNA-disease associations are increasingly favored. In this study, we propose a new computational method, called IMS-CDA, to predict potential circRNA-disease associations based on multisource biological information. More specifically, IMS-CDA combines disease semantic similarity with the Jaccard and Gaussian interaction profile kernel similarities of diseases and circRNAs, and extracts hidden features using the stacked autoencoder (SAE) deep learning algorithm. After training with the rotation forest (RF) classifier, IMS-CDA achieves an area under the ROC curve of 88.08%, with 88.36% accuracy and 91.38% sensitivity, on the CircR2Disease dataset. Compared with the state-of-the-art support vector machine and K-nearest neighbor models and with different descriptor models, IMS-CDA achieves the best overall performance. In the case studies, eight of the top 15 circRNA-disease associations with the highest prediction scores were confirmed by recent literature. These results indicate that IMS-CDA has an outstanding ability to predict new circRNA-disease associations and can provide reliable candidates for biological experiments.
|
4
|
Wang L, You ZH, Zhou X, Yan X, Li HY, Huang YA. NMFCDA: Combining randomization-based neural network with non-negative matrix factorization for predicting CircRNA-disease association. Appl Soft Comput 2021. [DOI: 10.1016/j.asoc.2021.107629] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 10/21/2022]
|
5
|
The Identification of the SARS-CoV-2 Whole Genome: Nine Cases Among Patients in Banten Province, Indonesia. Journal of Pure and Applied Microbiology 2021. [DOI: 10.22207/jpam.15.2.52] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/20/2022] Open
Abstract
Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is the strain of virus that causes coronavirus disease 2019 (COVID-19), the respiratory illness responsible for the current pandemic. Viral genome sequencing has been widely applied during outbreaks to study the relatedness of this virus to other viruses, its transmission mode, pace, evolution and geographical spread, and also its adaptation to human hosts. To date, more than 90,000 SARS-CoV-2 genome sequences have been uploaded to the GISAID database. The availability of sequencing data along with clinical and geographical data may be useful for epidemiological investigations. In this study, we aimed to analyse the genetic background of SARS-CoV-2 from patients in Indonesia by whole genome sequencing. We examined nine samples from COVID-19 patients with an RT-PCR cycle threshold (Ct) of less than 25, using ARTIC Network protocols for Oxford Nanopore’s GridION sequencer. The analytical methods were based on the ARTIC multiplex PCR sequencing protocol for COVID-19. We found several genetic variants within the nine COVID-19 patient samples. We identified a mutation at position 614 P323L mutation in the ORF1ab gene often found in our severe patient samples. The number of SNPs and their locations within the SARS-CoV-2 genome seem to vary. This diversity might be responsible for the virulence of the virus and its clinical manifestation.
|
6
|
Wang L, Yan X, You ZH, Zhou X, Li HY, Huang YA. SGANRDA: semi-supervised generative adversarial networks for predicting circRNA-disease associations. Brief Bioinform 2021; 22:6175330. [PMID: 33734296 DOI: 10.1093/bib/bbab028] [Citation(s) in RCA: 26] [Impact Index Per Article: 8.7] [Received: 11/09/2020] [Revised: 01/18/2021] [Accepted: 01/19/2021] [Indexed: 12/31/2022] Open
Abstract
Emerging research shows that circular RNA (circRNA) plays a crucial role in the diagnosis, occurrence and prognosis of complex human diseases. Compared with traditional biological experiments, computational methods that fuse multi-source biological data to identify associations between circRNAs and diseases can effectively reduce cost and save time. Considering the limitations of existing computational models, we propose a semi-supervised generative adversarial network (GAN) model, SGANRDA, for predicting circRNA-disease associations. The model first fuses the natural language features of the circRNA sequence with the features of disease semantics and the circRNA and disease Gaussian interaction profile kernels, then uses all circRNA-disease pairs to pre-train the GAN and fine-tunes the network parameters with labeled samples. Finally, an extreme learning machine classifier is employed to obtain the prediction result. Compared with previous supervised models, SGANRDA introduces circRNA sequences and utilizes the information of all circRNA-disease pairs during pre-training. This step can increase the information content of the features to some extent and reduce the impact of the small number of known associations on model performance. SGANRDA obtained AUC scores of 0.9411 and 0.9223 in leave-one-out cross-validation and 5-fold cross-validation, respectively. Prediction results on the benchmark dataset show that SGANRDA outperforms other existing models. In addition, 25 of the top 30 circRNA-disease pairs with the highest SGANRDA scores in case studies were verified by recent literature. These experimental results demonstrate that SGANRDA is a useful model for predicting circRNA-disease associations and can provide reliable candidates for biological experiments.
Affiliation(s)
- Lei Wang
  - Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Urumqi, China
- Xin Yan
  - School of Computer Science and Technology, China University of Mining and Technology, Xuzhou, 221116, China
- Zhu-Hong You
  - Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Urumqi, China
- Xi Zhou
  - Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Urumqi, China
- Hao-Yuan Li
  - School of Computer Science and Technology, China University of Mining and Technology, Xuzhou, 221116, China
- Yu-An Huang
  - Department of Computing, Hong Kong Polytechnic University, Hong Kong, China
|
7
|
Wang L, You ZH, Li YM, Zheng K, Huang YA. GCNCDA: A new method for predicting circRNA-disease associations based on Graph Convolutional Network Algorithm. PLoS Comput Biol 2020; 16:e1007568. [PMID: 32433655 PMCID: PMC7266350 DOI: 10.1371/journal.pcbi.1007568] [Citation(s) in RCA: 60] [Impact Index Per Article: 15.0] [Received: 11/22/2019] [Revised: 06/02/2020] [Accepted: 03/23/2020] [Indexed: 01/22/2023] Open
Abstract
Numerous lines of evidence indicate that circular RNAs (circRNAs) are widely involved in the occurrence and development of diseases. Identifying associations between circRNAs and diseases plays a crucial role in exploring the pathogenesis of complex diseases and improving their diagnosis and treatment. However, because of the complex mechanisms linking circRNAs and diseases, discovering new circRNA-disease associations by biological experiment is expensive and time-consuming. There is therefore an increasingly urgent need for computational methods that predict novel circRNA-disease associations. In this study, we propose a computational method called GCNCDA, based on the deep learning algorithm Fast learning with Graph Convolutional Networks (FastGCN), to predict potential disease-associated circRNAs. Specifically, the method first forms a unified descriptor by fusing disease semantic similarity information with disease and circRNA Gaussian Interaction Profile (GIP) kernel similarity information derived from known circRNA-disease associations. The FastGCN algorithm is then used to extract the high-level features contained in the fused descriptor. Finally, new circRNA-disease associations are predicted by the Forest by Penalizing Attributes (Forest PA) classifier. In 5-fold cross-validation on the CircR2Disease benchmark dataset, GCNCDA achieved 91.2% accuracy and 92.78% sensitivity, with an AUC of 90.90%. In comparison with different classifier models, feature extraction models and other state-of-the-art methods, GCNCDA is strongly competitive. Furthermore, we conducted case studies on breast cancer, glioma and colorectal cancer. The results showed that 16, 15 and 17 of the top 20 candidate circRNAs with the highest prediction scores, respectively, were confirmed by relevant literature and databases. These results suggest that GCNCDA can effectively predict potential circRNA-disease associations and provide highly credible candidates for biological experiments.
Author summary
Recognizing circRNA-disease associations is key to disease diagnosis and treatment, and is of great significance for exploring the pathogenesis of complex diseases. Computational methods can predict potential disease-related circRNAs quickly and accurately. Based on the hypothesis that circRNAs with similar functions tend to be associated with similar diseases, the GCNCDA model is proposed to predict potential associations between circRNAs and diseases using the FastGCN algorithm. The performance of the model was verified by cross-validation experiments and by comparison experiments with different feature extraction algorithms and classifier models. Furthermore, 16, 15 and 17 of the top 20 candidate circRNAs with the highest prediction scores for breast cancer, glioma and colorectal cancer, respectively, were confirmed by relevant literature and databases. It is anticipated that the GCNCDA model can prioritize the most promising circRNA-disease associations on a large scale to provide reliable candidates for further biological experiments.
Affiliation(s)
- Lei Wang
  - College of Information Science and Engineering, Zaozhuang University, Zaozhuang, China
  - Xinjiang Technical Institutes of Physics and Chemistry, Chinese Academy of Sciences, Urumqi, China
  - Corresponding author
- Zhu-Hong You
  - Xinjiang Technical Institutes of Physics and Chemistry, Chinese Academy of Sciences, Urumqi, China
  - Corresponding author
- Yang-Ming Li
  - Department of Electrical Computer and Telecommunications Engineering Technology, Rochester Institute of Technology, Rochester, United States of America
- Kai Zheng
  - School of Computer Science and Technology, China University of Mining and Technology, Xuzhou, China
- Yu-An Huang
  - Department of Computing, Hong Kong Polytechnic University, Hong Kong, China
|
8
|
Wang L, You ZH, Huang YA, Huang DS, Chan KCC. An efficient approach based on multi-sources information to predict circRNA–disease associations using deep convolutional neural network. Bioinformatics 2019; 36:4038-4046. [DOI: 10.1093/bioinformatics/btz825] [Citation(s) in RCA: 64] [Impact Index Per Article: 12.8] [Received: 02/09/2019] [Revised: 10/07/2019] [Accepted: 11/21/2019] [Indexed: 12/16/2022] Open
Abstract
Motivation
Emerging evidence indicates that circular RNA (circRNA) plays a crucial role in human disease. Using circRNA as a biomarker offers a new perspective on diagnosing diseases and understanding disease pathogenesis. However, detecting circRNA–disease associations by biological experiments alone is often blind, small in scale, costly and time-consuming. Therefore, there is an urgent need for reliable computational methods that rapidly infer potential circRNA–disease associations on a large scale and provide the most promising candidates for biological experiments.
Results
In this article, we propose an efficient computational method that combines multi-source information with a deep convolutional neural network (CNN) to predict circRNA–disease associations. The method first fuses multi-source information, including disease semantic similarity, disease Gaussian interaction profile kernel similarity and circRNA Gaussian interaction profile kernel similarity, then extracts hidden deep features through the CNN, and finally sends them to an extreme learning machine classifier for prediction. The 5-fold cross-validation results show that the proposed method achieves 87.21% prediction accuracy and 88.50% sensitivity, with an area under the curve of 86.67%, on the CircR2Disease dataset. In comparison with the state-of-the-art SVM classifier and other feature extraction methods on the same dataset, the proposed model achieves the best results. In addition, we sought experimental support for the prediction results by searching published literature: 7 of the top 15 circRNA–disease pairs with the highest scores were confirmed. These results demonstrate that the proposed model is a suitable method for predicting circRNA–disease associations and can provide reliable candidates for biological experiments.
Availability and implementation
The source code and datasets explored in this work are available at https://github.com/look0012/circRNA-Disease-association.
Supplementary information
Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Lei Wang
  - Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Urumqi 830011, China
- Zhu-Hong You
  - Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Urumqi 830011, China
- Yu-An Huang
  - Department of Computing, Hong Kong Polytechnic University, Hong Kong 999077, China
- De-Shuang Huang
  - Institute of Machine Learning and Systems Biology, School of Electronics and Information Engineering, Tongji University, Shanghai 201804, China
- Keith C C Chan
  - Department of Computing, Hong Kong Polytechnic University, Hong Kong 999077, China
|
9
|
Alanazi IO, Al Shehri ZS, Ebrahimie E, Giahi H, Mohammadi-Dehcheshmeh M. Non-coding and coding genomic variants distinguish prostate cancer, castration-resistant prostate cancer, familial prostate cancer, and metastatic castration-resistant prostate cancer from each other. Mol Carcinog 2019; 58:862-874. [PMID: 30644608 DOI: 10.1002/mc.22975] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 07/13/2018] [Revised: 01/07/2019] [Accepted: 01/08/2019] [Indexed: 12/11/2022]
Abstract
A considerable number of deposited variants has provided new possibilities for knowledge discovery in different types of prostate cancer. Here, we analyzed variants located on 3'UTR, 5'UTR, CDs, Intergenic, and Intronic regions in castration-resistant prostate cancer (8496 variants), familial prostate cancer (3241 variants), metastatic castration-resistant prostate cancer (3693 variants), and prostate cancer (16599 variants). Chromosome regions 10p15-p14 and 2p13 were highly enriched (P < 0.00001) for variants located in 3'UTR, 5'UTR, CDs, intergenic, and intronic regions in castration-resistant prostate cancer. In contrast, 10p15-p14, 10q23.3, 12q13.11, 13q12.3, 1q25, and 8p22 regions were enriched (P < 0.001) in familial prostate cancer. In metastatic castration-resistant prostate cancer, 10p15-p14, 10q23.3, 11q22-q23, 14q21.1, and 14q32.13 were highly variant regions (P < 0.001). Chromosome 2 and chromosome 1 hosted many enriched variant regions. AKR1C3, BRCA1, BRCA2, CHGA, CYP19A1, HOXB13, KLK3, and PTEN contained the highest number of 3'UTR, 5'UTR, CDs, Intergenic, and Intronic variants. Network analysis showed that these genes are upstream of important functions including prostate gland development, tumor recurrence, prostate cancer-specific survival, tumor progression, cancer mortality, long-term survival, cancer recurrence, angiogenesis, and AR. Interestingly, all of EGFR, JAK2, NR3C1, PDZD2, and SEMA3C genes had single nucleotide polymorphisms (SNP) in castration-resistant prostate cancer, consistent with high selection pressure on these genes during drug treatment and consequent resistance. High occurrence of variants in 3'UTRs suggests the importance of regulatory variants in different types of prostate cancer; an area that has been neglected compared with coding variants. This study provides a comprehensive overview of genomic regions contributing to different types of prostate cancer.
Affiliation(s)
- Ibrahim O Alanazi
  - National Center for Biotechnology, Life Science and Environment Research Institute, King Abdulaziz City for Science and Technology (KACST), Riyadh, Saudi Arabia
- Zafer S Al Shehri
  - Clinical Laboratory Department, College of Applied Medical Sciences, Shaqra University, KSA, Al dawadmi, Saudi Arabia
- Esmaeil Ebrahimie
  - Adelaide Medical School, Faculty of Health and Medical Sciences, The University of Adelaide, Adelaide, South Australia, Australia
  - School of Information Technology and Mathematical Sciences, Division of Information Technology, Engineering and the Environment, The University of South Australia, Adelaide, SA, Australia
  - Institute of Biotechnology, Shiraz University, Shiraz, Iran
  - Faculty of Science and Engineering, School of Biological Sciences, Flinders University, Adelaide, SA, Australia
- Hassan Giahi
  - Institute of Biotechnology, Shiraz University, Shiraz, Iran
- Manijeh Mohammadi-Dehcheshmeh
  - Australian Centre for Antimicrobial Resistance Ecology, School of Animal and Veterinary Sciences, The University of Adelaide, South Australia, Australia
|
10
|
The rs13388259 Intergenic Polymorphism in the Genomic Context of the BCYRN1 Gene Is Associated with Parkinson's Disease in the Hungarian Population. Parkinson's Disease 2018; 2018:9351598. [PMID: 29850016 PMCID: PMC5903343 DOI: 10.1155/2018/9351598] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Received: 12/18/2017] [Accepted: 03/12/2018] [Indexed: 11/17/2022]
Abstract
Parkinson's disease (PD) is a common neurodegenerative disorder characterized by bradykinesia, resting tremor, and muscle rigidity. To date, approximately 50 genes have been implicated in PD pathogenesis, including both Mendelian genes with rare mutations and low-penetrance genes with common polymorphisms. Previous studies of low-penetrance genes focused on protein-coding genes, and less attention was given to long noncoding RNAs (lncRNAs). In this study, we aimed to investigate the susceptibility roles of lncRNA gene polymorphisms in the development of PD. To this end, polymorphisms (n = 15) of the PINK1-AS, UCHL1-AS, BCYRN1, SOX2-OT, ANRIL and HAR1A lncRNA genes were genotyped in Hungarian PD patients (n = 160) and age- and sex-matched controls (n = 167). The rare allele of the rs13388259 intergenic polymorphism, located downstream of the BCYRN1 gene, was significantly more frequent among PD patients than control individuals (OR = 2.31; p = 0.0015). In silico prediction suggested that this polymorphism is located in a noncoding region close to the binding site of the transcription factor HNF4A, which is a central regulatory hub gene that has been shown to be upregulated in the peripheral blood of PD patients. The rs13388259 polymorphism may interfere with the binding affinity of transcription factor HNF4A, potentially resulting in abnormal expression of target genes, such as BCYRN1.
|
11
|
Boudellioua I, Mahamad Razali RB, Kulmanov M, Hashish Y, Bajic VB, Goncalves-Serra E, Schoenmakers N, Gkoutos GV, Schofield PN, Hoehndorf R. Semantic prioritization of novel causative genomic variants. PLoS Comput Biol 2017; 13:e1005500. [PMID: 28414800 PMCID: PMC5411092 DOI: 10.1371/journal.pcbi.1005500] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.9] [Received: 11/08/2016] [Revised: 05/01/2017] [Accepted: 04/04/2017] [Indexed: 12/14/2022] Open
Abstract
Discriminating the causative disease variant(s) for individuals with inherited or de novo mutations presents one of the main challenges faced by the clinical genetics community today. Computational approaches for variant prioritization include machine learning methods utilizing a large number of features, including molecular information, interaction networks, or phenotypes. Here, we demonstrate the PhenomeNET Variant Predictor (PVP) system that exploits semantic technologies and automated reasoning over genotype-phenotype relations to filter and prioritize variants in whole exome and whole genome sequencing datasets. We demonstrate the performance of PVP in identifying causative variants on a large number of synthetic whole exome and whole genome sequences, covering a wide range of diseases and syndromes. In a retrospective study, we further illustrate the application of PVP for the interpretation of whole exome sequencing data in patients suffering from congenital hypothyroidism. We find that PVP accurately identifies causative variants in whole exome and whole genome sequencing datasets and provides a powerful resource for the discovery of causal variants.
Author summary
We address the problem of how to distinguish which of the many thousands of DNA sequence variants carried by an individual with a rare disease is responsible for the disease phenotypes. This can help clinicians arrive at a diagnosis, and can also be instrumental in improving our understanding of the pathobiology of the disease. Many methods are currently available to help determine the causative variant, using information about evolutionary conservation and prediction of the functional consequences of the sequence variant. We have developed a novel algorithm (PVP) which augments existing strategies by using the similarity of the patient's phenotype to known phenotype-genotype data in human and model organism databases to further rank potential candidate genes. In a retrospective study, we apply PVP to the interpretation of whole exome sequencing data in patients suffering from congenital hypothyroidism, and find that PVP accurately identifies causative variants in whole exome and whole genome sequencing datasets and provides a powerful resource for the discovery of causal variants.
Affiliation(s)
- Imane Boudellioua
  - King Abdullah University of Science and Technology, Computer, Electrical & Mathematical Sciences and Engineering Division, Computational Bioscience Research Center, Thuwal, Saudi Arabia
- Rozaimi B. Mahamad Razali
  - King Abdullah University of Science and Technology, Computer, Electrical & Mathematical Sciences and Engineering Division, Computational Bioscience Research Center, Thuwal, Saudi Arabia
- Maxat Kulmanov
  - King Abdullah University of Science and Technology, Computer, Electrical & Mathematical Sciences and Engineering Division, Computational Bioscience Research Center, Thuwal, Saudi Arabia
- Yasmeen Hashish
  - King Abdullah University of Science and Technology, Computer, Electrical & Mathematical Sciences and Engineering Division, Computational Bioscience Research Center, Thuwal, Saudi Arabia
- Vladimir B. Bajic
  - King Abdullah University of Science and Technology, Computer, Electrical & Mathematical Sciences and Engineering Division, Computational Bioscience Research Center, Thuwal, Saudi Arabia
- Eva Goncalves-Serra
  - Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, United Kingdom
- Nadia Schoenmakers
  - University of Cambridge Metabolic Research Laboratories, Wellcome Trust-Medical Research Council, Institute of Metabolic Science, Addenbrooke’s Hospital, Cambridge, United Kingdom
- Georgios V. Gkoutos
  - College of Medical and Dental Sciences, Institute of Cancer and Genomic Sciences, Centre for Computational Biology, University of Birmingham, Birmingham, United Kingdom
  - Institute of Translational Medicine, University Hospitals Birmingham, NHS Foundation Trust, Birmingham, United Kingdom
  - Institute of Biological, Environmental and Rural Sciences, Aberystwyth University, Aberystwyth, United Kingdom
  - Corresponding author
- Paul N. Schofield
  - Department of Physiology, Development & Neuroscience, University of Cambridge, Cambridge, United Kingdom
  - Corresponding author
- Robert Hoehndorf
  - King Abdullah University of Science and Technology, Computer, Electrical & Mathematical Sciences and Engineering Division, Computational Bioscience Research Center, Thuwal, Saudi Arabia
  - Corresponding author
|
12
|
Singhal A, Simmons M, Lu Z. Text Mining Genotype-Phenotype Relationships from Biomedical Literature for Database Curation and Precision Medicine. PLoS Comput Biol 2016; 12:e1005017. [PMID: 27902695 PMCID: PMC5130168 DOI: 10.1371/journal.pcbi.1005017] [Citation(s) in RCA: 66] [Impact Index Per Article: 8.3] [Received: 02/09/2016] [Accepted: 06/04/2016] [Indexed: 11/23/2022] Open
Abstract
The practice of precision medicine will ultimately require databases of genes and mutations for healthcare providers to reference in order to understand the clinical implications of each patient’s genetic makeup. Although the highest quality databases require manual curation, text mining tools can facilitate the curation process, increasing accuracy, coverage, and productivity. However, to date there are no available text mining tools that offer high-accuracy performance for extracting such triplets from biomedical literature. In this paper we propose a high-performance machine learning approach to automate the extraction of disease-gene-variant triplets from biomedical literature. Our approach is unique because we identify the genes and protein products associated with each mutation from not just the local text content, but from a global context as well (from the Internet and from all literature in PubMed). Our approach also incorporates protein sequence validation and disease association using a novel text-mining-based machine learning approach. We extract disease-gene-variant triplets from all abstracts in PubMed related to a set of ten important diseases (breast cancer, prostate cancer, pancreatic cancer, lung cancer, acute myeloid leukemia, Alzheimer’s disease, hemochromatosis, age-related macular degeneration (AMD), diabetes mellitus, and cystic fibrosis). We then evaluate our approach in two ways: (1) a direct comparison with the state of the art using benchmark datasets; (2) a validation study comparing the results of our approach with entries in a popular human-curated database (UniProt) for each of the previously mentioned diseases. In the benchmark comparison, our full approach achieves a 28% improvement in F1-measure (from 0.62 to 0.79) over the state-of-the-art results. For the validation study with UniProt Knowledgebase (KB), we present a thorough analysis of the results and errors. Across all diseases, our approach returned 272 triplets (disease-gene-variant) that overlapped with entries in UniProt and 5,384 triplets without overlap in UniProt. Analysis of the overlapping triplets and of a stratified sample of the non-overlapping triplets revealed accuracies of 93% and 80% for the respective categories (cumulative accuracy, 77%). We conclude that our process represents an important and broadly applicable improvement to the state of the art for curation of disease-gene-variant relationships.
To provide personalized health care it is important to understand patients’ genomic variations and the effect these variants have in protecting or predisposing patients to disease. Several projects aim at providing this information by manually curating such genotype-phenotype relationships in organized databases using data from clinical trials and biomedical literature. However, the exponentially increasing size of biomedical literature and the limited ability of manual curators to discover the genotype-phenotype relationships “hidden” in text has led to delays in keeping such databases updated with the current findings. The result is a bottleneck in leveraging valuable information that is currently available to develop personalized health care solutions. In the past, a few computational techniques have attempted to speed up the curation efforts by using text mining techniques to automatically mine genotype-phenotype information from biomedical literature. However, such computational approaches have not been able to achieve accuracy levels sufficient to make them appealing for practical use. In this work, we present a highly accurate machine-learning-based text mining approach for mining complete genotype-phenotype relationships from biomedical literature. We test the performance of this approach on ten well-known diseases and demonstrate the validity of our approach and its potential utility for practical purposes. We are currently working towards generating genotype-phenotype relationships for all PubMed data with the goal of developing an exhaustive database of all the known diseases in life science. We believe that this work will provide very important and needed support for implementation of personalized health care using genomic data.
Affiliation(s)
- Ayush Singhal
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, Maryland, United States of America
| | - Michael Simmons
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, Maryland, United States of America
| | - Zhiyong Lu
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, Maryland, United States of America
|
13
|
Associations of Genetic Variants at Nongenic Susceptibility Loci with Breast Cancer Risk and Heterogeneity by Tumor Subtype in Southern Han Chinese Women. BIOMED RESEARCH INTERNATIONAL 2016; 2016:3065493. [PMID: 27022606 PMCID: PMC4789034 DOI: 10.1155/2016/3065493] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 09/17/2015] [Revised: 01/06/2016] [Accepted: 02/04/2016] [Indexed: 12/05/2022]
Abstract
Current understanding of cancer genomes is mainly “gene centric.” However, GWAS have identified some nongenic breast cancer susceptibility loci. Validation studies showed inconsistent results among different populations. To further explore this inconsistency and to investigate associations by intrinsic subtype (Luminal-A, Luminal-B, ER−&PR−&HER2+, and triple negative) among Southern Han Chinese women, we genotyped five nongenic polymorphisms (2q35: rs13387042, 5p12: rs981782 and rs4415084, and 8q24: rs1562430 and rs13281615) using the MassARRAY iPLEX platform in 609 patients and 882 controls. Significant associations with breast cancer were observed for rs13387042 and rs4415084 with OR (95% CI) per-allele 1.29 (1.00–1.66) and 0.83 (0.71–0.97), respectively. In subtype-specific analysis, rs13387042 (per-allele adjusted OR = 1.36, 95% CI = 1.00–1.87) and rs4415084 (per-allele adjusted OR = 0.82, 95% CI = 0.66–1.00) showed marginally significant associations with the Luminal-A subtype; however, only rs13387042 was associated with ER−&PR−&HER2+ tumors (per-allele adjusted OR = 1.55, 95% CI = 1.00–2.40), and neither was linked to the Luminal-B or triple negative subtypes. Collectively, the associations of nongenic SNPs were heterogeneous across intrinsic subtypes. Further studies with larger datasets along with intrinsic subtype categorization should explore and confirm the role of these variants in increasing breast cancer risk.
|
14
|
Hamed AA, Ayer AA, Clark EM, Irons EA, Taylor GT, Zia A. Measuring climate change on Twitter using Google’s algorithm: perception and events. INTERNATIONAL JOURNAL OF WEB INFORMATION SYSTEMS 2015. [DOI: 10.1108/ijwis-08-2015-0025] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.0] [Indexed: 11/17/2022]
Abstract
Purpose
– The purpose of this paper is to test the hypothesis that more complex and emergent hashtags can be sufficient pointers to climate change events. Human-induced climate change is one of this century’s greatest unbalancing forces to have affected our planet. Capturing public awareness of climate change on Twitter has proven to be significant. In previous research, the authors demonstrated that public awareness is prominently expressed in the form of hashtags that use more than one bigram (i.e. a climate change term). The research finding showed that this awareness is expressed by more complex terms (e.g. “climate change”). It was learned that the awareness was dominantly expressed using the hashtag #ClimateChange.
Design/methodology/approach
– The methods demonstrated here use objective computational approaches [i.e. Google’s ranking algorithm and Information Retrieval measures (e.g. TFIDF)] to detect and rank the emerging events.
Findings
– The results show clear and significant evidence for the events signaled using emergent hashtags and for how globally influential they are. The research detected Earth Day 2015, which was signaled using the hashtag #EarthDay. Clearly, this is a day that is observed worldwide.
Originality/value
– It was proven that these computational methods eliminate the subjectivity errors associated with humans and provide an inexpensive solution for event detection on Twitter. Indeed, the approach used here is also applicable to other types of event detection, beyond climate change, and to other social media platforms that support the use of hashtags (e.g. Facebook). The paper explains, in great detail, the methods and all the numerous events detected.
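The TF-IDF measure named in the methodology can be illustrated with a minimal sketch. This is not the authors' pipeline: the per-day grouping of hashtags, the sample data, and the function name `tfidf_scores` are all assumptions made for illustration, and the ranking-algorithm step is left out.

```python
import math
from collections import Counter


def tfidf_scores(documents):
    """Return one {term: tf-idf} dict per document (each a list of tokens).

    Hypothetical helper: tf is the term's share of the document and
    idf is log(N / document frequency), one common TF-IDF variant.
    """
    n_docs = len(documents)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in documents:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


# Hypothetical data: hashtags collected on three separate days.
daily_hashtags = [
    ["#climatechange", "#climatechange", "#energy"],
    ["#climatechange", "#earthday", "#earthday", "#earthday"],
    ["#climatechange", "#energy"],
]

scores = tfidf_scores(daily_hashtags)
# The highest-scoring hashtag on the second day.
top_day2 = max(scores[1], key=scores[1].get)
print(top_day2)
```

In this toy data a hashtag that appears every day receives an inverse document frequency of zero, so only a hashtag concentrated on a single day stands out, which is the intuition behind using such measures to surface emergent events.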
|
15
|
Shameer K, Tripathi LP, Kalari KR, Dudley JT, Sowdhamini R. Interpreting functional effects of coding variants: challenges in proteome-scale prediction, annotation and assessment. Brief Bioinform 2015; 17:841-62. [PMID: 26494363 DOI: 10.1093/bib/bbv084] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.9] [Received: 05/14/2015] [Indexed: 12/20/2022]
Abstract
Accurate assessment of genetic variation in human DNA sequencing studies remains a nontrivial challenge in clinical genomics and genome informatics. Ascribing functional roles and/or clinical significance to single nucleotide variants identified from a next-generation sequencing study is an important step in genome interpretation. Experimental characterization of all the observed functional variants is not yet practical; thus, the prediction of functional and/or regulatory impacts of the various mutations using in silico approaches is an important step toward the identification of functionally significant or clinically actionable variants. The relationships between genotypes and the expressed phenotypes are multilayered and biologically complex; such relationships present numerous challenges and at the same time offer various opportunities for the design of in silico variant assessment strategies. Over the past decade, many bioinformatics algorithms have been developed to predict functional consequences of single nucleotide variants in the protein coding regions. In this review, we provide an overview of the bioinformatics resources for the prediction, annotation and visualization of coding single nucleotide variants. We discuss the currently available approaches and major challenges from the perspective of protein sequence, structure, function and interactions that require consideration when interpreting the impact of putatively functional variants. We also discuss the relevance of incorporating integrated workflows for predicting the biomedical impact of the functionally important variations encoded in a genome, exome or transcriptome. Finally, we propose a framework to classify variant assessment approaches and strategies for incorporation of variant assessment within electronic health records.
|