1
Feng Y, Zhang Z, Tang J, Chen Y, Hu D, Huang X, Li F. Ferroptosis-related biomarkers for adamantinomatous craniopharyngioma treatment: conclusions from machine learning techniques. Front Endocrinol (Lausanne) 2024; 15:1362278. [PMID: 39605941 PMCID: PMC11598535 DOI: 10.3389/fendo.2024.1362278] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/28/2023] [Accepted: 10/25/2024] [Indexed: 11/29/2024] Open
Abstract
Introduction: Adamantinomatous craniopharyngioma (ACP) is difficult to cure completely and prone to recurrence after surgery. Ferroptosis, an iron-dependent form of programmed cell death, may be a critical process in ACP. The study aimed to screen for ferroptosis-related diagnostic markers in ACP to improve diagnostic accuracy. Methods: Gene expression profiles of ACP were obtained from the Gene Expression Omnibus (GEO) database. The limma package was used to identify differentially expressed genes (DEGs). The intersection of DEGs and ferroptosis-related factors was taken as the set of differentially expressed ferroptosis-related genes (DEFRGs). Enrichment analysis included Gene Ontology (GO), Kyoto Encyclopedia of Genes and Genomes (KEGG), Disease Ontology (DO), gene set enrichment analysis (GSEA), and gene set variation analysis (GSVA). Machine learning algorithms were used to screen diagnostic markers associated with ferroptosis in ACP. The levels of DEFRGs were verified in ACP patients. A nomogram was drawn to predict the relationship between key DEFRG expression and risk of disease. The disease groups were then clustered by consensus clustering analysis. Results: DEGs were screened between ACP and normal samples. Ferroptosis-related factors were obtained from the FerrDb V2 and GeneCards databases. The correlation between DEFRGs and ferroptosis markers was also confirmed. A total of 6 overlapping DEFRGs were obtained. Based on the nomogram, CASP8, KRT16, KRT19, and TP63 were protective factors for disease risk, while GOT1 and TFAP2C were risk factors. According to the screened DEFRGs, the consensus clustering matrix was differentiated, and the number of clusters was stable. CASP8, KRT16, KRT19, and TP63 were upregulated in ACP patients, while GOT1 was downregulated. CASP8, KRT16, KRT19, TP63, and GOT1 affect multiple ferroptosis marker genes. The combination of these genes might serve as a biomarker for ACP diagnosis through participation in the ferroptosis process. Discussion: Ferroptosis-related genes, including CASP8, KRT16, KRT19, TP63, and GOT1, are potential markers for ACP, which lays a theoretical foundation for ACP diagnosis.
Affiliation(s)
- Fangping Li
- Department of Endocrinology, The Seventh Affiliated Hospital of Sun Yat-sen University, Shenzhen, China
2
Hsieh AR, Tsai CY. Biomedical literature mining: graph kernel-based learning for gene-gene interaction extraction. Eur J Med Res 2024; 29:404. [PMID: 39095899 PMCID: PMC11297645 DOI: 10.1186/s40001-024-01983-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/05/2023] [Accepted: 07/17/2024] [Indexed: 08/04/2024] Open
Abstract
Supervised machine learning is often used for biomedical relationship extraction. Its disadvantage is that manually building an annotated dataset requires much time and money. With distant supervision, a knowledge base is combined with the corpus so that the training corpus can be annotated automatically. Because many biomedical databases provide knowledge bases while annotated corpora remain scarce, this method is practical in biomedicine. The clinical significance of each patient's genetic makeup can be understood based on the healthcare provider's genetic database. Unfortunately, few previous biomedical relationship extraction studies have focused on gene-gene interaction. The main purpose of this study is to develop extraction methods for gene-gene interactions that can help explain the heritability of complex human diseases. This study used the gene-gene interaction information in the KEGG PATHWAY database, generated the training sample set from PubMed abstracts, and applied a graph kernel method to extract gene-gene interactions. The best assessment result was an F1-score of 0.79. The distant supervision method developed here automatically finds sentences in the corpus for extracting gene-gene interactions without manual labeling, which can effectively reduce the time cost of manual annotation; moreover, the graph kernel-based relationship extraction method can be successfully applied to extract gene-gene interactions. The results of this study are therefore expected to help achieve precision medicine.
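The distant-supervision labeling step described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the gene lexicon, the knowledge-base pairs, the example sentences, and the function names `find_genes` and `label_sentences` are all hypothetical, and the graph kernel classifier itself is not shown.

```python
import re
from itertools import combinations

# Hypothetical knowledge base of interacting gene pairs (illustrative values,
# not extracted from KEGG PATHWAY).
KNOWN_INTERACTIONS = {frozenset({"TP53", "MDM2"}), frozenset({"BRCA1", "BARD1"})}
# Hypothetical gene lexicon used to spot gene mentions.
GENE_LEXICON = {"TP53", "MDM2", "BRCA1", "BARD1", "EGFR"}


def find_genes(sentence):
    """Return the lexicon genes mentioned in a sentence, in sorted order."""
    tokens = set(re.findall(r"[A-Za-z0-9]+", sentence))
    return sorted(tokens & GENE_LEXICON)


def label_sentences(sentences):
    """Label each co-mentioned gene pair: 1 if the pair is in the knowledge base, else 0."""
    examples = []
    for sentence in sentences:
        for a, b in combinations(find_genes(sentence), 2):
            label = 1 if frozenset({a, b}) in KNOWN_INTERACTIONS else 0
            examples.append((sentence, a, b, label))
    return examples


# Made-up sentences standing in for PubMed abstract text.
corpus = [
    "MDM2 binds TP53 and promotes its degradation.",
    "EGFR and BRCA1 were both measured in the cohort.",
    "BARD1 was sequenced.",
]
training_set = label_sentences(corpus)
for example in training_set:
    print(example)
```

Sentences that mention fewer than two genes yield no training example, which is why the third sentence is dropped. A real system would also need gene-name normalization, which this sketch omits.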
Affiliation(s)
- Ai-Ru Hsieh
- Department of Statistics, Tamkang University, Tamsui District, New Taipei City, 251301, Taiwan
- Chen-Yu Tsai
- Department of Statistics, Tamkang University, Tamsui District, New Taipei City, 251301, Taiwan
3
Ismail E, Gad W, Hashem M. HEC-ASD: a hybrid ensemble-based classification model for predicting autism spectrum disorder disease genes. BMC Bioinformatics 2022; 23:554. [PMID: 36544099 PMCID: PMC9768984 DOI: 10.1186/s12859-022-05099-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/10/2022] [Accepted: 12/06/2022] [Indexed: 12/24/2022] Open
Abstract
PURPOSE: Autism spectrum disorder (ASD) is the most prevalent disease today. Its causes may be attributed 80% to genetic factors and 20% to environmental factors. In spite of this, the majority of current research is concerned with environmental causes, and only a small proportion with the genetic causes of the disease. Autism is a complex disease, which makes it difficult to identify the genes that cause it. METHODS: A hybrid ensemble-based classification (HEC-ASD) model for predicting ASD genes using gradient boosting machines is proposed. The proposed model uses gene ontology (GO) to construct a gene functional similarity matrix with the hybrid gene similarity (HGS) method. HGS measures the semantic similarity between genes effectively. It combines a graph-based method, such as the Wang method, with the number of directed child nodes of the gene term in GO. Moreover, an ensemble gradient boosting classifier is adapted to enhance gene prediction, forming a robust classification model. RESULTS: The proposed model is evaluated using the Simons Foundation Autism Research Initiative (SFARI) gene database. The experimental results are promising, as they improve the classification performance for predicting ASD genes. The results are compared with other approaches that used a gene regulatory network (GRN), a protein-protein interaction (PPI) network, or GO. The HEC-ASD model reaches the highest prediction accuracy of 0.88% using ensemble learning classifiers. CONCLUSION: The proposed model demonstrates that an ensemble learning technique using gradient boosting is effective in predicting autism spectrum disorder genes. Moreover, the HEC-ASD model uses GO rather than a PPI network or GRN.
Affiliation(s)
- Eman Ismail
- Information Systems Department, Faculty of Computer and Information Sciences, Ain Shams University, Cairo, Egypt
- Walaa Gad
- Information Systems Department, Faculty of Computer and Information Sciences, Ain Shams University, Cairo, Egypt
- Mohamed Hashem
- Information Systems Department, Faculty of Computer and Information Sciences, Ain Shams University, Cairo, Egypt
4
IoMT-Based Mitochondrial and Multifactorial Genetic Inheritance Disorder Prediction Using Machine Learning. COMPUTATIONAL INTELLIGENCE AND NEUROSCIENCE 2022; 2022:2650742. [PMID: 35909844 PMCID: PMC9334098 DOI: 10.1155/2022/2650742] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/15/2022] [Accepted: 07/04/2022] [Indexed: 11/18/2022]
Abstract
A genetic disorder is a serious disease that affects a large number of individuals around the world. There are various types of genetic illnesses; however, we focus on mitochondrial and multifactorial genetic disorders for prediction. Genetic illness is caused by a number of factors, including a defective maternal or paternal gene, excessive abortions, a lack of blood cells, and low white blood cell count. For premature or teenage life development, early detection of genetic diseases is crucial. Although it is difficult to forecast genetic disorders ahead of time, this prediction is critical, since a person's life progress depends on it. Machine learning algorithms are used to diagnose genetic disorders with high accuracy, using datasets collected and constructed from a large number of patient medical reports. Many recent studies have employed genome sequencing for illness detection, but fewer have used patient medical history, and the accuracy of existing studies that use a patient's history is limited. The model proposed in this article, based on the Internet of Medical Things (IoMT), uses two separate machine learning algorithms for genetic disease prediction: support vector machine (SVM) and K-Nearest Neighbor (KNN). Experimental results show that SVM outperformed KNN and existing prediction methods in terms of accuracy. SVM achieved an accuracy of 94.99% and 86.6% for training and testing, respectively.
5
Wang X, Cao X, Feng Y, Guo M, Yu G, Wang J. ELSSI: parallel SNP-SNP interactions detection by ensemble multi-type detectors. Brief Bioinform 2022; 23:6607749. [PMID: 35696639 DOI: 10.1093/bib/bbac213] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 02/25/2022] [Revised: 04/18/2022] [Accepted: 05/07/2022] [Indexed: 12/11/2022] Open
Abstract
With the development of high-throughput genotyping technology, single nucleotide polymorphism (SNP)-SNP interactions (SSIs) detection has become an essential way for understanding disease susceptibility. Various methods have been proposed to detect SSIs. However, given the disease complexity and bias of individual SSI detectors, these single-detector-based methods are generally unscalable for real genome-wide data and with unfavorable results. We propose a novel ensemble learning-based approach (ELSSI) that can significantly reduce the bias of individual detectors and their computational load. ELSSI randomly divides SNPs into different subsets and evaluates them by multi-type detectors in parallel. Particularly, ELSSI introduces a four-stage pipeline (generate, score, switch and filter) to iteratively generate new SNP combination subsets from SNP subsets, score the combination subset by individual detectors, switch high-score combinations to other detectors for re-scoring, then filter out combinations with low scores. This pipeline makes ELSSI able to detect high-order SSIs from large genome-wide datasets. Experimental results on various simulated and real genome-wide datasets show the superior efficacy of ELSSI to state-of-the-art methods in detecting SSIs, especially for high-order ones. ELSSI is applicable with moderate PCs on the Internet and flexible to assemble new detectors. The code of ELSSI is available at https://www.sdu-idea.cn/codes.php?name=ELSSI.
Affiliation(s)
- Xin Wang
- School of Software, Shandong University, Jinan 250101, China; Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan 250101, China
- Xia Cao
- College of Computer and Information Sciences, Southwest University, Chongqing 400715, China
- Yuantao Feng
- College of Computer and Information Sciences, Southwest University, Chongqing 400715, China
- Maozu Guo
- School of Electrical and Information Engineering, Beijing University of Civil Engineering and Architecture, Beijing 100044, China
- Guoxian Yu
- School of Software, Shandong University, Jinan 250101, China
- Jun Wang
- Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan 250101, China
6
Helenius M, Vaitkeviciene G, Abrahamsson J, Jonsson ÓG, Lund B, Harila-Saari A, Vettenranta K, Mikkel S, Stanulla M, Lopez-Lopez E, Waanders E, Madsen HO, Marquart HV, Modvig S, Gupta R, Schmiegelow K, Nielsen RL. Characteristics of white blood cell count in acute lymphoblastic leukemia: A COST LEGEND phenotype-genotype study. Pediatr Blood Cancer 2022; 69:e29582. [PMID: 35316565 DOI: 10.1002/pbc.29582] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/13/2021] [Revised: 12/20/2021] [Accepted: 12/31/2021] [Indexed: 11/10/2022]
Abstract
BACKGROUND: White blood cell count (WBC) as a measure of extramedullary leukemic cell survival is a well-known prognostic factor in acute lymphoblastic leukemia (ALL), but its biology, including impact of host genome variants, is poorly understood. METHODS: We included patients treated with the Nordic Society of Paediatric Haematology and Oncology (NOPHO) ALL-2008 protocol (N = 2347, 72% were genotyped by Illumina Omni2.5exome-8-Bead chip) aged 1-45 years, diagnosed with B-cell precursor (BCP-) or T-cell ALL (T-ALL) to investigate the variation in WBC. Spline functions of WBC were fitted correcting for association with age across ALL subgroups of immunophenotypes and karyotypes. The residuals between spline WBC and actual WBC were used to identify WBC-associated germline genetic variants in a genome-wide association study (GWAS) while adjusting for age and ALL subtype associations. RESULTS: We observed an overall inverse correlation between age and WBC, which was stronger for the selected patient subgroups of immunophenotype and karyotypes (ρ(BCP-ALL) = -.17, ρ(T-ALL) = -.19; p < 3 × 10⁻⁴). Spline functions fitted to age, immunophenotype, and karyotype explained WBC variation better than age alone (ρ = .43, p << 2 × 10⁻⁶). However, when the spline-adjusted WBC residuals were used as phenotype, no GWAS significant associations were found. Based on available annotation, the top 50 genetic variants suggested effects on signal transduction, translation initiation, cell development, and proliferation. CONCLUSION: These results indicate that host genome variants do not strongly influence WBC across ALL subsets, and future studies of why some patients are more prone to hyperleukocytosis should be performed within specific ALL subsets that apply more complex analyses to capture potential germline variant interactions and impact on WBC.
Affiliation(s)
- Marianne Helenius
- Department of Health Technology, Technical University of Denmark, Kongens Lyngby, Copenhagen, Denmark; Department of Pediatrics and Adolescent Medicine, University Hospital Rigshospitalet, Copenhagen, Denmark
- Goda Vaitkeviciene
- Vilnius University Hospital Santaros Klinikos Center for Pediatric Oncology and Hematology and Vilnius University, Vilnius, Lithuania
- Jonas Abrahamsson
- Department of Paediatrics, Institution for Clinical Sciences, Sahlgrenska University Hospital, Gothenburg, Sweden
- Bendik Lund
- Department of Pediatrics, St. Olavs Hospital, Trondheim, Norway
- Arja Harila-Saari
- Department of Women's and Children's Health, Uppsala University, Uppsala, Sweden
- Kim Vettenranta
- University of Helsinki and Children's Hospital, University of Helsinki, Helsinki, Finland
- Sirje Mikkel
- Department of Hematology and Oncology, University of Tartu, Tartu, Estonia
- Martin Stanulla
- Department of Pediatric Hematology and Oncology, Hannover Medical School, Hannover, Germany
- Elixabet Lopez-Lopez
- Department of Genetics, Physical Anthropology and Animal Physiology, Faculty of Science and Technology, University of the Basque Country (UPV/EHU), Leioa, Spain; Pediatric Oncology Group, BioCruces Bizkaia Health Research Institute, Barakaldo, Spain
- Esmé Waanders
- Department of Genetics, University Medical Center Utrecht, Utrecht, The Netherlands; Princess Máxima Center for Pediatric Oncology, Utrecht, The Netherlands
- Hans O Madsen
- Department of Clinical Immunology, University Hospital Rigshospitalet, Copenhagen, Denmark
- Hanne Vibeke Marquart
- Department of Clinical Immunology, University Hospital Rigshospitalet, Copenhagen, Denmark
- Signe Modvig
- Department of Clinical Immunology, University Hospital Rigshospitalet, Copenhagen, Denmark
- Ramneek Gupta
- Department of Health Technology, Technical University of Denmark, Kongens Lyngby, Copenhagen, Denmark; Novo Nordisk Research Centre Oxford, Oxford, UK
- Kjeld Schmiegelow
- Department of Pediatrics and Adolescent Medicine, University Hospital Rigshospitalet, Copenhagen, Denmark; Institute of Clinical Medicine, Faculty of Medicine, University of Copenhagen, Copenhagen, Denmark
- Rikke Linnemann Nielsen
- Department of Health Technology, Technical University of Denmark, Kongens Lyngby, Copenhagen, Denmark; Department of Pediatrics and Adolescent Medicine, University Hospital Rigshospitalet, Copenhagen, Denmark; Novo Nordisk Research Centre Oxford, Oxford, UK
7
Multi-Objective Artificial Bee Colony Algorithm Based on Scale-Free Network for Epistasis Detection. Genes (Basel) 2022; 13:genes13050871. [PMID: 35627256 PMCID: PMC9140669 DOI: 10.3390/genes13050871] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/26/2022] [Revised: 04/30/2022] [Accepted: 05/10/2022] [Indexed: 12/04/2022] Open
Abstract
In genome-wide association studies, epistasis detection is of great significance for the occurrence and diagnosis of complex human diseases, but it also faces challenges such as high dimensionality and a small data sample size. In order to cope with these challenges, several swarm intelligence methods have been introduced to identify epistasis in recent years. However, the existing methods still have some limitations, such as high-consumption and premature convergence. In this study, we proposed a multi-objective artificial bee colony (ABC) algorithm based on the scale-free network (SFMOABC). The SFMOABC incorporates the scale-free network into the ABC algorithm to guide the update and selection of solutions. In addition, the SFMOABC uses mutual information and the K2-Score of the Bayesian network as objective functions, and the opposition-based learning strategy is used to improve the search ability. Experiments were performed on both simulation datasets and a real dataset of age-related macular degeneration (AMD). The results of the simulation experiments showed that the SFMOABC has better detection power and efficiency than seven other epistasis detection methods. In the real AMD data experiment, most of the single nucleotide polymorphism combinations detected by the SFMOABC have been shown to be associated with AMD disease. Therefore, SFMOABC is a promising method for epistasis detection.
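One of the two objective functions named in this abstract, the mutual information between a SNP combination and the phenotype, can be sketched as below. This is an illustrative computation only, assuming the standard definition I(G; Y) = H(G) + H(Y) - H(G, Y); the toy genotypes, the function names, and the XOR-like disease pattern are made up, and neither the K2-Score nor the bee colony search is shown.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of hashable values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def mutual_information(genotypes, phenotypes):
    """I(G; Y) = H(G) + H(Y) - H(G, Y) for a SNP combination G and phenotype Y."""
    joint = list(zip(genotypes, phenotypes))
    return entropy(genotypes) + entropy(phenotypes) - entropy(joint)


# Toy data (hypothetical): an XOR-like pattern in which neither SNP is
# informative on its own, but the pair determines case/control status.
snp1 = [0, 0, 1, 1, 0, 0, 1, 1]
snp2 = [0, 1, 0, 1, 0, 1, 0, 1]
status = [0, 1, 1, 0, 0, 1, 1, 0]  # 1 = case, 0 = control

mi_single = mutual_information(snp1, status)
mi_pair = mutual_information(list(zip(snp1, snp2)), status)
print(mi_single, mi_pair)
```

The single-SNP score is zero while the pair score reaches one bit, which is the kind of purely interactive signal that single-locus tests miss and that epistasis detection methods aim to capture.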
8
Vasta R, Chia R, Traynor BJ, Chiò A. Unraveling the complex interplay between genes, environment, and climate in ALS. EBioMedicine 2022; 75:103795. [PMID: 34974309 PMCID: PMC8728044 DOI: 10.1016/j.ebiom.2021.103795] [Citation(s) in RCA: 41] [Impact Index Per Article: 13.7] [Received: 10/08/2021] [Revised: 12/03/2021] [Accepted: 12/16/2021] [Indexed: 12/11/2022] Open
Abstract
Various genetic and environmental risk factors have been implicated in the pathogenesis of amyotrophic lateral sclerosis (ALS). Despite this, the cause of most ALS cases remains obscure. In this review, we describe the current evidence implicating genetic and environmental factors in motor neuron degeneration. While the risk exerted by many environmental factors may appear small, their effect could be magnified by the presence of a genetic predisposition. We postulate that gene-environment interactions account for at least a portion of the unknown etiology in ALS. Climate underlies multiple environmental factors, some of which have been implied in ALS etiology, and the impact of global temperature increase on the gene-environment interactions should be carefully monitored. We describe the main concepts underlying such interactions. Although a lack of large cohorts with detailed genetic and environmental information hampers the search for gene-environment interactions, newer algorithms and machine learning approaches offer an opportunity to break this stalemate. Understanding how genetic and environmental factors interact to cause ALS may ultimately pave the way towards precision medicine becoming an integral part of ALS care.
Affiliation(s)
- Rosario Vasta
- ALS Center, Department of Neuroscience "Rita Levi Montalcini", University of Turin, via Cherasco 15, Turin 1026, Italy; Neuromuscular Diseases Research Section, Laboratory of Neurogenetics, National Institute on Aging (NIH), Bethesda, MD 20892, USA
- Ruth Chia
- Neuromuscular Diseases Research Section, Laboratory of Neurogenetics, National Institute on Aging (NIH), Bethesda, MD 20892, USA
- Bryan J Traynor
- Neuromuscular Diseases Research Section, Laboratory of Neurogenetics, National Institute on Aging (NIH), Bethesda, MD 20892, USA; Reta Lila Weston Institute, UCL Queen Square Institute of Neurology, University College London, London WC1N 1PJ, UK; Department of Neurology, Johns Hopkins University Medical Center, Baltimore, MD 21287, USA; National Institute of Neurological Disorders and Stroke, NIH, Bethesda, MD, USA; ASO Rapid Development Laboratory, Therapeutics Development Branch, National Center for Advancing Translational Sciences, NIH, Rockville, MD, USA
- Adriano Chiò
- ALS Center, Department of Neuroscience "Rita Levi Montalcini", University of Turin, via Cherasco 15, Turin 1026, Italy; Institute of Cognitive Sciences and Technologies, C.N.R., Rome 00185, Italy; Neurology 1, AOU Città della Salute e della Scienza di Torino, Turin, Italy
9
Lyu R, Sun J, Xu D, Jiang Q, Wei C, Zhang Y. GESLM algorithm for detecting causal SNPs in GWAS with multiple phenotypes. Brief Bioinform 2021; 22:6329404. [PMID: 34323927 DOI: 10.1093/bib/bbab276] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 04/06/2021] [Revised: 06/05/2021] [Accepted: 06/29/2021] [Indexed: 12/13/2022] Open
Abstract
With the development of genome-wide association studies, how to gain information from a large scale of data has become an issue of common concern, since traditional methods are not fully developed to solve problems such as identifying loci-to-loci interactions (also known as epistasis). Previous epistatic studies mainly focused on local information with a single outcome (phenotype), while in this paper, we developed a two-stage global search algorithm, Greedy Equivalence Search with Local Modification (GESLM), to implement a global search of directed acyclic graph in order to identify genome-wide epistatic interactions with multiple outcome variables (phenotypes) in a case-control design. GESLM integrates the advantages of score-based methods and constraint-based methods to learn the phenotype-related Bayesian network and is powerful and robust to find the interaction structures that display both genetic associations with phenotypes and gene interactions. We compared GESLM with some common phenotype-related loci detecting methods in simulation studies. The results showed that our method improved the accuracy and efficiency compared with others, especially in an unbalanced case-control study. Besides, its application on the UK Biobank dataset suggested that our algorithm has great performance when handling genome-wide association data with more than one phenotype.
Affiliation(s)
- Ruiqi Lyu
- Shanghai Jiao Tong University, Department of Bioinformatics and Biostatistics, Shanghai, 200240, China
- Jianle Sun
- Shanghai Jiao Tong University, Department of Bioinformatics and Biostatistics, Shanghai, 200240, China
- Dong Xu
- Shanghai Jiao Tong University, Department of Bioinformatics and Biostatistics, Shanghai, 200240, China
- Qianxue Jiang
- Shanghai Jiao Tong University, Department of Bioinformatics and Biostatistics, Shanghai, 200240, China
- Chaochun Wei
- Shanghai Jiao Tong University, Department of Bioinformatics and Biostatistics, Shanghai, 200240, China
- Yue Zhang
- Shanghai Jiao Tong University, Department of Bioinformatics and Biostatistics, Shanghai, 200240, China
10
Okazaki A, Horpaopan S, Zhang Q, Randesi M, Ott J. Genotype Pattern Mining for Pairs of Interacting Variants Underlying Digenic Traits. Genes (Basel) 2021; 12:1160. [PMID: 34440333 PMCID: PMC8391494 DOI: 10.3390/genes12081160] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 06/24/2021] [Revised: 07/23/2021] [Accepted: 07/27/2021] [Indexed: 12/15/2022] Open
Abstract
Some genetic diseases ("digenic traits") are due to the interaction between two DNA variants, which presumably reflects biochemical interactions. For example, certain forms of Retinitis Pigmentosa, a type of blindness, occur in the presence of two mutant variants, one each in the ROM1 and RDS genes, while the occurrence of only one such variant results in a normal phenotype. Detecting variant pairs underlying digenic traits by standard genetic methods is difficult and is downright impossible when individual variants alone have minimal effects. Frequent pattern mining (FPM) methods are known to detect patterns of items. We make use of FPM approaches to find pairs of genotypes (from different variants) that can discriminate between cases and controls. Our method is based on genotype patterns of length two, and permutation testing allows assigning p-values to genotype patterns, where the null hypothesis refers to equal pattern frequencies in cases and controls. We compare different interaction search approaches and their properties on the basis of published datasets. Our implementation of FPM to case-control studies is freely available.
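The permutation test mentioned in this abstract, with the null hypothesis of equal pattern frequencies in cases and controls, can be sketched as follows. This is a minimal sketch under assumptions, not the authors' freely available implementation: the function names, the absolute frequency difference as test statistic, the add-one correction, and the toy cohort are all hypothetical choices.

```python
import random


def pattern_count(genotypes, pattern):
    """Count individuals whose genotype pair (variant A, variant B) equals the pattern."""
    return sum(1 for g in genotypes if g == pattern)


def permutation_p_value(genotypes, labels, pattern, n_perm=2000, seed=0):
    """Permutation p-value for a difference in pattern frequency between cases and controls."""
    rng = random.Random(seed)
    n_cases = sum(labels)
    n_controls = len(labels) - n_cases

    def freq_diff(lbls):
        cases = [g for g, l in zip(genotypes, lbls) if l == 1]
        controls = [g for g, l in zip(genotypes, lbls) if l == 0]
        return abs(pattern_count(cases, pattern) / n_cases
                   - pattern_count(controls, pattern) / n_controls)

    observed = freq_diff(labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any genotype-phenotype link
        if freq_diff(shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)


# Toy cohort (made up): genotype pattern (1, 1) occurs in 12 of 20 cases
# and in 2 of 20 controls.
genotypes = [(1, 1)] * 12 + [(0, 0)] * 8 + [(1, 1)] * 2 + [(0, 1)] * 18
labels = [1] * 20 + [0] * 20  # 1 = case, 0 = control
p = permutation_p_value(genotypes, labels, pattern=(1, 1))
print(p)
```

Shuffling the case/control labels while leaving genotypes fixed preserves the overall pattern frequency, so the resulting p-value reflects only how unusual the observed split between the two groups is.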
Affiliation(s)
- Atsuko Okazaki
- Department of Diagnostics and Therapeutics of Intractable Diseases, Juntendo University, Bunkyo-ku, Tokyo 113-8421, Japan
- Laboratory of Statistical Genetics, Rockefeller University, New York, NY 10065, USA
- Sukanya Horpaopan
- Department of Anatomy, Faculty of Medical Science, Naresuan University, Phitsanulok 65000, Thailand
- Qingrun Zhang
- Department of Mathematics and Statistics, University of Calgary, Calgary, AB T2N 1N4, Canada
- Matthew Randesi
- Laboratory of the Biology of Addictive Diseases, Rockefeller University, New York, NY 10065, USA
- Jurg Ott
- Laboratory of Statistical Genetics, Rockefeller University, New York, NY 10065, USA
11
Chen K, Xu H, Lei Y, Lio P, Li Y, Guo H, Ali Moni M. Integration and interplay of machine learning and bioinformatics approach to identify genetic interaction related to ovarian cancer chemoresistance. Brief Bioinform 2021; 22:6272796. [PMID: 33971668 DOI: 10.1093/bib/bbab100] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 01/13/2021] [Revised: 03/04/2021] [Accepted: 03/06/2021] [Indexed: 11/15/2022] Open
Abstract
Although chemotherapy is the first-line treatment for ovarian cancer (OCa) patients, chemoresistance (CR) decreases their progression-free survival. This paper investigates the genetic interaction (GI) related to OCa-CR. To decrease the complexity of establishing gene networks, individual signature genes related to OCa-CR are identified using a gradient boosting decision tree algorithm. Additionally, the genetic interaction coefficient (GIC) is proposed to measure the correlation of two signature genes quantitatively and explain their joint influence on OCa-CR. A gene pair with a high GIC is identified as a signature pair. A total of 24 signature gene pairs, comprising 10 individual signature genes, are selected, and the influence of signature gene pairs on OCa-CR is explored. Finally, a signature gene pair-based prediction of OCa-CR is established. The area under the curve (AUC) is a widely used performance measure for machine learning prediction. The AUC of the signature gene pair-based prediction reaches 0.9658, whereas the AUC of the individual signature gene-based prediction is only 0.6823. The identified signature gene pairs not only build an efficient GI network of OCa-CR but also provide an interesting way to predict OCa-CR. This improvement shows that the proposed method is a useful tool for investigating GI related to OCa-CR.
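For readers unfamiliar with the AUC used to compare the two predictors in this abstract, the following sketch computes it from its pairwise-ranking definition: the probability that a randomly chosen positive example scores higher than a randomly chosen negative one. The scores and labels are invented for illustration and are unrelated to the 0.9658 and 0.6823 values reported above; the function name `auc` is a hypothetical helper.

```python
def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up predictions: 1 = chemoresistant, 0 = chemosensitive.
labels = [1, 1, 1, 0, 0, 0]
scores_a = [0.9, 0.8, 0.4, 0.5, 0.2, 0.1]  # one positive ranked below a negative
scores_b = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]  # perfect separation

auc_a = auc(scores_a, labels)
auc_b = auc(scores_b, labels)
print(auc_a, auc_b)
```

This quadratic pairwise form is easy to verify by hand but is only practical for small samples; larger datasets are usually handled with a rank-based formulation.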
Affiliation(s)
- Kexin Chen
- School of Electronics Engineering and Computer Science, Peking University, 100871, Beijing, China
- Haoming Xu
- Department of Biomedical Engineering, Duke University, 27708, Durham, United States
- Yiming Lei
- School of Electronics Engineering and Computer Science, Peking University, 100871, Beijing, China
- Pietro Lio
- Computer Laboratory, University of Cambridge, CB3-0FD, Cambridge, United Kingdom
- Yuan Li
- Department of Obstetrics and Gynecology, Peking University Third Hospital, 100083, Beijing, China
- Hongyan Guo
- Department of Obstetrics and Gynecology, Peking University Third Hospital, 100083, Beijing, China
- Mohammad Ali Moni
- School of Public health and Community Medicine, University of New South Wales, 2052, Sydney, Australia
12
Tandan M, Acharya Y, Pokharel S, Timilsina M. Discovering symptom patterns of COVID-19 patients using association rule mining. Comput Biol Med 2021; 131:104249. [PMID: 33561673 PMCID: PMC7966840 DOI: 10.1016/j.compbiomed.2021.104249] [Citation(s) in RCA: 40] [Impact Index Per Article: 10.0] [Received: 12/21/2020] [Revised: 01/25/2021] [Accepted: 01/25/2021] [Indexed: 12/16/2022]
Abstract
BACKGROUND The COVID-19 pandemic is a significant public health crisis that is hitting hard on people's health, well-being, and freedom of movement, and affecting the global economy. Scientists worldwide are competing to develop therapeutics and vaccines; currently, three drugs and two vaccine candidates have been given emergency use authorization. However, there are still questions of efficacy with regard to specific subgroups of patients and the vaccine's scalability to the general public. Under such circumstances, understanding COVID-19 symptoms is vital in initial triage; it is crucial to distinguish the severity of cases for effective management and treatment. This study aimed to discover symptom patterns and overall symptom rules, including rules disaggregated by age, sex, chronic condition, and mortality status, among COVID-19 patients. METHODS This study was a retrospective analysis of COVID-19 patient data made available online by the Wolfram Data Repository through May 27, 2020. We applied a widely used rule-based machine learning technique called association rule mining to identify frequent symptoms and define patterns in the rules discovered. RESULTS In total, 1,560 patients with COVID-19 were included in the study, with a median age of 52 years. The most frequently occurring symptom was fever (67%), followed by cough (37%), malaise/body soreness (11%), pneumonia (11%), and sore throat (8%). Myocardial infarction, heart failure, and renal disease were present in less than 1% of patients. The top ten significant symptom rules (out of 71 generated) showed cough, septic shock, and respiratory distress syndrome as frequent consequents. If a patient had a breathing problem and sputum production, there was higher confidence of that patient having a cough; if cardiac disease, renal disease, or pneumonia was present, there was higher confidence of septic shock or respiratory distress syndrome.
Symptom rules differed between younger and older patients and between male and female patients. Patients who had chronic conditions or died of COVID-19 had more severe symptom rules than patients who did not have chronic conditions or who survived COVID-19. Concerning chronic condition rules among 147 patients, if a patient had diabetes, prerenal azotemia, and coronary bypass surgery, there was a certainty of hypertension. CONCLUSION The most frequently reported symptoms in patients with COVID-19 were fever, cough, pneumonia, and sore throat, while 1% had severe symptoms, such as septic shock, respiratory distress syndrome, and respiratory failure. Symptom rules differed by age and sex. Patients with chronic disease and patients who died of COVID-19 had severe symptom rules, more specifically cardiovascular-related symptoms accompanied by pneumonia, fever, and cough as consequents.
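The abstract above reports rules in terms of antecedents, consequents, and confidence. As a minimal sketch of how those quantities are usually defined in association rule mining, the following Python computes support, confidence, and lift for one rule. The patient records are invented toy data, not the Wolfram Data Repository data, and the function names are illustrative assumptions rather than the authors' code.

```python
# Toy records (hypothetical, not the study data): each patient is a set of symptoms.
patients = [
    {"fever", "cough"},
    {"fever", "cough", "sputum"},
    {"fever"},
    {"cough", "sputum", "breathing problem"},
    {"fever", "sore throat"},
    {"cough", "sputum", "breathing problem"},
]


def support(itemset, records):
    """Fraction of records that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= r for r in records) / len(records)


def rule_metrics(antecedent, consequent, records):
    """Support, confidence and lift of the rule antecedent -> consequent."""
    s_both = support(set(antecedent) | set(consequent), records)
    s_ante = support(antecedent, records)
    s_cons = support(consequent, records)
    confidence = s_both / s_ante if s_ante else 0.0
    lift = confidence / s_cons if s_cons else 0.0
    return {"support": s_both, "confidence": confidence, "lift": lift}


# Rule: {breathing problem, sputum} -> {cough}
metrics = rule_metrics({"breathing problem", "sputum"}, {"cough"}, patients)
print(metrics)
```

In this toy data every patient with both a breathing problem and sputum also has a cough, so the rule's confidence is 1; a lift above 1 indicates that the antecedent makes the consequent more likely than its baseline frequency.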
Affiliation(s)
- Meera Tandan
- Cecil G Sheps Center for Health Service Research, University of North Carolina, Chapel Hill, USA.
| | - Yogesh Acharya
- Western Vascular Institute, Galway University Hospital, Galway, Ireland.
| | - Suresh Pokharel
- The University of Queensland, St Lucia, Queensland, Australia.
| | - Mohan Timilsina
- Data Science Institute, Insight Centre for Data Analytics, National University of Ireland Galway, Ireland.
| |
|
13
|
Aristodimou A, Antoniades A, Dardiotis E, Loizidou E, Spyrou G, Votsi C, Kyproula C, Pantzaris M, Grigoriadis N, Hadjigeorgiou G, Kyriakides T, Pattichi C. A Framework for Efficient N-Way Interaction Testing in Case/Control Studies With Categorical Data. IEEE OPEN JOURNAL OF ENGINEERING IN MEDICINE AND BIOLOGY 2021; 2:256-262. [PMID: 35402966 PMCID: PMC8901013 DOI: 10.1109/ojemb.2021.3100416] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/13/2021] [Revised: 07/08/2021] [Accepted: 07/22/2021] [Indexed: 11/26/2022] Open
Abstract
Goal: Most common diseases are influenced by multiple gene interactions and interactions with the environment. Performing an exhaustive search to identify such interactions is computationally expensive and needs to address the multiple testing problem. A four-step framework is proposed for the efficient identification of n-Way interactions. Methods: The framework was applied on a Multiple Sclerosis dataset with 725 subjects and 147 tagging SNPs. The first two steps of the framework are quality control and feature selection. The next step uses clustering and binary encodes the features. The final step performs the n-Way interaction testing. Results: The feature space was reduced to 7 SNPs and using the proposed binary encoding, more 2-SNP and 3-SNP interactions were identified compared to using the initial encoding. Conclusions: The framework selects informative features and with the proposed binary encoding it is able to identify more n-way interactions by increasing the power of the statistical analysis.
Affiliation(s)
| | | | - Efthimios Dardiotis
- Department of Neurology, Faculty of Medicine, University of Thessaly, Volos 38221, Greece
| | - Eleni Loizidou
- Department of Hygiene and Epidemiology, University of Ioannina, Ioannina 451 10, Greece
- Institute for Bioinnovation, Biomedical Sciences Research Center Alexander Fleming, Athens 16672, Greece
| | - George Spyrou
- Bioinformatics Department and Cyprus School of Molecular Medicine, Cyprus Institute of Neurology and Genetics, Nicosia 2371, Cyprus
| | - Christina Votsi
- Neurogenetics Department and Cyprus School of Molecular Medicine, Cyprus Institute of Neurology and Genetics, Nicosia 2371, Cyprus
| | - Christodoulou Kyproula
- Neurogenetics Department and Cyprus School of Molecular Medicine, Cyprus Institute of Neurology and Genetics, Nicosia 2371, Cyprus
| | - Marios Pantzaris
- Department of Neurology and Cyprus School of Molecular Medicine, Cyprus Institute of Neurology and Genetics, Nicosia 2371, Cyprus
| | - Nikolaos Grigoriadis
- Department of Neurology II, Aristotle University of Thessaloniki, Thessaloniki 541 24, Greece
| | | | - Theodoros Kyriakides
- Department of Basic and Clinical Sciences, Medical School, University of Nicosia, Nicosia 1678, Cyprus
| | - Constantinos Pattichi
- Department of Computer Science, University of Cyprus, Nicosia 1678, Cyprus
- Biomedical Engineering Research Centre, University of Cyprus, Nicosia 1678, Cyprus
| |
|
14
|
Sun S, Dong B, Zou Q. Revisiting genome-wide association studies from statistical modelling to machine learning. Brief Bioinform 2020; 22:5943789. [PMID: 33126243 DOI: 10.1093/bib/bbaa263] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/10/2020] [Revised: 09/06/2020] [Accepted: 09/11/2020] [Indexed: 11/14/2022] Open
Abstract
Over the last decade, genome-wide association studies (GWAS) have discovered thousands of genetic variants underlying complex human diseases and agriculturally important traits. These findings have been utilized to dissect the biological basis of diseases, to develop new drugs, to advance precision medicine and to boost breeding. However, the potential of GWAS is still underexploited due to methodological limitations. Many challenges have emerged, including detecting epistasis and single-nucleotide polymorphisms (SNPs) with small effects and distinguishing causal variants from other SNPs associated through linkage disequilibrium. These issues have motivated advancements in GWAS analyses in two contrasting cultures: statistical modelling and machine learning. In this review, we systematically present the basic concepts and the benefits and limitations of both methods. We further discuss recent efforts to mitigate their weaknesses. Additionally, we summarize the state-of-the-art tools for detecting the missed signals, ultrarare mutations and gene-gene interactions and for prioritizing SNPs. Our work can offer both theoretical and practical guidelines for performing GWAS analyses and for developing further new robust methods to fully exploit the potential of GWAS.
Affiliation(s)
- Shanwen Sun
- Institute of Fundamental and Frontier Sciences at the University of Electronic Science and Technology of China, Chengdu, China
| | - Benzhi Dong
- College of Computer Science and Engineering, Northeast Forestry University, Harbin, China
| | - Quan Zou
- Institute of Fundamental and Frontier Sciences at the University of Electronic Science and Technology of China, Chengdu, China
| |
|
15
|
You Y, Ru X, Lei W, Li T, Xiao M, Zheng H, Chen Y, Zhang L. Developing the novel bioinformatics algorithms to systematically investigate the connections among survival time, key genes and proteins for Glioblastoma multiforme. BMC Bioinformatics 2020; 21:383. [PMID: 32938364 PMCID: PMC7646399 DOI: 10.1186/s12859-020-03674-4] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/15/2022] Open
Abstract
BACKGROUND Glioblastoma multiforme (GBM) is one of the most common malignant brain tumors and its average survival time is less than 1 year after diagnosis. RESULTS First, this study develops novel survival analysis algorithms to explore the key genes and proteins related to GBM. Then, we explore the significant correlation between AEBP1 upregulation and increased EGFR expression in primary glioma, and employ the glioma cell line LN229 to identify relevant proteins and molecular pathways through protein network analysis. Finally, we find that AEBP1 exerts its tumor-promoting effects mainly by activating the mTOR pathway in glioma. CONCLUSIONS We summarize the whole process of the experiment and discuss how to expand our experiment in the future.
Affiliation(s)
- Yujie You
- College of Computer Science, Sichuan University, Chengdu, 610065 China
| | - Xufang Ru
- Department of Neurosurgery, Southwest Hospital, Third Military Medical University, Chongqing, P.R. China
| | - Wanjing Lei
- College of Computer Science, Sichuan University, Chengdu, 610065 China
| | - Tingting Li
- College of Mathematics and Statistics, Southwest University, Chongqing, 400715 P.R. China
| | - Ming Xiao
- College of Computer Science, Sichuan University, Chengdu, 610065 China
| | - Huiru Zheng
- School of Computing, Ulster University, Coleraine, Londonderry, Northern Ireland, UK
| | - Yujie Chen
- Department of Neurosurgery, Southwest Hospital, Third Military Medical University, Chongqing, P.R. China
| | - Le Zhang
- College of Computer Science, Sichuan University, Chengdu, 610065 China
| |
|
16
|
Lynch SM, Handorf E, Sorice KA, Blackman E, Bealin L, Giri VN, Obeid E, Ragin C, Daly M. The effect of neighborhood social environment on prostate cancer development in black and white men at high risk for prostate cancer. PLoS One 2020; 15:e0237332. [PMID: 32790761 PMCID: PMC7425919 DOI: 10.1371/journal.pone.0237332] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/22/2020] [Accepted: 07/23/2020] [Indexed: 12/24/2022] Open
Abstract
INTRODUCTION Neighborhood socioeconomic (nSES) factors have been implicated in prostate cancer (PCa) disparities. In line with the Precision Medicine Initiative that suggests clinical and socioenvironmental factors can impact PCa outcomes, we determined whether nSES variables are associated with time to PCa diagnosis and could inform PCa clinical risk assessment. MATERIALS AND METHODS The study sample included 358 high risk men (PCa family history and/or Black race), aged 35-69 years, enrolled in an early detection program. Patient variables were linked to 78 nSES variables (employment, income, etc.) from previous literature via geocoding. Patient-level models, including baseline age, prostate specific antigen (PSA), digital rectal exam, as well as combined models (patient plus nSES variables) by race/PCa family history subgroups were built after variable reduction methods using Cox regression and LASSO machine-learning. Model fit of patient and combined models (AIC) was compared; p-values<0.05 were considered significant. Model-based high/low nSES exposure scores were calculated and the 5-year predicted probability of PCa was plotted against PSA by high/low neighborhood score to preliminarily assess clinical relevance. RESULTS In combined models, nSES variables were significantly associated with time to PCa diagnosis. Workers' mode of transportation and low income were significant in White men with a PCa family history. Homeownership (%owner-occupied houses with >3 bedrooms) and unemployment were significant in Black men with and without a PCa family history, respectively. The 5-year predicted probability of PCa was higher in men with a high neighborhood score (weighted combination of significant nSES variables) compared to a low score (e.g., Baseline PSA level of 4ng/mL for men with PCa family history: White-26.7% vs 7.7%; Black-56.2% vs 29.7%). DISCUSSION Utilizing neighborhood data during patient risk assessment may be useful for high risk men affected by disparities. However, future studies with larger samples and validation/replication steps are needed.
Affiliation(s)
- Shannon M. Lynch
- Cancer Prevention and Control, Fox Chase Cancer Center, Philadelphia, Pennsylvania, United States of America
| | - Elizabeth Handorf
- Cancer Prevention and Control, Fox Chase Cancer Center, Philadelphia, Pennsylvania, United States of America
| | - Kristen A. Sorice
- Cancer Prevention and Control, Fox Chase Cancer Center, Philadelphia, Pennsylvania, United States of America
| | - Elizabeth Blackman
- Cancer Prevention and Control, Fox Chase Cancer Center, Philadelphia, Pennsylvania, United States of America
| | - Lisa Bealin
- Department of Clinical Genetics, Fox Chase Cancer Center, Philadelphia, Pennsylvania, United States of America
| | - Veda N. Giri
- Cancer Risk Assessment and Clinical Cancer Genetics Program, Departments of Medical Oncology, Cancer Biology, and Urology, Sidney Kimmel Cancer Center, Thomas Jefferson University, Philadelphia, Pennsylvania, United States of America
| | - Elias Obeid
- Cancer Prevention and Control, Fox Chase Cancer Center, Philadelphia, Pennsylvania, United States of America
- Department of Clinical Genetics, Fox Chase Cancer Center, Philadelphia, Pennsylvania, United States of America
| | - Camille Ragin
- Cancer Prevention and Control, Fox Chase Cancer Center, Philadelphia, Pennsylvania, United States of America
| | - Mary Daly
- Cancer Prevention and Control, Fox Chase Cancer Center, Philadelphia, Pennsylvania, United States of America
- Department of Clinical Genetics, Fox Chase Cancer Center, Philadelphia, Pennsylvania, United States of America
| |
|
17
|
Le DH. Machine learning-based approaches for disease gene prediction. Brief Funct Genomics 2020; 19:350-363. [PMID: 32567652 DOI: 10.1093/bfgp/elaa013] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/25/2020] [Revised: 04/30/2020] [Accepted: 05/09/2020] [Indexed: 12/20/2022] Open
Abstract
Disease gene prediction is an essential issue in biomedical research. In the early days, annotation-based approaches were proposed for this problem. With the development of high-throughput technologies, interaction data between genes/proteins have grown quickly and now cover almost the entire genome and proteome; thus, network-based methods for the problem have become prominent. In parallel, machine learning techniques, which formulate the problem as a classification task, have also been proposed. Here, we first show a roadmap of the machine learning-based methods for disease gene prediction. In the beginning, the problem was usually approached using binary classification, where positive and negative training sample sets are comprised of disease genes and non-disease genes, respectively. The disease genes are ones known to be associated with diseases; meanwhile, non-disease genes were randomly selected from those not yet known to be associated with diseases. However, the latter may contain unknown disease genes. To overcome this uncertainty in defining the non-disease genes, more realistic approaches have been proposed for the problem, such as unary and semi-supervised classification. Recently, more advanced methods, including ensemble learning, matrix factorization and deep learning, have been proposed for the problem. Second, 12 representative machine learning-based methods for disease gene prediction were examined and compared in terms of prediction performance and running time. Finally, their advantages, disadvantages, interpretability and trust were also analyzed and discussed.
Affiliation(s)
- Duc-Hau Le
- Department of Computational Biomedicine, Vingroup Big Data Institute, Hanoi, Vietnam
| |
|
18
|
Zhao B, Gabriel RA, Vaida F, Lopez NE, Eisenstein S, Clary BM. Predicting Overall Survival in Patients with Metastatic Rectal Cancer: a Machine Learning Approach. J Gastrointest Surg 2020; 24:1165-1172. [PMID: 31468331 PMCID: PMC7048666 DOI: 10.1007/s11605-019-04373-z] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 07/07/2019] [Accepted: 08/13/2019] [Indexed: 02/08/2023]
Abstract
BACKGROUND A significant proportion of patients with rectal cancer will present with synchronous metastasis at the time of diagnosis. Overall survival (OS) for these patients is highly variable and previous attempts to build predictive models often have low predictive power, with concordance indexes (c-index) less than 0.70. METHODS Using the National Cancer Database (2010-2014), we identified patients with synchronous metastatic rectal cancer. The data was split into a training dataset (diagnosis years 2010-2012), which was used to build the machine learning model, and a testing dataset (diagnosis years 2013-2014), which was used to externally validate the model. A nomogram predicting 3-year OS was created using Cox proportional hazard regression with lasso penalization. Predictors were selected based on clinical significance and availability in NCDB. Performance of the machine learning model was assessed by c-index. RESULTS A total of 4098 and 3107 patients were used to construct and validate the nomogram, respectively. Internally validated c-indexes at 1, 2, and 3 years were 0.816 (95% CI 0.813-0.818), 0.789 (95% CI 0.786-0.790), and 0.778 (95% CI 0.775-0.780), respectively. Externally validated c-indexes at 1, 2, and 3 years were 0.811, 0.779, and 0.778, respectively. CONCLUSIONS There is wide variability in the OS for patients with metastatic rectal cancer, making accurate predictions difficult. However, using machine learning techniques, more accurate models can be built. This will aid patients and clinicians in setting expectations and making clinical decisions in this group of challenging patients.
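The abstract above evaluates its model by concordance index. As a minimal sketch of what that metric measures, the following Python computes Harrell's C for right-censored survival data. The follow-up times, event flags, and risk scores are invented toy values, and this is not the authors' evaluation code (the lasso-penalized Cox model itself is not reproduced here).

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C for right-censored data.

    times: observed follow-up times
    events: 1 if the event (death) was observed, 0 if censored
    risk_scores: higher score means higher predicted risk (shorter survival)
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when subject i had an observed event
            # strictly before subject j's observed time.
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data (hypothetical): months of follow-up, event flags, model risk scores.
times = [5, 8, 12, 20, 24]
events = [1, 1, 0, 1, 0]
risk = [0.9, 0.4, 0.6, 0.3, 0.1]
c = concordance_index(times, events, risk)
print(c)
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering of patients by risk, which is why the abstract treats values below 0.70 as low predictive power.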
Affiliation(s)
- Beiqun Zhao
- Department of Surgery, University of California San Diego
| | | | - Florin Vaida
- Department of Family Medicine and Public Health, University of California San Diego
| | | | | | - Bryan M. Clary
- Department of Surgery, University of California San Diego
| |
|
19
|
Wang Y, Zhu LN, Ma XW, Yang F, Xu XL, Yang Y, Yang X, Peng W, Zhang WQ, Liang JY, Zhu WD, Jiang TJ, Zhang XL, Feng ZC. Gene-Focused Networks Underlying Phenotypic Convergence in a Systematically Phenotyped Cohort With Heterogeneous Intellectual Disability. Front Bioeng Biotechnol 2020; 8:45. [PMID: 32117926 PMCID: PMC7019181 DOI: 10.3389/fbioe.2020.00045] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/18/2019] [Accepted: 01/21/2020] [Indexed: 11/13/2022] Open
Abstract
The broad spectrum of intellectual disability (ID) patients' clinical manifestations, the heterogeneity of ID genetic variation, and the diversity of the phenotypic variation represent major challenges for ID diagnosis. By exploiting a manually curated systematic phenotyping cohort of 3803 patients harboring ID, we identified 704 pathogenic genes, 3848 pathogenic sites, and 2075 standard phenotypes for underlying molecular perturbations and their phenotypic impact. We found a positive correlation between the number of phenotypes and the number of patients, which revealed their extreme heterogeneity, as well as the relative contribution of multiple determinants to the heterogeneity of ID phenotypes. Nevertheless, despite the extreme heterogeneity in phenotypes, the ID genes had a specific bias of mutation types, and the top 44 genes ranked by the number of patients accounted for 39.9% of total patients. More interestingly, enriched co-occurrent phenotypes and co-occurrent phenotype networks for each gene had the potential for prioritizing ID genes and further exhibited the convergence of ID phenotypes. We then established a predictor called IDpred using machine learning methods for ID pathogenic gene prediction. Using 10-fold cross-validation, our evaluation shows remarkable AUC values for IDpred (AUC = 0.978), demonstrating the robustness and reliability of our tool. In addition, we built the most comprehensive database of an ID phenotyped cohort to date, IDminer (http://218.4.234.74:3100/IDminer/), which includes the curated ID data and the integrated IDpred tool for both clinical and experimental researchers. IDminer serves as an important resource and user-friendly interface to help researchers investigate ID data, and provides important implications for the diagnosis and pathogenesis of developmental disorders of cognition.
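The abstract above reports an AUC obtained with 10-fold cross-validation. As a minimal sketch of that evaluation protocol, the following Python pools out-of-fold scores across folds and computes a single AUC. The one-dimensional toy features and the midpoint "classifier" are hypothetical stand-ins; nothing here reproduces IDpred or its features.

```python
import random


def auc(labels, scores):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need both classes")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def kfold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and deal the indices into k near-equal held-out folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validated_auc(X, y, fit, predict, k=10, seed=0):
    """Score every sample with a model that never saw it, then compute one pooled AUC."""
    oof = [None] * len(y)
    for test_idx in kfold_indices(len(y), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        for i in test_idx:
            oof[i] = predict(model, X[i])
    return auc(y, oof)


# Hypothetical stand-in classifier: score = distance above the midpoint of the class means.
def fit_midpoint(X, y):
    pos = [x for x, t in zip(X, y) if t == 1]
    neg = [x for x, t in zip(X, y) if t == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def predict_midpoint(midpoint, x):
    return x - midpoint


# Toy, well-separated data (hypothetical).
X = [0.1, 0.3, 0.2, 0.5, 0.4, 1.1, 0.9, 1.3, 1.0, 1.2,
     0.15, 0.35, 0.25, 0.45, 0.05, 1.05, 0.95, 1.25, 1.15, 1.35]
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1,
     0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
cv_auc = cross_validated_auc(X, y, fit_midpoint, predict_midpoint, k=10)
print(cv_auc)
```

Pooling out-of-fold scores is one common convention; averaging per-fold AUCs is another, and the abstract does not say which was used.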
Affiliation(s)
- Yan Wang
- BaYi Children’s Hospital, The Seventh Medical Center of PLA General Hospital, Beijing, China
- National Engineering Laboratory for Birth Defects Prevention and Control of Key Technology, Beijing, China
- Beijing Key Laboratory of Pediatric Organ Failure, Beijing, China
| | - Li-Na Zhu
- BaYi Children’s Hospital, The Seventh Medical Center of PLA General Hospital, Beijing, China
- National Engineering Laboratory for Birth Defects Prevention and Control of Key Technology, Beijing, China
- Beijing Key Laboratory of Pediatric Organ Failure, Beijing, China
| | - Xiu-Wei Ma
- BaYi Children’s Hospital, The Seventh Medical Center of PLA General Hospital, Beijing, China
- National Engineering Laboratory for Birth Defects Prevention and Control of Key Technology, Beijing, China
- Beijing Key Laboratory of Pediatric Organ Failure, Beijing, China
| | - Fang Yang
- Suzhou Institute of Systems Medicine, Chinese Academy of Medical Sciences, Suzhou, China
| | - Xi-Lin Xu
- Suzhou Institute of Systems Medicine, Chinese Academy of Medical Sciences, Suzhou, China
| | - Yao Yang
- BaYi Children’s Hospital, The Seventh Medical Center of PLA General Hospital, Beijing, China
- National Engineering Laboratory for Birth Defects Prevention and Control of Key Technology, Beijing, China
- Beijing Key Laboratory of Pediatric Organ Failure, Beijing, China
| | - Xiao Yang
- BaYi Children’s Hospital, The Seventh Medical Center of PLA General Hospital, Beijing, China
- National Engineering Laboratory for Birth Defects Prevention and Control of Key Technology, Beijing, China
- Beijing Key Laboratory of Pediatric Organ Failure, Beijing, China
| | - Wei Peng
- BaYi Children’s Hospital, The Seventh Medical Center of PLA General Hospital, Beijing, China
- National Engineering Laboratory for Birth Defects Prevention and Control of Key Technology, Beijing, China
- Beijing Key Laboratory of Pediatric Organ Failure, Beijing, China
| | - Wan-Qiao Zhang
- BaYi Children’s Hospital, The Seventh Medical Center of PLA General Hospital, Beijing, China
- National Engineering Laboratory for Birth Defects Prevention and Control of Key Technology, Beijing, China
- Beijing Key Laboratory of Pediatric Organ Failure, Beijing, China
| | - Jin-Yu Liang
- The Second People’s Hospital of Aohanqi, Inner Mongolia, China
| | - Wei-Dong Zhu
- The Second People’s Hospital of Aohanqi, Inner Mongolia, China
| | - Tai-Jiao Jiang
- Suzhou Institute of Systems Medicine, Chinese Academy of Medical Sciences, Suzhou, China
- Center of Systems Medicine, Institute of Basic Medical Sciences, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing, China
| | - Xin-Lei Zhang
- Suzhou Geneworks Technology Co., Ltd., Suzhou, China
| | - Zhi-Chun Feng
- BaYi Children’s Hospital, The Seventh Medical Center of PLA General Hospital, Beijing, China
- National Engineering Laboratory for Birth Defects Prevention and Control of Key Technology, Beijing, China
- Beijing Key Laboratory of Pediatric Organ Failure, Beijing, China
| |
|
20
|
Li X, Zhang S, Wong KC. Nature-Inspired Multiobjective Epistasis Elucidation from Genome-Wide Association Studies. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2020; 17:226-237. [PMID: 29994485 DOI: 10.1109/tcbb.2018.2849759] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/08/2023]
Abstract
In recent years, the detection of epistatic interactions of multiple genetic variants on the causes of complex diseases brings a significant challenge in genome-wide association studies (GWAS). However, most of the existing methods still suffer from algorithmic limitations such as single-objective optimization, intensive computational requirement, and premature convergence. In this paper, we propose and formulate an epistatic interaction multi-objective artificial bee colony algorithm based on decomposition (EIMOABC/D) to address those problems for genetic interaction detection in genome-wide association studies. First, to direct the genetic interaction detection, two objective functions are formulated to characterize various epistatic models; rank probability model is proposed to sort each population into different nondomination levels based on the fast nondominated sorting approach. After that, the mutual information based local search algorithm is proposed to guide the population search for disease model evaluations in an unbiased manner. To validate the effectiveness of EIMOABC/D, we compare EIMOABC/D against seven state-of-the-art methods on 77 epistatic models including eight small-scale epistatic models with marginal effects, eight large-scale epistatic models with marginal effects, 60 large-scale epistatic models without any marginal effect, and one case study. The experimental results indicate that our proposed algorithm EIMOABC/D outperforms seven state-of-the-art methods on those epistatic models. Furthermore, time complexity analysis and parameter analysis are conducted to demonstrate various properties of our proposed algorithm.
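The abstract above mentions a mutual-information-based local search and epistatic models without marginal effects. As a minimal sketch of why mutual information suits that setting, the following Python scores a SNP combination by the mutual information between its joint genotype and case/control status. The XOR-style toy data and function names are illustrative assumptions; this is not the EIMOABC/D algorithm or its objective functions.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """I(X;Y) in bits, estimated from paired discrete observations."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )


def snp_combination_mi(genotypes, phenotype, snp_indices):
    """MI between the joint genotype at `snp_indices` and case/control status."""
    joint = [tuple(row[i] for i in snp_indices) for row in genotypes]
    return mutual_information(joint, phenotype)


# Toy XOR-style model (hypothetical): status depends on the pair of SNPs,
# but neither SNP alone carries any information (no marginal effect).
genotypes = [(0, 0), (0, 1), (1, 0), (1, 1)] * 5
phenotype = [a ^ b for a, b in genotypes]

mi_single = snp_combination_mi(genotypes, phenotype, [0])
mi_pair = snp_combination_mi(genotypes, phenotype, [0, 1])
print(mi_single, mi_pair)
```

In this toy model the single-SNP score is zero while the pair carries one full bit, which is the situation where single-locus tests miss an interaction that a combination-level score can detect.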
|
21
|
Bi Q, Goodman KE, Kaminsky J, Lessler J. What is Machine Learning? A Primer for the Epidemiologist. Am J Epidemiol 2019; 188:2222-2239. [PMID: 31509183 DOI: 10.1093/aje/kwz189] [Citation(s) in RCA: 110] [Impact Index Per Article: 18.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/16/2018] [Revised: 07/29/2019] [Accepted: 08/14/2019] [Indexed: 12/22/2022] Open
Abstract
Machine learning is a branch of computer science that has the potential to transform epidemiologic sciences. Amid a growing focus on "Big Data," it offers epidemiologists new tools to tackle problems for which classical methods are not well-suited. In order to critically evaluate the value of integrating machine learning algorithms and existing methods, however, it is essential to address language and technical barriers between the two fields that can make it difficult for epidemiologists to read and assess machine learning studies. Here, we provide an overview of the concepts and terminology used in machine learning literature, which encompasses a diverse set of tools with goals ranging from prediction to classification to clustering. We provide a brief introduction to 5 common machine learning algorithms and 4 ensemble-based approaches. We then summarize epidemiologic applications of machine learning techniques in the published literature. We recommend approaches to incorporate machine learning in epidemiologic research and discuss opportunities and challenges for integrating machine learning and existing epidemiologic research methods.
Affiliation(s)
- Qifang Bi
- Department of Epidemiology, Bloomberg School of Public Health, Johns Hopkins University, Baltimore, Maryland
| | - Katherine E Goodman
- Department of Epidemiology, Bloomberg School of Public Health, Johns Hopkins University, Baltimore, Maryland
| | - Joshua Kaminsky
- Department of Epidemiology, Bloomberg School of Public Health, Johns Hopkins University, Baltimore, Maryland
| | - Justin Lessler
- Department of Epidemiology, Bloomberg School of Public Health, Johns Hopkins University, Baltimore, Maryland
| |
|
22
|
Cao X, Liu J, Guo M, Wang J. HiSSI: high-order SNP-SNP interactions detection based on efficient significant pattern and differential evolution. BMC Med Genomics 2019; 12:139. [PMID: 31888641 PMCID: PMC6936079 DOI: 10.1186/s12920-019-0584-6] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/28/2019] [Accepted: 09/10/2019] [Indexed: 11/10/2022] Open
Abstract
Background Detecting single nucleotide polymorphism (SNP) interactions is an important and challenging task in genome-wide association studies (GWAS). Various efforts have been devoted to detecting SNP interactions. However, the large volume of SNP datasets results in such a large number of high-order SNP combinations that the power of detecting interactions is restricted. Methods In this paper, to combat this challenge, we propose a two-stage approach (called HiSSI) to detect high-order SNP-SNP interactions. In the screening stage, HiSSI employs a statistically significant pattern that takes into account the family-wise error rate, to control false positives and to effectively screen a candidate set of two-locus combinations. In the searching stage, HiSSI applies two different search strategies (exhaustive search and heuristic search based on differential evolution along with the χ2-test) on candidate pairwise SNP combinations to detect high-order SNP interactions. Results Extensive experiments on simulated datasets are conducted to evaluate HiSSI and recently proposed related approaches on both two-locus and three-locus disease models. A real genome-wide dataset, a breast cancer dataset collected from the Wellcome Trust Case Control Consortium (WTCCC), is also used to test HiSSI. Conclusions Simulated experiments on both two-locus and three-locus disease models show that HiSSI is more powerful than other related approaches. A real experiment on the breast cancer dataset, in which HiSSI detects some significant two-locus and three-locus interactions associated with breast cancer, again corroborates the effectiveness of HiSSI in high-order SNP-SNP interaction identification.
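The searching stage described above relies on a χ2-test over SNP combinations. As a minimal sketch of that test alone, the following Python computes the Pearson chi-square statistic for a contingency table of joint genotype against case/control status. The genotype counts are invented toy data, and the family-wise error rate screening and differential evolution search of HiSSI are not reproduced.

```python
from collections import Counter


def chi_square_statistic(groups, labels):
    """Pearson chi-square statistic and degrees of freedom for a groups x labels table."""
    n = len(labels)
    observed = Counter(zip(groups, labels))
    row_totals = Counter(groups)
    col_totals = Counter(labels)
    stat = 0.0
    for g in row_totals:
        for l in col_totals:
            # Expected count under independence of genotype and status.
            expected = row_totals[g] * col_totals[l] / n
            stat += (observed[(g, l)] - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, dof


# Toy two-locus data (hypothetical): joint genotype of a SNP pair per subject.
cases = [(1, 1)] * 30 + [(0, 0)] * 10
controls = [(1, 1)] * 10 + [(0, 0)] * 30
groups = cases + controls
labels = ["case"] * len(cases) + ["control"] * len(controls)

stat, dof = chi_square_statistic(groups, labels)
print(stat, dof)
```

With one degree of freedom the 5% critical value is about 3.84, so a statistic this large would be called significant before any multiple-testing correction; in a genome-wide scan the threshold must be adjusted for the number of combinations tested.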
Affiliation(s)
- Xia Cao
- College of Computer and Information Science, Southwest University, Beibei, Chongqing, 400715, China
| | - Jie Liu
- College of Computer and Information Science, Southwest University, Beibei, Chongqing, 400715, China
| | - Maozu Guo
- School of Electrical and Information Engineering, Beijing University of Civil Engineering and Architecture, Beijing, 100044, China
- Beijing Key Laboratory of Intelligent Processing for Building Big Data, Beijing, 100044, China
| | - Jun Wang
- College of Computer and Information Science, Southwest University, Beibei, Chongqing, 400715, China.
| |
|
23
|
Liu D, Wang M, Yuan Y, Schwender H, Wang H, Wang P, Zhou Z, Li J, Wu T, Zhu H, Beaty TH. Gene-gene interaction among cell adhesion genes and risk of nonsyndromic cleft lip with or without cleft palate in Chinese case-parent trios. Mol Genet Genomic Med 2019; 7:e00872. [PMID: 31419083 PMCID: PMC6785639 DOI: 10.1002/mgg3.872] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/18/2019] [Revised: 05/27/2019] [Accepted: 07/08/2019] [Indexed: 01/07/2023] Open
Abstract
BACKGROUND Nonsyndromic cleft lip with or without cleft palate (NSCL/P) is a common birth defect with complex etiology. One strategy for studying the genetic risk factors of NSCL/P is to consider gene-gene interaction (G × G) among gene pathways having a role in craniofacial development. The present study aimed to investigate G × G interaction within the cell adhesion gene pathway. METHODS We carried out an interaction analysis of eight genes involved in cell adherens junctions among 806 NSCL/P Chinese case-parent trios originally recruited for a genome-wide association study (GWAS). A regression-based approach was used to test for two-way G × G interaction, while a machine learning algorithm was run to explore both two-way and multi-way interactions that may affect the risk of NSCL/P. RESULTS A two-way ACTN1 × CTNNB1 interaction reached the adjusted significance level. The single nucleotide polymorphism pair composed of rs17252114 (CTNNB1) and rs1274944 (ACTN1) yielded a p value of .0002, and this interaction was also supported by the logic regression algorithm. Higher order interactions involving ACTN1, CTNNB1, and CDH1 were picked out by logic regression, suggesting a potential role in NSCL/P risk. CONCLUSION This study provides, for the first time, evidence of both two-way and multi-way G × G interactions among cell adhesion genes contributing to NSCL/P risk.
Affiliation(s)
- Dongjing Liu: School of Public Health, Peking University, Beijing, China
- Mengying Wang: School of Public Health, Peking University, Beijing, China
- Yuan Yuan: School of Public Health, Peking University, Beijing, China
- Holger Schwender: Mathematical Institute, Heinrich Heine University Duesseldorf, Duesseldorf, Germany
- Hong Wang: School of Public Health, Peking University, Beijing, China
- Ping Wang: Beijing Center for Disease Prevention and Control, Beijing, China
- Zhibo Zhou: School of Stomatology, Peking University, Beijing, China
- Jing Li: School of Stomatology, Peking University, Beijing, China
- Tao Wu: School of Public Health, Peking University, Beijing, China; Key Laboratory of Reproductive Health, Ministry of Health, Beijing, China
- Hongping Zhu: School of Stomatology, Peking University, Beijing, China
- Terri H. Beaty: School of Public Health, Johns Hopkins University, Baltimore, Maryland, USA
|
24
|
Ansarifar J, Wang L. New algorithms for detecting multi-effect and multi-way epistatic interactions. Bioinformatics 2019; 35:5078-5085. [DOI: 10.1093/bioinformatics/btz463] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.0] [Received: 01/02/2019] [Revised: 04/14/2019] [Accepted: 05/31/2019] [Indexed: 11/14/2022]
Abstract
Motivation Epistasis, which is the phenomenon of genetic interactions, plays a central role in many scientific discoveries. However, due to the combinatorial nature of the problem, it is extremely challenging to decipher the exact combinations of genes that trigger the epistatic effects. Many existing methods only focus on two-way interactions. Some of the most effective methods used machine learning techniques, but many were designed for special case-and-control studies or suffer from overfitting. We propose three new algorithms for multi-effect and multi-way epistases detection, with one guaranteeing global optimality and the other two being local optimization oriented heuristics. Results The computational performance of the proposed heuristic algorithm was compared with several state-of-the-art methods using a yeast dataset. Results suggested that searching for the global optimal solution could be extremely time consuming, but the proposed heuristic algorithm was much more effective and efficient than others at finding a close-to-optimal solution. Moreover, it was able to provide biological insight on the exact configurations of epistases, besides achieving a higher prediction accuracy than the state-of-the-art methods. Availability and implementation Data source was publicly available and details are provided in the text.
|
25
|
Machine learning technology in the application of genome analysis: A systematic review. Gene 2019; 705:149-156. [PMID: 31026571 DOI: 10.1016/j.gene.2019.04.062] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.0] [Received: 01/24/2019] [Revised: 04/17/2019] [Accepted: 04/22/2019] [Indexed: 01/17/2023]
Abstract
Machine learning (ML) is a powerful technique for tackling many problems in data mining and predictive analytics. We believe that ML has considerable potential in the field of bioinformatics, since high-throughput technology is producing ever-increasing amounts of biological data. In this review, we summarize the major ML algorithms and the conditions that require attention when applying them to genomic problems, and we provide a list of examples from different perspectives along with present data analysis challenges.
|
26
|
A Machine Learning Approach to Predicting Case Duration for Robot-Assisted Surgery. J Med Syst 2019; 43:32. [PMID: 30612192 DOI: 10.1007/s10916-018-1151-y] [Citation(s) in RCA: 43] [Impact Index Per Article: 7.2] [Received: 08/23/2018] [Accepted: 12/25/2018] [Indexed: 01/22/2023]
Abstract
Robot-assisted surgery (RAS) requires a large capital investment by healthcare organizations. The cost of a robotic unit is fixed, so institutions must maximize use of each unit by utilizing all available operating room block time. One way to increase utilization is to accurately predict case durations. In this study, we sought to use machine learning to develop an accurate predictive model for RAS case duration. We analyzed a random sample of robotic cases at our institution from January 2014 to June 2017. We compared the machine learning models to the baseline model, which is the scheduled case duration (determined by previous case duration averages and surgeon adjustments). Specifically, we used: 1) multivariable linear regression, 2) ridge regression, 3) lasso regression, 4) random forest, 5) boosted regression tree, and 6) neural network. We found that all machine learning models decreased the average root-mean-squared error (RMSE) as compared to the baseline model. The average RMSE was lowest with the boosted regression tree (80.2 min, 95% CI 74.0-86.4), which was significantly lower than the baseline model (100.4 min, 95% CI 90.5-110.3). Using boosted regression tree, we can increase the number of accurately booked cases from 148 to 219 (34.9% to 51.7%, p < 0.001). This study shows that using various machine learning approaches can improve the accuracy of RAS case length predictions, which will increase utilization of this limited resource. Further work is needed to operationalize these findings.
|
27
|
Improving pharmacogenetic prediction of extrapyramidal symptoms induced by antipsychotics. Transl Psychiatry 2018; 8:276. [PMID: 30546092 PMCID: PMC6293322 DOI: 10.1038/s41398-018-0330-4] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.7] [Received: 05/22/2018] [Revised: 10/15/2018] [Accepted: 11/13/2018] [Indexed: 11/30/2022]
Abstract
In previous work we developed a pharmacogenetic predictor of antipsychotic (AP) induced extrapyramidal symptoms (EPS) based on four genes involved in mTOR regulation. The main objective is to improve this predictor by increasing its biological plausibility and replication. We re-sequence the four genes using next-generation sequencing. We predict functionality "in silico" of all identified SNPs and test it using gene reporter assays. Using functional SNPs, we develop a new predictor utilizing machine learning algorithms (Discovery Cohort, N = 131) and replicate it in two independent cohorts (Replication Cohort 1, N = 113; Replication Cohort 2, N = 113). After prioritization, four SNPs were used to develop the pharmacogenetic predictor of AP-induced EPS. The model constructed using the Naive Bayes algorithm achieved 66% accuracy in the Discovery Cohort and similar performance in the replication cohorts. The result is an improved pharmacogenetic predictor of AP-induced EPS, which is more robust and generalizable than the original.
|
28
|
Dorani F, Hu T, Woods MO, Zhai G. Ensemble learning for detecting gene-gene interactions in colorectal cancer. PeerJ 2018; 6:e5854. [PMID: 30397551 PMCID: PMC6211269 DOI: 10.7717/peerj.5854] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.7] [Received: 06/05/2018] [Accepted: 09/28/2018] [Indexed: 11/20/2022]
Abstract
Colorectal cancer (CRC) has a high incidence rate in both men and women and affects millions of people every year. Genome-wide association studies (GWAS) on CRC have successfully revealed common single-nucleotide polymorphisms (SNPs) associated with CRC risk. However, they can only explain a very limited fraction of the disease heritability. One reason may be the common uni-variable analyses in GWAS where genetic variants are examined one at a time. Given the complexity of cancers, the non-additive interaction effects among multiple genetic variants have a potential of explaining the missing heritability. In this study, we employed two powerful ensemble learning algorithms, random forests and gradient boosting machine (GBM), to search for SNPs that contribute to the disease risk through non-additive gene-gene interactions. We were able to find 44 possible susceptibility SNPs that were ranked most significant by both algorithms. Out of those 44 SNPs, 29 are in coding regions. The 29 genes include ARRDC5, DCC, ALK, and ITGA1, which have been found previously associated with CRC, and E2F3 and NID2, which are potentially related to CRC since they have known associations with other types of cancer. We performed pairwise and three-way interaction analysis on the 44 SNPs using information theoretical techniques and found 17 pairwise (p < 0.02) and 16 three-way (p ≤ 0.001) interactions among them. Moreover, functional enrichment analysis suggested 16 functional terms or biological pathways that may help us better understand the etiology of the disease.
Affiliation(s)
- Faramarz Dorani: Department of Computer Science, Memorial University, St. John's, Newfoundland and Labrador, Canada
- Ting Hu: Department of Computer Science, Memorial University, St. John's, Newfoundland and Labrador, Canada
- Michael O Woods: Faculty of Medicine, Memorial University, St. John's, Newfoundland and Labrador, Canada
- Guangju Zhai: Faculty of Medicine, Memorial University, St. John's, Newfoundland and Labrador, Canada
|
29
|
Lopez C, Tucker S, Salameh T, Tucker C. An unsupervised machine learning method for discovering patient clusters based on genetic signatures. J Biomed Inform 2018; 85:30-39. [PMID: 30016722 PMCID: PMC6621561 DOI: 10.1016/j.jbi.2018.07.004] [Citation(s) in RCA: 44] [Impact Index Per Article: 6.3] [Received: 12/21/2017] [Revised: 06/22/2018] [Accepted: 07/07/2018] [Indexed: 01/04/2023]
Abstract
INTRODUCTION Many chronic disorders have genomic etiology, disease progression, clinical presentation, and response to treatment that vary on a patient-to-patient basis. Such variability creates a need to identify characteristics within patient populations that have clinically relevant predictive value in order to advance personalized medicine. Unsupervised machine learning methods are suitable to address this type of problem, in which no a priori class label information is available to guide this search. However, it is challenging for existing methods to identify cluster memberships that are not just a result of natural sampling variation. Moreover, most of the current methods require researchers to provide specific input parameters a priori. METHOD This work presents an unsupervised machine learning method to cluster patients based on their genomic makeup without providing input parameters a priori. The method implements internal validity metrics to algorithmically identify the number of clusters, as well as statistical analyses to test for the significance of the results. Furthermore, the method takes advantage of the high degree of linkage disequilibrium between single nucleotide polymorphisms. Finally, a gene pathway analysis is performed to identify potential relationships between the clusters in the context of known biological knowledge. DATASETS AND RESULTS The method is tested with a cluster validation and a genomic dataset previously used in the literature. Benchmark results indicate that the proposed method provides the greatest performance out of the methods tested. Furthermore, the method is implemented on a sample genome-wide study dataset of 191 multiple sclerosis patients. The results indicate that the method was able to identify genetically distinct patient clusters without the need to select parameters a priori. 
Additionally, variants identified as significantly different between clusters are shown to be enriched for protein-protein interactions, especially in immune processes and cell adhesion pathways, via Gene Ontology term analysis. CONCLUSION Once links are drawn between clusters and clinically relevant outcomes, Immunochip data can be used to classify high-risk and newly diagnosed chronic disease patients into known clusters for predictive value. Further investigation can extend beyond pathway analysis to evaluate these clusters for clinical significance of genetically related characteristics such as age of onset, disease course, heritability, and response to treatment.
Affiliation(s)
- Christian Lopez: Industrial and Manufacturing Engineering, The Pennsylvania State University, University Park, PA 16802, USA
- Scott Tucker: Hershey College of Medicine, The Pennsylvania State University, Hershey, PA 17033, USA; Engineering Science and Mechanics, The Pennsylvania State University, University Park, PA 16802, USA
- Tarik Salameh: Hershey College of Medicine, The Pennsylvania State University, Hershey, PA 17033, USA
- Conrad Tucker: Industrial and Manufacturing Engineering, The Pennsylvania State University, University Park, PA 16802, USA; Engineering Design Technology and Professional Programs, The Pennsylvania State University, University Park, PA 16802, USA; Computer Science and Engineering, The Pennsylvania State University, University Park, PA 16802, USA
|
30
|
Fang YH, Wang JH, Hsiung CA. TSGSIS: a high-dimensional grouped variable selection approach for detection of whole-genome SNP-SNP interactions. Bioinformatics 2018. [PMID: 28651334 DOI: 10.1093/bioinformatics/btx409] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Indexed: 11/13/2022]
Abstract
Motivation Identification of single nucleotide polymorphism (SNP) interactions is an important and challenging topic in genome-wide association studies (GWAS). Many approaches have been applied to detecting whole-genome interactions. However, these approaches to interaction analysis tend to miss causal interaction effects when the individual marginal effects are uncorrelated with the trait, while their interaction effects are highly associated with the trait. Results A grouped variable selection technique, called two-stage grouped sure independence screening (TS-GSIS), is developed to study interactions that may not have marginal effects. The proposed TS-GSIS is shown to be very helpful in identifying not only causal SNP effects that are uncorrelated with the trait but also their corresponding SNP-SNP interaction effects. The benefits of TS-GSIS are improved detection of interaction effects, obtained by using the joint information among the SNPs, and determination of the size of candidate sets in the model. Simulation studies under various scenarios are performed to compare the performance of TS-GSIS and current approaches. We also apply our approach to a real rheumatoid arthritis (RA) dataset. Both the simulation and real data studies show that TS-GSIS performs very well in detecting SNP-SNP interactions. Availability and implementation R-package is delivered through CRAN and is available at: https://cran.r-project.org/web/packages/TSGSIS/index.html. Contact hsiung@nhri.org.tw. Supplementary information Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Yao-Hwei Fang: Division of Biostatistics and Bioinformatics, Institute of Population Health Sciences, National Health Research Institutes, Zhunan 35053, Taiwan
- Jie-Huei Wang: Division of Biostatistics and Bioinformatics, Institute of Population Health Sciences, National Health Research Institutes, Zhunan 35053, Taiwan
- Chao A Hsiung: Division of Biostatistics and Bioinformatics, Institute of Population Health Sciences, National Health Research Institutes, Zhunan 35053, Taiwan
|
31
|
ClusterMI: Detecting High-Order SNP Interactions Based on Clustering and Mutual Information. Int J Mol Sci 2018; 19:ijms19082267. [PMID: 30072632 PMCID: PMC6121365 DOI: 10.3390/ijms19082267] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.3] [Received: 06/28/2018] [Revised: 07/23/2018] [Accepted: 07/30/2018] [Indexed: 01/14/2023]
Abstract
Identifying single nucleotide polymorphism (SNP) interactions is considered a popular and crucial way of explaining the missing heritability of complex diseases in genome-wide association studies (GWAS). Many approaches have been proposed to detect SNP interactions. However, existing approaches generally suffer from the high computational complexity resulting from the explosion of candidate high-order interactions. In this paper, we propose a two-stage approach (called ClusterMI) to detect high-order genome-wide SNP interactions based on significant pairwise SNP combinations. In the screening stage, to alleviate the huge computational burden, ClusterMI first applies a clustering algorithm combined with mutual information to divide SNPs into different clusters. Then, ClusterMI utilizes conditional mutual information to screen significant pairwise SNP combinations in each cluster. In this way, there is a higher probability of identifying significant two-locus combinations in each group, and the computational load for the follow-up search can be greatly reduced. In the search stage, two different search strategies (exhaustive search and improved ant colony optimization search) are provided to detect high-order SNP interactions based on the cardinality of significant two-locus combinations. Extensive simulation experiments show that ClusterMI has better performance than other related and competitive approaches. Experiments on two real case-control datasets from Wellcome Trust Case Control Consortium (WTCCC) also demonstrate that ClusterMI is more capable of identifying high-order SNP interactions from genome-wide data.
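The two information measures named in the abstract above can be illustrated with a minimal sketch. This is not the ClusterMI implementation: the function names, the toy genotype vectors, and the choice to condition on the phenotype are assumptions made for illustration only. It shows how two SNPs with no marginal signal can still carry a full bit of information jointly.

```python
from collections import Counter
from math import log2


def entropy(*columns):
    """Joint Shannon entropy (in bits) of one or more equal-length discrete columns."""
    rows = list(zip(*columns))
    n = len(rows)
    return -sum((count / n) * log2(count / n) for count in Counter(rows).values())


def mutual_information(x, y):
    """I(X;Y) = H(X) + H(Y) - H(X,Y)."""
    return entropy(x) + entropy(y) - entropy(x, y)


def conditional_mutual_information(x, y, z):
    """I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z)."""
    return entropy(x, z) + entropy(y, z) - entropy(x, y, z) - entropy(z)


# Hypothetical toy data: the phenotype is the XOR of two binary-coded SNPs,
# a purely epistatic pattern with no marginal effect for either SNP.
snp_a = [0, 0, 1, 1, 0, 0, 1, 1]
snp_b = [0, 1, 0, 1, 0, 1, 0, 1]
phenotype = [a ^ b for a, b in zip(snp_a, snp_b)]

marginal_a = mutual_information(snp_a, phenotype)  # 0 bits: no marginal signal
pair_given_phenotype = conditional_mutual_information(snp_a, snp_b, phenotype)  # 1 bit
```

In this toy case a single-SNP screen would discard both loci, while the conditional measure flags the pair, which is the kind of situation pairwise screening is meant to catch.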
|
32
|
Machine learning-based identification of genetic interactions from heterogeneous gene expression profiles. PLoS One 2018; 13:e0201056. [PMID: 30048494 PMCID: PMC6062065 DOI: 10.1371/journal.pone.0201056] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.1] [Received: 11/29/2017] [Accepted: 07/06/2018] [Indexed: 02/02/2023]
Abstract
The identification of disease-related genes and disease mechanisms is an important research goal; many studies have approached this problem by analysing genetic networks based on gene expression profiles and interaction datasets. To construct a gene network, correlations or associations among pairs of genes must be obtained. However, when gene expression data are heterogeneous with high levels of noise for samples assigned to the same condition, it is difficult to accurately determine whether a gene pair represents a significant gene-gene interaction (GGI). In order to solve this problem, we proposed a random forest-based method to classify significant GGIs from gene expression data. To train the model, we defined novel feature sets and utilised various high-confidence interactome datasets to deduce the correct answer set from known disease-specific genes. Using Alzheimer's disease data, the proposed method showed remarkable accuracy, and the GGIs established in the analysis can be used to build a meaningful genetic network that can explain the mechanisms underlying Alzheimer's disease.
|
33
|
Abstract
Next Generation Sequencing (NGS) or deep sequencing technology enables parallel reading of multiple individual DNA fragments, thereby enabling the identification of millions of base pairs in several hours. Recent research has clearly shown that machine learning technologies can efficiently analyse large sets of genomic data and help to identify novel gene functions and regulation regions. A deep artificial neural network consists of a group of artificial neurons that mimic the properties of living neurons. These mathematical models, termed Artificial Neural Networks (ANN), can be used to solve artificial intelligence engineering problems in several different technological fields (e.g., biology, genomics, proteomics, and metabolomics). In practical terms, neural networks are non-linear statistical structures that are organized as modelling tools and are used to simulate complex genomic relationships between inputs and outputs. To date, Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNN) have been demonstrated to be the best tools for improving performance in problem solving tasks within the genomic field.
|
34
|
Ahmad F, Debes PV, Palomar G, Vasemägi A. Association mapping reveals candidate loci for resistance and anaemic response to an emerging temperature-driven parasitic disease in a wild salmonid fish. Mol Ecol 2018; 27:1385-1401. [PMID: 29411465 DOI: 10.1111/mec.14509] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.3] [Received: 05/19/2017] [Accepted: 01/08/2018] [Indexed: 02/06/2023]
Abstract
Even though parasitic infections are often costly or deadly for the host, we know very little about which genes influence parasite susceptibility and disease severity. Proliferative kidney disease is an emerging and, at elevated water temperatures, potentially deadly disease of salmonid fishes that is caused by the myxozoan parasite Tetracapsuloides bryosalmonae. By screening >7.6 K SNPs in 255 wild brown trout (Salmo trutta) and combining association mapping and Random Forest approaches, we identified several candidate genes for both the parasite resistance (inverse of relative parasite load; RPL) and the severe anaemic response to the parasite. The strongest RPL-associated SNP mapped to a noncoding region of the congeneric Atlantic salmon (S. salar) chromosome 10, whereas the second strongest RPL-associated SNP mapped to an intronic region of the PRICKLE2 gene, which is a part of the planar cell polarity signalling pathway involved in kidney development. The top SNP associated with anaemia mapped to the intron of the putative PRKAG2 gene. The human ortholog of this gene has been associated with haematocrit and other blood-related traits, making it a prime candidate influencing parasite-triggered anaemia in brown trout. Our findings demonstrate the power of association mapping to pinpoint genomic regions and potential causative genes underlying climate change-driven parasitic disease resistance and severity. Furthermore, this work illustrates the first steps towards dissecting genotype-phenotype links in a wild fish population using closely related genome information.
Affiliation(s)
- F Ahmad: Department of Biology, University of Turku, Turku, Finland
- P V Debes: Department of Biology, University of Turku, Turku, Finland; Department of Biosciences, University of Helsinki, Helsinki, Finland
- G Palomar: Research Unit of Biodiversity (UO-CSIC-PA), Mieres, Asturias, Spain; Department of Biology of Organisms and Systems, University of Oviedo, Oviedo, Asturias, Spain
- A Vasemägi: Department of Biology, University of Turku, Turku, Finland; Chair of Aquaculture, Institute of Veterinary Medicine and Animal Sciences, Estonian University of Life Sciences, Tartu, Estonia
|
35
|
Uppu S, Krishna A, Gopalan RP. A Review on Methods for Detecting SNP Interactions in High-Dimensional Genomic Data. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2018; 15:599-612. [PMID: 28060710 DOI: 10.1109/tcbb.2016.2635125] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.9] [Indexed: 06/06/2023]
Abstract
In this era of genome-wide association studies (GWAS), the quest for understanding the genetic architecture of complex diseases is rapidly increasing more than ever before. The development of high throughput genotyping and next generation sequencing technologies enables genetic epidemiological analysis of large scale data. These advances have led to the identification of a number of single nucleotide polymorphisms (SNPs) responsible for disease susceptibility. The interactions between SNPs associated with complex diseases are increasingly being explored in the current literature. These interaction studies are mathematically challenging and computationally complex. These challenges have been addressed by a number of data mining and machine learning approaches. This paper reviews the current methods and the related software packages to detect the SNP interactions that contribute to diseases. The issues that need to be considered when developing these models are addressed in this review. The paper also reviews the achievements in data simulation to evaluate the performance of these models. Further, it discusses the future of SNP interaction analysis.
|
36
|
Rosellini AJ, Stein MB, Benedek DM, Bliese PD, Chiu WT, Hwang I, Monahan J, Nock MK, Petukhova MV, Sampson NA, Street AE, Zaslavsky AM, Ursano RJ, Kessler RC. Using self-report surveys at the beginning of service to develop multi-outcome risk models for new soldiers in the U.S. Army. Psychol Med 2017; 47:2275-2287. [PMID: 28374665 PMCID: PMC5679702 DOI: 10.1017/s003329171700071x] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.3] [Indexed: 11/07/2022]
Abstract
BACKGROUND The U.S. Army uses universal preventive interventions for several negative outcomes (e.g. suicide, violence, sexual assault) with especially high risks in the early years of service. More intensive interventions exist, but would be cost-effective only if targeted at high-risk soldiers. We report results of efforts to develop models for such targeting from self-report surveys administered at the beginning of Army service. METHODS 21 832 new soldiers completed a self-administered questionnaire (SAQ) in 2011-2012 and consented to link administrative data to SAQ responses. Penalized regression models were developed for 12 administratively-recorded outcomes occurring by December 2013: suicide attempt, mental hospitalization, positive drug test, traumatic brain injury (TBI), other severe injury, several types of violence perpetration and victimization, demotion, and attrition. RESULTS The best-performing models were for TBI (AUC = 0.80), major physical violence perpetration (AUC = 0.78), sexual assault perpetration (AUC = 0.78), and suicide attempt (AUC = 0.74). Although predicted risk scores were significantly correlated across outcomes, prediction was not improved by including risk scores for other outcomes in models. Of particular note: 40.5% of suicide attempts occurred among the 10% of new soldiers with highest predicted risk, 57.2% of male sexual assault perpetrations among the 15% with highest predicted risk, and 35.5% of female sexual assault victimizations among the 10% with highest predicted risk. CONCLUSIONS Data collected at the beginning of service in self-report surveys could be used to develop risk models that define small proportions of new soldiers accounting for high proportions of negative outcomes over the first few years of service.
Affiliation(s)
- Anthony J. Rosellini: Department of Health Care Policy, Harvard Medical School, Boston, Massachusetts, USA
- Murray B. Stein: Departments of Psychiatry and Family Medicine & Public Health, University of California San Diego, La Jolla, California, USA; VA San Diego Healthcare System, San Diego, CA, USA
- David M. Benedek: Center for the Study of Traumatic Stress, Department of Psychiatry, Uniformed Services University School of Medicine, Bethesda, MD, USA
- Paul D. Bliese: Darla Moore School of Business, University of South Carolina, Columbia, South Carolina, USA
- Wai Tat Chiu: Department of Health Care Policy, Harvard Medical School, Boston, Massachusetts, USA
- Irving Hwang: Department of Health Care Policy, Harvard Medical School, Boston, Massachusetts, USA
- John Monahan: School of Law, University of Virginia, Charlottesville, VA, USA
- Matthew K. Nock: Department of Psychology, Harvard University, Cambridge, Massachusetts, USA
- Maria V. Petukhova: Department of Health Care Policy, Harvard Medical School, Boston, Massachusetts, USA
- Nancy A. Sampson: Department of Health Care Policy, Harvard Medical School, Boston, Massachusetts, USA
- Amy E. Street: National Center for PTSD, VA Boston Healthcare System, Boston, Massachusetts, USA; Department of Psychiatry, Boston University School of Medicine, Boston, Massachusetts, USA
- Alan M. Zaslavsky: Department of Health Care Policy, Harvard Medical School, Boston, Massachusetts, USA
- Robert J. Ursano: Center for the Study of Traumatic Stress, Department of Psychiatry, Uniformed Services University School of Medicine, Bethesda, MD, USA
- Ronald C. Kessler: Department of Health Care Policy, Harvard Medical School, Boston, Massachusetts, USA
|
37
|
Kessler RC, Hwang I, Hoffmire CA, McCarthy JF, Petukhova MV, Rosellini AJ, Sampson NA, Schneider AL, Bradley PA, Katz IR, Thompson C, Bossarte RM. Developing a practical suicide risk prediction model for targeting high-risk patients in the Veterans health Administration. Int J Methods Psychiatr Res 2017; 26:e1575. [PMID: 28675617 PMCID: PMC5614864 DOI: 10.1002/mpr.1575] [Citation(s) in RCA: 124] [Impact Index Per Article: 15.5] [Received: 05/03/2017] [Revised: 05/25/2017] [Accepted: 05/30/2017] [Indexed: 01/19/2023]
Abstract
OBJECTIVES The US Veterans Health Administration (VHA) has begun using predictive modeling to identify Veterans at high suicide risk to target care. Initial analyses are reported here. METHODS A penalized logistic regression model was compared with an earlier proof-of-concept logistic model. Exploratory analyses then considered commonly-used machine learning algorithms. Analyses were based on electronic medical records for all 6,360 individuals classified in the National Death Index as having died by suicide in fiscal years 2009-2011 who used VHA services the year of their death or prior year and a 1% probability sample of time-matched VHA service users alive at the index date (n = 2,112,008). RESULTS A penalized logistic model with 61 predictors had sensitivity comparable to the proof-of-concept model (which had 381 predictors) at target thresholds. The machine learning algorithms had relatively similar sensitivities, the highest being for Bayesian additive regression trees, with 10.7% of suicides occurring among the 1.0% of Veterans with highest predicted risk and 28.1% among the 5.0% with highest predicted risk. CONCLUSIONS Based on these results, VHA is using penalized logistic regression in initial intervention implementation. The paper concludes with a discussion of other practical issues that might be explored to increase model performance.
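The "share of events among the top X% of predicted risk" figures reported in the abstract above are a sensitivity at a fixed targeting threshold. A minimal sketch of that calculation follows; the function name and the toy scores are hypothetical, and nothing here reproduces the VHA model or its data.

```python
def share_of_events_in_top(risk_scores, events, top_fraction):
    """Fraction of all observed events found among the highest-risk individuals.

    risk_scores: predicted risk per individual (higher means riskier)
    events: 1 if the outcome occurred for that individual, else 0
    top_fraction: share of the population to target, e.g. 0.05 for the top 5%
    """
    if not 0 < top_fraction <= 1:
        raise ValueError("top_fraction must be in (0, 1]")
    total_events = sum(events)
    if total_events == 0:
        raise ValueError("no events observed")
    # Size of the targeted group, at least one individual.
    n_top = max(1, round(len(risk_scores) * top_fraction))
    # Rank individuals from highest to lowest predicted risk.
    ranked = sorted(zip(risk_scores, events), key=lambda pair: pair[0], reverse=True)
    captured = sum(event for _, event in ranked[:n_top])
    return captured / total_events


# Hypothetical toy data: ten individuals, three observed events.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
events = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]

top_20 = share_of_events_in_top(scores, events, 0.2)  # 1 of 3 events captured
top_30 = share_of_events_in_top(scores, events, 0.3)  # 2 of 3 events captured
```

Reporting this quantity at a few fixed fractions, rather than a single AUC, ties model performance directly to the size of the group an intervention can afford to reach.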
Affiliation(s)
- Ronald C Kessler: Department of Health Care Policy, Harvard Medical School, Boston, Massachusetts, USA
- Irving Hwang: Department of Health Care Policy, Harvard Medical School, Boston, Massachusetts, USA
- Claire A Hoffmire: VISN 19 Mental Illness Research, Education and Clinical Care Center, Denver, Colorado, USA
- John F McCarthy: Office of Mental Health Operations, VA Center for Clinical Management Research, Serious Mental Illness Treatment Resource and Evaluation Center, Ann Arbor, Michigan, USA
- Maria V Petukhova: Department of Health Care Policy, Harvard Medical School, Boston, Massachusetts, USA
- Anthony J Rosellini: Center for Anxiety and Related Disorders, Boston University, Boston, Massachusetts, USA
- Nancy A Sampson: Department of Health Care Policy, Harvard Medical School, Boston, Massachusetts, USA
- Alexandra L Schneider: VISN 19 Mental Illness Research, Education and Clinical Care Center, Denver, Colorado, USA
- Paul A Bradley: PricewaterhouseCoopers PS LLP, Washington, District of Columbia, USA
- Ira R Katz: Office of Mental Health Operations, Veterans Health Administration, Washington, District of Columbia, USA
- Caitlin Thompson: Office of Suicide Prevention, Veterans Health Administration, Washington, District of Columbia, USA; Department of Psychiatry, University of Rochester, Rochester, New York, USA
- Robert M Bossarte: West Virginia University Injury Control Research Center and Department of Behavioral Medicine and Psychiatry, West Virginia University School of Medicine, Morgantown, West Virginia, USA; Office of Suicide Prevention and VISN 2 Center of Excellence for Suicide Prevention, Veterans Health Administration, Washington, District of Columbia, USA
|
38
|
|
39
|
Abo Alchamlat S, Farnir F. KNN-MDR: a learning approach for improving interactions mapping performances in genome wide association studies. BMC Bioinformatics 2017; 18:184. [PMID: 28327091 PMCID: PMC5361736 DOI: 10.1186/s12859-017-1599-7] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.3] [Received: 04/06/2016] [Accepted: 03/11/2017] [Indexed: 12/30/2022]
Abstract
Background Finding epistatic interactions in large association studies such as genome-wide association studies (GWAS), given the large volume of genomic data now available, is a challenging and largely unsolved issue. Few previous studies could handle genome-wide data because of the intractable difficulties of searching a combinatorially explosive search space and statistically evaluating epistatic interactions given a limited number of samples. Our work is a contribution to this field. We propose a novel approach combining K-Nearest Neighbors (KNN) and Multi Dimensional Reduction (MDR) methods for detecting gene-gene interactions as a possible alternative to existing algorithms, especially in situations where the number of involved determinants is high. After describing the approach, a comparison of our method (KNN-MDR) to a set of the other best-performing methods (i.e., MDR, BOOST, BHIT, MegaSNPHunter and AntEpiSeeker) is carried out to detect interactions using simulated data as well as real genome-wide data. Results Experimental results on both simulated data and real genome-wide data show that KNN-MDR has interesting properties in terms of accuracy and power, and that, in many cases, it significantly outperforms its recent competitors. Conclusions The presented methodology (KNN-MDR) is valuable in the context of loci and interactions mapping and can be seen as an interesting addition to the arsenal used in complex traits analyses.
Affiliation(s)
- Sinan Abo Alchamlat
- Department of Biostatistics, Faculty of Veterinary Medicine, FARAH, University of Liège, Sart Tilman B43, 4000, Liege, Belgium
- Frédéric Farnir
- Department of Biostatistics, Faculty of Veterinary Medicine, FARAH, University of Liège, Sart Tilman B43, 4000, Liege, Belgium
|
40
|
Using patient self-reports to study heterogeneity of treatment effects in major depressive disorder. Epidemiol Psychiatr Sci 2017; 26:22-36. [PMID: 26810628 PMCID: PMC5125904 DOI: 10.1017/s2045796016000020] [Citation(s) in RCA: 112] [Impact Index Per Article: 14.0] [Indexed: 02/03/2023]
Abstract
BACKGROUND Clinicians need guidance to address the heterogeneity of treatment responses of patients with major depressive disorder (MDD). While prediction schemes based on symptom clustering and biomarkers have so far not yielded results of sufficient strength to inform clinical decision-making, prediction schemes based on big data predictive analytic models might be more practically useful. METHOD We review evidence suggesting that prediction equations based on symptoms and other easily assessed clinical features found in previous research to predict MDD treatment outcomes might provide a foundation for developing predictive analytic clinical decision support models that could help clinicians select optimal (personalised) MDD treatments. These methods could also be useful in targeting patient subsamples for more expensive biomarker assessments. RESULTS Approximately two dozen baseline variables obtained from medical records or patient reports have been found repeatedly in MDD treatment trials to predict overall treatment outcomes (i.e., intervention v. control) or differential treatment outcomes (i.e., intervention A v. intervention B). Similar evidence has been found in observational studies of MDD persistence-severity. However, no treatment studies have yet attempted to develop treatment outcome equations using the full set of these predictors. Promising preliminary empirical results coupled with recent developments in statistical methodology suggest that models could be developed to provide useful clinical decision support in personalised treatment selection. These tools could also provide a strong foundation to increase statistical power in focused studies of biomarkers and MDD heterogeneity of treatment response in subsequent controlled trials.
CONCLUSIONS Coordinated efforts are needed to develop a protocol for systematically collecting information about established predictors of heterogeneity of MDD treatment response in large observational treatment studies, applying and refining these models in subsequent pragmatic trials, carrying out pooled secondary analyses to extract the maximum amount of information from these coordinated studies, and using this information to focus future discovery efforts in the segment of the patient population in which continued uncertainty about treatment response exists.
|
41
|
Evaluation of associative classification-based multifactor dimensionality reduction in the presence of noise. ACTA ACUST UNITED AC 2016. [DOI: 10.1007/s13721-016-0114-9] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 11/27/2022]
|
42
|
König IR, Auerbach J, Gola D, Held E, Holzinger ER, Legault MA, Sun R, Tintle N, Yang HC. Machine learning and data mining in complex genomic data--a review on the lessons learned in Genetic Analysis Workshop 19. BMC Genet 2016; 17 Suppl 2:1. [PMID: 26866367 PMCID: PMC4895282 DOI: 10.1186/s12863-015-0315-8] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.2] [Indexed: 12/03/2022]
Abstract
In the analysis of current genomic data, application of machine learning and data mining techniques has become more attractive given the rising complexity of the projects. As part of the Genetic Analysis Workshop 19, approaches from this domain were explored, mostly motivated from two starting points. First, assuming an underlying structure in the genomic data, data mining might identify this and thus improve downstream association analyses. Second, computational methods for machine learning need to be developed further to efficiently deal with the current wealth of data. In the course of discussing results and experiences from the machine learning and data mining approaches, six common messages were extracted. These depict the current state of these approaches in the application to complex genomic data. Although some challenges remain for future studies, important forward steps were taken in the integration of different data types and the evaluation of the evidence. Mining the data for underlying genetic or phenotypic structure and using this information in subsequent analyses proved to be extremely helpful and is likely to become of even greater use with more complex data sets.
Affiliation(s)
- Inke R König
- Institut für Medizinische Biometrie und Statistik, Universität zu Lübeck, Universitätsklinikum Schleswig-Holstein, Campus Lübeck, Lübeck, Germany
- Jonathan Auerbach
- Department of Statistics, Columbia University, New York, NY, 10027, USA
- Damian Gola
- Institut für Medizinische Biometrie und Statistik, Universität zu Lübeck, Universitätsklinikum Schleswig-Holstein, Campus Lübeck, Lübeck, Germany
- Elizabeth Held
- Department of Mathematics, Iowa State University, Ames, IA, 50011, USA
- Emily R Holzinger
- Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Baltimore, MD, 21224, USA
- Marc-André Legault
- Université de Montréal, Faculty of Medicine, 2900 Chemin de la Tour, Montreal, QC, H3T 1N8, Canada
- Rui Sun
- Division of Biostatistics, School of Public Health and Primary Care, the Chinese University of Hong Kong, Shatin, Hong Kong SAR
- Nathan Tintle
- Department of Mathematics, Statistics and Computer Science, Dordt College, Sioux Center, IA, 51250, USA
- Hsin-Chou Yang
- Institute of Statistical Science, Academia Sinica, Nankang 115, Taipei, Taiwan
|
43
|
Lynch SM, Moore JH. A call for biological data mining approaches in epidemiology. BioData Min 2016; 9:1. [PMID: 26734074 PMCID: PMC4700596 DOI: 10.1186/s13040-015-0079-8] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.6] [Received: 12/23/2015] [Accepted: 12/29/2015] [Indexed: 12/28/2022]
Affiliation(s)
- Shannon M Lynch
- Cancer Prevention and Control, Fox Chase Cancer Center, Philadelphia, PA 19111 USA
- Jason H Moore
- Department of Biostatistics and Epidemiology, Institute for Biomedical Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104-6021 USA
|
44
|
Jeanquartier F, Jean-Quartier C, Kotlyar M, Tokar T, Hauschild AC, Jurisica I, Holzinger A. Machine Learning for In Silico Modeling of Tumor Growth. LECTURE NOTES IN COMPUTER SCIENCE 2016. [DOI: 10.1007/978-3-319-50478-0_21] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.8] [Indexed: 12/26/2022]
|
45
|
A Multifactor Dimensionality Reduction Based Associative Classification for Detecting SNP Interactions. ACTA ACUST UNITED AC 2015. [DOI: 10.1007/978-3-319-26532-2_36] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 02/25/2023]
|
46
|
Mas S, Gassó P, Lafuente A. Applicability of gene expression and systems biology to develop pharmacogenetic predictors; antipsychotic-induced extrapyramidal symptoms as an example. Pharmacogenomics 2015; 16:1975-88. [PMID: 26556470 DOI: 10.2217/pgs.15.134] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.0] [Indexed: 12/20/2022]
Abstract
Pharmacogenetics has been driven by a candidate gene approach. The disadvantage of this approach is that it is limited by our current understanding of the mechanisms by which drugs act. Gene expression could help to elucidate the molecular signatures of antipsychotic treatments by searching for dysregulated molecular pathways and the relationships between gene products, especially protein-protein interactions. To embrace the complexity of drug response, machine learning methods could help to identify gene-gene interactions and develop pharmacogenetic predictors of drug response. The present review summarizes the applicability of the topics presented here (gene expression, network analysis and gene-gene interactions) in pharmacogenetics. To this end, we present an example of identifying genetic predictors of extrapyramidal symptoms induced by antipsychotics.
Affiliation(s)
- Sergi Mas
- Department of Pathological Anatomy, Pharmacology & Microbiology, University of Barcelona, Spain; Institut d'Investigacions Biomèdiques August Pi i Sunyer (IDIBAPS), Barcelona, Spain; Centro de Investigación Biomédica en Red de Salud Mental (CIBERSAM), Spain
- Patricia Gassó
- Department of Pathological Anatomy, Pharmacology & Microbiology, University of Barcelona, Spain; Institut d'Investigacions Biomèdiques August Pi i Sunyer (IDIBAPS), Barcelona, Spain
- Amelia Lafuente
- Department of Pathological Anatomy, Pharmacology & Microbiology, University of Barcelona, Spain; Institut d'Investigacions Biomèdiques August Pi i Sunyer (IDIBAPS), Barcelona, Spain; Centro de Investigación Biomédica en Red de Salud Mental (CIBERSAM), Spain
|
47
|
Carey CE, Agrawal A, Zhang B, Conley ED, Degenhardt L, Heath AC, Li D, Lynskey MT, Martin NG, Montgomery GW, Wang T, Bierut LJ, Hariri AR, Nelson EC, Bogdan R. Monoacylglycerol lipase (MGLL) polymorphism rs604300 interacts with childhood adversity to predict cannabis dependence symptoms and amygdala habituation: Evidence from an endocannabinoid system-level analysis. JOURNAL OF ABNORMAL PSYCHOLOGY 2015; 124:860-77. [PMID: 26595473 PMCID: PMC4700831 DOI: 10.1037/abn0000079] [Citation(s) in RCA: 32] [Impact Index Per Article: 3.2] [Indexed: 12/22/2022]
Abstract
Despite evidence for heritable variation in cannabis involvement and the discovery of cannabinoid receptors and their endogenous ligands, no consistent patterns have emerged from candidate endocannabinoid (eCB) genetic association studies of cannabis involvement. Given interactions between eCB and stress systems and associations between childhood stress and cannabis involvement, it may be important to consider childhood adversity in the context of eCB-related genetic variation. We employed a system-level gene-based analysis of data from the Comorbidity and Trauma Study (N = 1,558) to examine whether genetic variation in six eCB genes (anabolism: DAGLA, DAGLB, NAPEPLD; catabolism: MGLL, FAAH; binding: CNR1; SNPs N = 65) and childhood sexual abuse (CSA) predict cannabis dependence symptoms. Significant interactions with CSA emerged for MGLL at the gene level (p = .009), and for rs604300 within MGLL (ΔR² = .007, p < .001), the latter of which survived SNP-level Bonferroni correction and was significant in an additional sample with similar directional effects (N = 859; ΔR² = .005, p = .026). Furthermore, in a third sample (N = 312), there was evidence that rs604300 genotype interacts with early life adversity to predict threat-related basolateral amygdala habituation, a neural phenotype linked to the eCB system and addiction (ΔR² = .013, p = .047). Rs604300 may be related to epigenetic modulation of MGLL expression. These results are consistent with rodent models implicating 2-arachidonoylglycerol (2-AG), an endogenous cannabinoid metabolized by the enzyme encoded by MGLL, in the etiology of stress adaptation related to cannabis dependence, but require further replication.
Affiliation(s)
- Caitlin E Carey
- Department of Psychology, Washington University in St. Louis
- Arpana Agrawal
- Department of Psychiatry, Washington University in St. Louis
- Bo Zhang
- Department of Genetics, Washington University in St. Louis
- Louisa Degenhardt
- National Drug and Alcohol Research Centre, University of New South Wales
- Andrew C Heath
- Department of Psychiatry, Washington University in St. Louis
- Daofeng Li
- Department of Genetics, Washington University in St. Louis
- Ting Wang
- Department of Genetics, Washington University in St. Louis
- Laura J Bierut
- Department of Psychiatry, Washington University in St. Louis
- Ahmad R Hariri
- Department of Psychology and Neuroscience, Duke University
- Elliot C Nelson
- Department of Psychiatry, Washington University in St. Louis
- Ryan Bogdan
- Department of Psychology, Washington University in St. Louis
|
48
|
Lee WP, Lin CH. Combining Expression Data and Knowledge Ontology for Gene Clustering and Network Reconstruction. Cognit Comput 2015. [DOI: 10.1007/s12559-015-9349-5] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Indexed: 10/23/2022]
|
49
|
Abstract
The field of machine learning, which aims to develop computer algorithms that improve with experience, holds promise to enable computers to assist humans in the analysis of large, complex data sets. Here, we provide an overview of machine learning applications for the analysis of genome sequencing data sets, including the annotation of sequence elements and epigenetic, proteomic or metabolomic data. We present considerations and recurrent challenges in the application of supervised, semi-supervised and unsupervised machine learning methods, as well as of generative and discriminative modelling approaches. We provide general guidelines to assist in the selection of these machine learning methods and their practical application for the analysis of genetic and genomic data sets.
Affiliation(s)
- Maxwell W Libbrecht
- Department of Computer Science and Engineering, University of Washington, 185 Stevens Way, Seattle, Washington 98195-2350, USA
- William Stafford Noble
- [1] Department of Computer Science and Engineering, University of Washington, 185 Stevens Way, Seattle, Washington 98195-2350, USA; [2] Department of Genome Sciences, University of Washington, 3720 15th Ave NE Seattle, Washington 98195-5065, USA
|
50
|
A gene-based information gain method for detecting gene-gene interactions in case-control studies. Eur J Hum Genet 2015; 23:1566-72. [PMID: 25758991 DOI: 10.1038/ejhg.2015.16] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.5] [Received: 09/28/2014] [Revised: 11/30/2014] [Accepted: 01/14/2015] [Indexed: 12/31/2022]
Abstract
Currently, most methods for detecting gene-gene interactions (GGIs) in genome-wide association studies are divided into SNP-based methods and gene-based methods. Generally, the gene-based methods can be more powerful than SNP-based methods. Some gene-based entropy methods can only capture the linear relationship between genes. We therefore proposed a nonparametric gene-based information gain method (GBIGM) that can capture both linear relationships and nonlinear correlations between genes. Through simulations with different odds ratios, sample sizes and prevalence rates, GBIGM was shown to be valid and more powerful than the classic KCCU method and the SNP-based entropy method. In the analysis of data from 17 genes on rheumatoid arthritis, GBIGM was more effective than the other two methods as it obtained fewer significant results, which was important for biological verification. Therefore, GBIGM is a suitable and powerful tool for detecting GGIs in case-control studies.
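The idea of information gain behind this abstract can be sketched with generic entropy calculations in Python. This is not the GBIGM statistic itself, which the abstract does not specify; the function names and the toy genotype data are hypothetical. The example uses an XOR-like pattern to show how two markers can each carry no information about case status alone while jointly determining it.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def information_gain(features, labels):
    """H(labels) - H(labels | features) for discrete (hashable) features."""
    n = len(labels)
    groups = {}
    for feature, label in zip(features, labels):
        groups.setdefault(feature, []).append(label)
    # Conditional entropy: entropy within each feature group, weighted by group size.
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


# Toy data (hypothetical): case status follows an XOR of two binary markers.
snp_a = [0, 0, 1, 1]
snp_b = [0, 1, 0, 1]
status = [0, 1, 1, 0]

print(information_gain(snp_a, status))                    # marker A alone
print(information_gain(snp_b, status))                    # marker B alone
print(information_gain(list(zip(snp_a, snp_b)), status))  # both markers jointly
```

Treating the pair of markers as a single joint feature is the design choice that lets an entropy-based measure pick up a purely interactive (nonlinear) effect.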
|