1
Wani A, Katrinli S, Zhao X, Daskalakis N, Zannas A, Aiello A, Baker D, Boks M, Brick L, Chen CY, Dalvie S, Fortier C, Geuze E, Hayes J, Kessler R, King A, Koen N, Liberzon I, Lori A, Luykx J, Maihofer A, Milberg W, Miller M, Mufford M, Nugent N, Rauch S, Ressler K, Risbrough V, Rutten B, Stein D, Stein M, Ursano R, Verfaellie M, Ware E, Wildman D, Wolf E, Nievergelt C, Logue M, Smith A, Uddin M, Vermetten E, Vinkers C. Blood-based DNA methylation and exposure risk scores predict PTSD with high accuracy in military and civilian cohorts. Research Square 2024:rs.3.rs-3952163. [PMID: 38410438 PMCID: PMC10896387 DOI: 10.21203/rs.3.rs-3952163/v1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/28/2024]
Abstract
Background Incorporating genomic data into risk prediction has become an increasingly useful approach for rapid identification of individuals most at risk for complex disorders such as PTSD. Our goal was to develop and validate Methylation Risk Scores (MRS) using machine learning to distinguish individuals who have PTSD from those who do not. Methods Elastic Net was used to develop three risk score models using a discovery dataset (n = 1226; 314 cases, 912 controls) comprising 5 diverse cohorts with available blood-derived DNA methylation (DNAm) measured on the Illumina Epic BeadChip. The first risk score, exposure and methylation risk score (eMRS), used cumulative and childhood trauma exposure and DNAm variables; the second, methylation-only risk score (MoRS), was based solely on DNAm data; the third, methylation-only risk scores with adjusted exposure variables (MoRSAE), utilized DNAm data adjusted for the two exposure variables. The potential of these risk scores to predict future PTSD based on pre-deployment data was also assessed. External validation of risk scores was conducted in four independent cohorts. Results The eMRS model showed the highest accuracy (92%), precision (91%), recall (87%), and f1-score (89%) in classifying PTSD using 3730 features. While still highly accurate, the MoRS (accuracy = 89%) using 3728 features and MoRSAE (accuracy = 84%) using 4150 features showed a decline in classification power. eMRS significantly predicted PTSD in one of the four independent cohorts, the BEAR cohort (beta = 0.6839, p = 0.003), but not in the remaining three cohorts. Pre-deployment risk scores from all models (eMRS, beta = 1.92; MoRS, beta = 1.99 and MoRSAE, beta = 1.77) displayed a significant (p < 0.001) predictive power for post-deployment PTSD. Conclusion Results, especially those from the eMRS, reinforce earlier findings that methylation and trauma are interconnected and can be leveraged to increase the correct classification of those with vs. without PTSD. Moreover, our models can potentially be a valuable tool in predicting the future risk of developing PTSD. As more data become available, including additional molecular, environmental, and psychosocial factors in these scores may enhance their accuracy in predicting the condition and, relatedly, improve their performance in independent cohorts.
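Illustrative note: the abstract above reports accuracy, precision, recall, and f1-score for a binary classifier. The sketch below shows how these four quantities relate to the counts in a confusion matrix. The function name `classification_metrics` and the counts are hypothetical and chosen only for easy arithmetic; they are not taken from the study and do not reproduce its results.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Precision: share of predicted cases that are true cases.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: share of true cases that the classifier finds.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts for illustration only (NOT from the study).
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=130)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Because accuracy also counts true negatives, it can stay high in a control-heavy sample even when recall is modest, which is one reason precision, recall, and F1 are usually reported alongside it.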
Affiliation(s)
- Agaz Wani
  - University of South Florida College of Public Health, Genomics Program
- Seyma Katrinli
  - Emory University Department of Gynecology and Obstetrics
- Xiang Zhao
  - Boston University School of Public Health
- Anthony Zannas
  - University of North Carolina at Chapel Hill, Carolina Stress Initiative
- Allison Aiello
  - Robert N Butler Columbia Aging Center, Columbia University
- Dewleen Baker
  - University of California San Diego, Department of Psychiatry
- Marco Boks
  - Brain Center University Medical Center Utrecht, Department of Psychiatry
- Elbert Geuze
  - Netherlands Ministry of Defence, Brain Research and Innovation Centre
- Ronald Kessler
  - Harvard Medical School, Department of Health Care Policy
- Anthony King
  - The Ohio State University, College of Medicine, Institute for Behavioral Medicine Research
- Nastassja Koen
  - University of Cape Town, Department of Psychiatry & Mental Health
- Israel Liberzon
  - Texas A&M University College of Medicine, Department of Psychiatry and Behavioral Sciences
- Adriana Lori
  - Emory University, Department of Psychiatry and Behavioral Sciences
- Jurjen Luykx
  - UMC Utrecht Brain Center Rudolf Magnus, Department of Psychiatry
- Mark Miller
  - Boston University School of Medicine, Psychiatry
- Nicole Nugent
  - Alpert Brown Medical School, Department of Emergency Medicine
- Sheila Rauch
  - Emory University, Department of Psychiatry & Behavioral Sciences
- Bart Rutten
  - Maastricht Universitair Medisch Centrum, School for Mental Health and Neuroscience, Department of Psychiatry and Neuropsychology
- Dan Stein
  - University of Cape Town, Department of Psychiatry & Mental Health
- Murray Stein
  - University of California San Diego, Department of Psychiatry
- Robert Ursano
  - Uniformed Services University, Department of Psychiatry
- Erin Ware
  - University of Michigan, Population Studies Center
- Derek Wildman
  - University of South Florida College of Public Health, Genomics Program
- Erika Wolf
  - VA Boston Healthcare System, National Center for PTSD
- Mark Logue
  - Boston University School of Public Health
- Alicia Smith
  - Emory University Department of Gynecology and Obstetrics
- Monica Uddin
  - University of South Florida College of Public Health, Genomics Program
- Eric Vermetten
  - Leiden University Medical Center, Department of Psychiatry
- Christiaan Vinkers
  - Amsterdam Neuroscience, Mood, Anxiety, Psychosis, Sleep & Stress Program
2
Kim J, Koh H. MiTree: A Unified Web Cloud Analytic Platform for User-Friendly and Interpretable Microbiome Data Mining Using Tree-Based Methods. Microorganisms 2023; 11:2816. [PMID: 38004827 PMCID: PMC10672986 DOI: 10.3390/microorganisms11112816] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 10/11/2023] [Revised: 11/05/2023] [Accepted: 11/17/2023] [Indexed: 11/26/2023] Open
Abstract
The advent of next-generation sequencing has greatly accelerated the field of human microbiome studies. Currently, investigators are seeking, struggling and competing to find new ways to diagnose, treat and prevent human diseases through the human microbiome. Machine learning is a promising approach to help such an effort, especially due to the high complexity of microbiome data. However, many of the current machine learning algorithms are in a "black box", i.e., they are difficult to understand and interpret. In addition, clinicians, public health practitioners and biologists are not usually skilled at computer programming, and they do not always have high-end computing devices. Thus, in this study, we introduce a unified web cloud analytic platform, named MiTree, for user-friendly and interpretable microbiome data mining. MiTree employs tree-based learning methods, including decision tree, random forest and gradient boosting, that are well understood and suited to human microbiome studies. We also stress that MiTree can address both classification and regression problems through covariate-adjusted or unadjusted analysis. MiTree should serve as an easy-to-use and interpretable data mining tool for microbiome-based disease prediction modeling, and should provide new insights into microbiome-based diagnostics, treatment and prevention. MiTree is an open-source software that is available on our web server.
3
Kohn R, Harhay MO, Weissman GE, Urbanowicz R, Wang W, Anesi GL, Scott S, Bayes B, Greysen SR, Halpern SD, Kerlin MP. A Data-Driven Analysis of Ward Capacity Strain Metrics That Predict Clinical Outcomes Among Survivors of Acute Respiratory Failure. J Med Syst 2023; 47:83. [PMID: 37542590 DOI: 10.1007/s10916-023-01978-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/19/2022] [Accepted: 07/18/2023] [Indexed: 08/07/2023]
Abstract
Supply-demand mismatch of ward resources ("ward capacity strain") alters care and outcomes. Narrow strain definitions and heterogeneous populations limit strain literature. Evaluate the predictive utility of a large set of candidate strain variables for in-hospital mortality and discharge destination among acute respiratory failure (ARF) survivors. In a retrospective cohort of ARF survivors transferred from intensive care units (ICUs) to wards in five hospitals from 4/2017-12/2019, we applied 11 machine learning (ML) models to identify ward strain measures during the first 24 hours after transfer most predictive of outcomes. Measures spanned patient volume (census, admissions, discharges), staff workload (medications administered, off-ward transports, transfusions, isolation precautions, patients per respiratory therapist and nurse), and average patient acuity (Laboratory Acute Physiology Score version 2, ICU transfers) domains. The cohort included 5,052 visits in 43 wards. Median age was 65 years (IQR 56-73); 2,865 (57%) were male; and 2,865 (57%) were white. 770 (15%) patients died in the hospital or had hospice discharges, and 2,628 (61%) were discharged home and 964 (23%) to skilled nursing facilities (SNFs). Ward admissions, isolation precautions, and hospital admissions most consistently predicted in-hospital mortality across ML models. Patients per nurse most consistently predicted discharge to home and SNF, and medications administered predicted SNF discharge. In this hypothesis-generating analysis of candidate ward strain variables' prediction of outcomes among ARF survivors, several variables emerged as consistently predictive of key outcomes across ML models. These findings suggest targets for future inferential studies to elucidate mechanisms of ward strain's adverse effects.
Affiliation(s)
- Rachel Kohn
  - Department of Medicine, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
  - Palliative and Advanced Illness Research (PAIR) Center, University of Pennsylvania, Philadelphia, PA, USA
  - Leonard Davis Institute of Health Economics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Michael O Harhay
  - Palliative and Advanced Illness Research (PAIR) Center, University of Pennsylvania, Philadelphia, PA, USA
  - Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Gary E Weissman
  - Department of Medicine, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
  - Palliative and Advanced Illness Research (PAIR) Center, University of Pennsylvania, Philadelphia, PA, USA
  - Leonard Davis Institute of Health Economics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Wei Wang
  - Palliative and Advanced Illness Research (PAIR) Center, University of Pennsylvania, Philadelphia, PA, USA
- George L Anesi
  - Department of Medicine, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
  - Palliative and Advanced Illness Research (PAIR) Center, University of Pennsylvania, Philadelphia, PA, USA
  - Leonard Davis Institute of Health Economics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Stefania Scott
  - Palliative and Advanced Illness Research (PAIR) Center, University of Pennsylvania, Philadelphia, PA, USA
- Brian Bayes
  - Palliative and Advanced Illness Research (PAIR) Center, University of Pennsylvania, Philadelphia, PA, USA
- S Ryan Greysen
  - Department of Medicine, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
  - Leonard Davis Institute of Health Economics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Scott D Halpern
  - Department of Medicine, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
  - Palliative and Advanced Illness Research (PAIR) Center, University of Pennsylvania, Philadelphia, PA, USA
  - Leonard Davis Institute of Health Economics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
  - Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
  - Department of Medical Ethics and Health Policy, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Meeta Prasad Kerlin
  - Department of Medicine, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
  - Palliative and Advanced Illness Research (PAIR) Center, University of Pennsylvania, Philadelphia, PA, USA
  - Leonard Davis Institute of Health Economics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
  - Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
4
Freda PJ, Ghosh A, Zhang E, Luo T, Chitre AS, Polesskaya O, St Pierre CL, Gao J, Martin CD, Chen H, Garcia-Martinez AG, Wang T, Han W, Ishiwari K, Meyer P, Lamparelli A, King CP, Palmer AA, Li R, Moore JH. Automated quantitative trait locus analysis (AutoQTL). BioData Min 2023; 16:14. [PMID: 37038201 PMCID: PMC10088184 DOI: 10.1186/s13040-023-00331-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/07/2022] [Accepted: 03/31/2023] [Indexed: 04/12/2023] Open
Abstract
BACKGROUND Quantitative Trait Locus (QTL) analysis and Genome-Wide Association Studies (GWAS) have the power to identify variants that capture significant levels of phenotypic variance in complex traits. However, effort and time are required to select the best methods and optimize parameters and pre-processing steps. Although machine learning approaches have been shown to greatly assist in optimization and data processing, applying them to QTL analysis and GWAS is challenging due to the complexity of large, heterogenous datasets. Here, we describe proof-of-concept for an automated machine learning approach, AutoQTL, with the ability to automate many complicated decisions related to analysis of complex traits and generate solutions to describe relationships that exist in genetic data. RESULTS Using a publicly available dataset of 18 putative QTL from a large-scale GWAS of body mass index in the laboratory rat, Rattus norvegicus, AutoQTL captures the phenotypic variance explained under a standard additive model. AutoQTL also detects evidence of non-additive effects including deviations from additivity and 2-way epistatic interactions in simulated data via multiple optimal solutions. Additionally, feature importance metrics provide different insights into the inheritance models and predictive power of multiple GWAS-derived putative QTL. CONCLUSIONS This proof-of-concept illustrates that automated machine learning techniques can complement standard approaches and have the potential to detect both additive and non-additive effects via various optimal solutions and feature importance metrics. In the future, we aim to expand AutoQTL to accommodate omics-level datasets with intelligent feature selection and feature engineering strategies.
Affiliation(s)
- Philip J Freda
  - Department of Computational Biomedicine, Cedars-Sinai Medical Center, 700 N. San Vicente Blvd., Pacific Design Center, Suite G540, West Hollywood, CA, 90069, USA
- Attri Ghosh
  - Department of Computational Biomedicine, Cedars-Sinai Medical Center, 700 N. San Vicente Blvd., Pacific Design Center, Suite G540, West Hollywood, CA, 90069, USA
- Elizabeth Zhang
  - Department of Computational Biomedicine, Cedars-Sinai Medical Center, 700 N. San Vicente Blvd., Pacific Design Center, Suite G540, West Hollywood, CA, 90069, USA
- Tianhao Luo
  - Department of Computational Biomedicine, Cedars-Sinai Medical Center, 700 N. San Vicente Blvd., Pacific Design Center, Suite G540, West Hollywood, CA, 90069, USA
- Apurva S Chitre
  - Department of Psychiatry, University of California San Diego, 9500 Gilman Dr., Mail Code: 0667, La Jolla, CA, 92093-0667, USA
- Oksana Polesskaya
  - Department of Psychiatry, University of California San Diego, 9500 Gilman Dr., Mail Code: 0667, La Jolla, CA, 92093-0667, USA
- Celine L St Pierre
  - Department of Psychiatry, University of California San Diego, 9500 Gilman Dr., Mail Code: 0667, La Jolla, CA, 92093-0667, USA
- Jianjun Gao
  - Department of Psychiatry, University of California San Diego, 9500 Gilman Dr., Mail Code: 0667, La Jolla, CA, 92093-0667, USA
- Connor D Martin
  - Department of Pharmacology & Toxicology, Jacobs School of Medicine and Biomedical Sciences, University at Buffalo, 955 Main Street, Suite 3102, Buffalo, NY, 14203, USA
- Hao Chen
  - Department of Pharmacology, Addiction Science, and Toxicology, University of Tennessee Health Science Center, Translational Research Building, 71 South Manassas, Memphis, TN, 38163, USA
- Angel G Garcia-Martinez
  - Department of Pharmacology, Addiction Science, and Toxicology, University of Tennessee Health Science Center, Translational Research Building, 71 South Manassas, Memphis, TN, 38163, USA
- Tengfei Wang
  - Department of Pharmacology, Addiction Science, and Toxicology, University of Tennessee Health Science Center, Translational Research Building, 71 South Manassas, Memphis, TN, 38163, USA
- Wenyan Han
  - Department of Pharmacology, Addiction Science, and Toxicology, University of Tennessee Health Science Center, Translational Research Building, 71 South Manassas, Memphis, TN, 38163, USA
- Keita Ishiwari
  - Department of Pharmacology & Toxicology, Jacobs School of Medicine and Biomedical Sciences, University at Buffalo, 955 Main Street, Suite 3102, Buffalo, NY, 14203, USA
  - Clinical and Research Institute on Addictions, University at Buffalo, 1021 Main Street, Buffalo, NY, 14203-1016, USA
- Paul Meyer
  - Department of Psychology, University at Buffalo, 204 Park Hall, North Campus, Buffalo, NY, 14260-4110, USA
- Alexander Lamparelli
  - Department of Psychology, University at Buffalo, 204 Park Hall, North Campus, Buffalo, NY, 14260-4110, USA
- Christopher P King
  - Department of Psychology, University at Buffalo, 204 Park Hall, North Campus, Buffalo, NY, 14260-4110, USA
- Abraham A Palmer
  - Department of Psychiatry, University of California San Diego, 9500 Gilman Dr., Mail Code: 0667, La Jolla, CA, 92093-0667, USA
  - Institute for Genomic Medicine, University of California San Diego, 9500 Gilman Dr., Mail Code: 0667, La Jolla, CA, 92093-0667, USA
- Ruowang Li
  - Department of Computational Biomedicine, Cedars-Sinai Medical Center, 700 N. San Vicente Blvd., Pacific Design Center, Suite G540, West Hollywood, CA, 90069, USA
- Jason H Moore
  - Department of Computational Biomedicine, Cedars-Sinai Medical Center, 700 N. San Vicente Blvd., Pacific Design Center, Suite G540, West Hollywood, CA, 90069, USA
5
Frndak S, Yu G, Oulhote Y, Queirolo EI, Barg G, Vahter M, Mañay N, Peregalli F, Olson JR, Ahmed Z, Kordas K. Reducing the complexity of high-dimensional environmental data: An analytical framework using LASSO with considerations of confounding for statistical inference. Int J Hyg Environ Health 2023; 249:114116. [PMID: 36805184 PMCID: PMC10977870 DOI: 10.1016/j.ijheh.2023.114116] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/02/2022] [Revised: 01/10/2023] [Accepted: 01/17/2023] [Indexed: 02/19/2023]
Abstract
PURPOSE Frameworks for selecting exposures in high-dimensional environmental datasets, while considering confounding, are lacking. We present a two-step approach for exposure selection with subsequent confounder adjustment for statistical inference. METHODS We measured cognitive ability in 338 children using the Woodcock-Muñoz General Intellectual Ability (GIA) score, and potential associated features across several environmental domains. Initially, 111 variables theoretically associated with GIA score were introduced into a Least Absolute Shrinkage and Selection Operator (LASSO) in a 50% feature selection subsample. Effect estimates for selected features were subsequently modeled in linear regressions in a 50% inference (hold out) subsample, first adjusting for sex and age and later for covariates selected via directed acyclic graphs (DAGs). All models were adjusted for clustering by school. RESULTS Of the 15 LASSO selected variables, eleven were not associated with GIA score following our inference modeling approach. Four variables were associated with GIA scores, including: serum ferritin adjusted for inflammation (inversely), mother's IQ (positively), father's education (positively), and hours per day the child works on homework (positively). Serum ferritin was not in the expected direction. CONCLUSIONS Our two-step approach moves high-dimensional feature selection a step further by incorporating DAG-based confounder adjustment for statistical inference.
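Illustrative note: the two-step design in the abstract above (LASSO selection in one half of the sample, regression-based inference in the held-out half) can be sketched on simulated data. This is a minimal standard-library sketch, not the authors' code. The function names, the simulated data, and the penalty value are assumptions; clustering by school and the DAG-based covariate adjustment described in the abstract are omitted.

```python
import random


def soft_threshold(value, penalty):
    """Shrink value toward zero by penalty; return 0.0 inside [-penalty, penalty]."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


def lasso_coordinate_descent(X, y, penalty, n_sweeps=100):
    """Minimise (1/2n)*||y - Xb||^2 + penalty*||b||_1 (no intercept)."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)
    for _ in range(n_sweeps):
        for j in range(p):
            column = [row[j] for row in X]
            scale = sum(v * v for v in column) / n
            if scale == 0.0:
                continue
            # Covariance of feature j with the residual that excludes its own fit.
            rho = sum(c * (r + beta[j] * c) for c, r in zip(column, residual)) / n
            updated = soft_threshold(rho, penalty) / scale
            shift = updated - beta[j]
            residual = [r - shift * c for r, c in zip(residual, column)]
            beta[j] = updated
    return beta


def ols_slope(x, y):
    """Slope of a simple (one-predictor) least-squares regression of y on x."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


# Simulated data (assumption): only feature 0 truly affects the outcome.
rng = random.Random(0)
rows = []
for _ in range(400):
    features = [rng.gauss(0, 1) for _ in range(3)]
    outcome = 2.0 * features[0] + rng.gauss(0, 0.5)
    rows.append((features, outcome))

# Step 1: feature selection with LASSO in the first 50% of the sample.
selection, holdout = rows[:200], rows[200:]
beta = lasso_coordinate_descent(
    [f for f, _ in selection], [o for _, o in selection], penalty=0.5
)
selected = [j for j, b in enumerate(beta) if b != 0.0]

# Step 2: unpenalised effect estimates in the held-out 50%.
holdout_y = [o for _, o in holdout]
holdout_slopes = {j: ols_slope([f[j] for f, _ in holdout], holdout_y) for j in selected}
print(selected, holdout_slopes)
```

Estimating effects in a separate half avoids reusing the same observations for both selection and inference, and the held-out estimates are not shrunk by the penalty.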
Affiliation(s)
- Seth Frndak
  - Department of Epidemiology and Environmental Health: University at Buffalo, The State University of New York, USA
- Guan Yu
  - Department of Biostatistics: University of Pittsburgh, USA
- Youssef Oulhote
  - Department of Epidemiology, University of Massachusetts Amherst, USA
- Elena I Queirolo
  - Department of Neuroscience and Learning, Catholic University of Uruguay, Montevideo, Uruguay
- Gabriel Barg
  - Department of Neuroscience and Learning, Catholic University of Uruguay, Montevideo, Uruguay
- Marie Vahter
  - Department of Environmental Medicine: Karolinska Institute, Sweden
- Nelly Mañay
  - Faculty of Chemistry, University of the Republic of Uruguay (UDELAR), Montevideo, Uruguay
- Fabiana Peregalli
  - Department of Neuroscience and Learning, Catholic University of Uruguay, Montevideo, Uruguay
- James R Olson
  - Department of Epidemiology and Environmental Health: University at Buffalo, The State University of New York, USA
- Zia Ahmed
  - Research and Education in eNergy, Environment and Water (RENEW) Institute University at Buffalo, The State University of New York, USA
- Katarzyna Kordas
  - Department of Epidemiology and Environmental Health: University at Buffalo, The State University of New York, USA
6
Automated quantitative trait locus analysis (AutoQTL). bioRxiv: The Preprint Server for Biology 2023:2023.01.12.523835. [PMID: 36711526 PMCID: PMC9882220 DOI: 10.1101/2023.01.12.523835] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/16/2023]
Abstract
Background Quantitative Trait Locus (QTL) analysis and Genome-Wide Association Studies (GWAS) have the power to identify variants that capture significant levels of phenotypic variance in complex traits. However, effort and time are required to select the best methods and optimize parameters and pre-processing steps. Although machine learning approaches have been shown to greatly assist in optimization and data processing, applying them to QTL analysis and GWAS is challenging due to the complexity of large, heterogenous datasets. Here, we describe proof-of-concept for an automated machine learning approach, AutoQTL, with the ability to automate many complex decisions related to analysis of complex traits and generate diverse solutions to describe relationships that exist in genetic data. Results Using a dataset of 18 putative QTL from a large-scale GWAS of body mass index in the laboratory rat, Rattus norvegicus, AutoQTL captures the phenotypic variance explained under a standard additive model while also providing evidence of non-additive effects including deviations from additivity and 2-way epistatic interactions from simulated data via multiple optimal solutions. Additionally, feature importance metrics provide different insights into the inheritance models and predictive power of multiple GWAS-derived putative QTL. Conclusions This proof-of-concept illustrates that automated machine learning techniques can be applied to genetic data and have the potential to detect both additive and non-additive effects via various optimal solutions and feature importance metrics. In the future, we aim to expand AutoQTL to accommodate omics-level datasets with intelligent feature selection strategies.
7
Manduchi E, Le TT, Fu W, Moore JH. Genetic Analysis of Coronary Artery Disease Using Tree-Based Automated Machine Learning Informed By Biology-Based Feature Selection. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2022; 19:1379-1386. [PMID: 34310318 PMCID: PMC9291719 DOI: 10.1109/tcbb.2021.3099068] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Indexed: 06/13/2023]
Abstract
Machine Learning (ML) approaches are increasingly being used in biomedical applications. Important challenges of ML include choosing the right algorithm and tuning the parameters for optimal performance. Automated ML (AutoML) methods, such as Tree-based Pipeline Optimization Tool (TPOT), have been developed to take some of the guesswork out of ML thus making this technology available to users from more diverse backgrounds. The goals of this study were to assess applicability of TPOT to genomics and to identify combinations of single nucleotide polymorphisms (SNPs) associated with coronary artery disease (CAD), with a focus on genes with high likelihood of being good CAD drug targets. We leveraged public functional genomic resources to group SNPs into biologically meaningful sets to be selected by TPOT. We applied this strategy to data from the U.K. Biobank, detecting a strikingly recurrent signal stemming from a group of 28 SNPs. Importance analysis of these SNPs uncovered functional relevance of the top SNPs to genes whose association with CAD is supported in the literature and other resources. Furthermore, we employed game-theory based metrics to study SNP contributions to individual-level TPOT predictions and discover distinct clusters of well-predicted CAD cases. The latter indicates a promising approach towards precision medicine.
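Illustrative note: the abstract above mentions game-theory based metrics for per-individual feature contributions without naming them. Shapley values are one widely used metric of that kind, and the sketch below assumes that reading. It computes exact Shapley values for a toy three-feature model by enumerating all coalitions; `toy_model`, the instance, and the baseline are hypothetical and unrelated to the study's data or pipeline.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, instance, baseline):
    """Exact Shapley values of each feature for one prediction.

    Features inside a coalition take their value from `instance`;
    features outside it take their value from `baseline`.
    """
    n = len(instance)

    def value(coalition):
        x = [instance[i] if i in coalition else baseline[i] for i in range(n)]
        return model(x)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Probability weight of a coalition of this size preceding feature i.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                members = set(subset)
                phi[i] += weight * (value(members | {i}) - value(members))
    return phi


def toy_model(x):
    """Hypothetical risk score: two additive effects plus one pairwise interaction."""
    return 2.0 * x[0] + 1.0 * x[1] + 3.0 * x[0] * x[2]


instance = [1, 1, 1]   # e.g. three risk alleles present (toy coding)
baseline = [0, 0, 0]   # reference individual with none
phi = shapley_values(toy_model, instance, baseline)
print(phi)
```

In this toy model the interaction term is split equally between the two features involved, and the contributions sum to the difference between the prediction for the instance and the prediction for the baseline. Exact enumeration grows exponentially with the number of features, so practical tools rely on approximations.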
8
Musolf AM, Holzinger ER, Malley JD, Bailey-Wilson JE. What makes a good prediction? Feature importance and beginning to open the black box of machine learning in genetics. Hum Genet 2021; 141:1515-1528. [PMID: 34862561 PMCID: PMC9360120 DOI: 10.1007/s00439-021-02402-z] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 05/13/2021] [Accepted: 11/08/2021] [Indexed: 01/26/2023]
Abstract
Genetic data have become increasingly complex within the past decade, leading researchers to pursue increasingly complex questions, such as those involving epistatic interactions and protein prediction. Traditional methods are ill-suited to answer these questions, but machine learning (ML) techniques offer an alternative solution. ML algorithms are commonly used in genetics to predict or classify subjects, but some methods evaluate which features (variables) are responsible for creating a good prediction; this is called feature importance. This is critical in genetics, as researchers are often interested in which features (e.g., SNP genotype or environmental exposure) are responsible for a good prediction. This allows for the deeper analysis beyond simple prediction, including the determination of risk factors associated with a given phenotype. Feature importance further permits the researcher to peer inside the black box of many ML algorithms to see how they work and which features are critical in informing a good prediction. This review focuses on ML methods that provide feature importance metrics for the analysis of genetic data. Five major categories of ML algorithms: k nearest neighbors, artificial neural networks, deep learning, support vector machines, and random forests are described. The review ends with a discussion of how to choose the best machine for a data set. This review will be particularly useful for genetic researchers looking to use ML methods to answer questions beyond basic prediction and classification.
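Illustrative note: permutation importance is one common, model-agnostic way to obtain the kind of feature importance discussed in the abstract above. Whether the review covers this particular variant is not stated here, so treat it as an assumption. The sketch shuffles one feature at a time and records the drop in accuracy; the data, the genotype coding, and `toy_model` are hypothetical.

```python
import random


def accuracy(model, X, y):
    """Fraction of rows for which the model's prediction matches the label."""
    return sum(model(row) == label for row, label in zip(X, y)) / len(y)


def permutation_importance(model, X, y, n_repeats=10, seed=0):
    """Mean drop in accuracy after shuffling each feature column in turn."""
    rng = random.Random(seed)
    baseline = accuracy(model, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            # Rebuild the data with only column j permuted.
            permuted = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(model, permuted, y))
        importances.append(sum(drops) / n_repeats)
    return importances


# Toy data (assumption): genotype codes 0/1/2 at two SNPs; only SNP 0 drives the label.
data_rng = random.Random(1)
X = [[data_rng.randint(0, 2), data_rng.randint(0, 2)] for _ in range(300)]
y = [1 if row[0] >= 1 else 0 for row in X]


def toy_model(row):
    """Stand-in for a fitted classifier that uses only the first SNP."""
    return 1 if row[0] >= 1 else 0


importances = permutation_importance(toy_model, X, y)
print(importances)
```

A feature the model never uses shows zero importance, while shuffling an informative feature degrades accuracy. The approach works with any fitted predictor because it only needs predictions, not model internals.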
Affiliation(s)
- Anthony M Musolf
  - Statistical Genetics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, 333 Cassell Drive Suite 1200, Baltimore, MD, 21224, USA
- Emily R Holzinger
  - Target Sciences, Informatics and Predictive Sciences, Bristol Myers Squibb, Cambridge, MA, USA
- James D Malley
  - Statistical Genetics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, 333 Cassell Drive Suite 1200, Baltimore, MD, 21224, USA
- Joan E Bailey-Wilson
  - Statistical Genetics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, 333 Cassell Drive Suite 1200, Baltimore, MD, 21224, USA
9
Manduchi E, Moore JH. Leveraging Automated Machine Learning for the Analysis of Global Public Health Data: A Case Study in Malaria. Int J Public Health 2021; 66:614296. [PMID: 34744577 PMCID: PMC8565284 DOI: 10.3389/ijph.2021.614296] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 10/05/2020] [Accepted: 03/17/2021] [Indexed: 11/13/2022] Open
Affiliation(s)
- Elisabetta Manduchi
  - Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania, Philadelphia, PA, United States
  - Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, PA, United States
- Jason H Moore
  - Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania, Philadelphia, PA, United States
  - Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, PA, United States