1. Bohlin J, Håberg SE, Magnus P, Gjessing HK. MinLinMo: a minimalist approach to variable selection and linear model prediction. BMC Bioinformatics 2024; 25:380. [PMID: 39695947 DOI: 10.1186/s12859-024-06000-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/03/2024] [Accepted: 11/26/2024] [Indexed: 12/20/2024]
Abstract
Generating prediction models from high-dimensional data often results in large models with many predictors. Causal inference for such models can therefore be difficult or even impossible in practice. The stand-alone software package MinLinMo emphasizes small linear prediction models over the highest possible predictability, with a particular focus on including variables correlated with the outcome, minimal memory usage, and speed. MinLinMo is demonstrated on large epigenetic datasets with prediction models for chronological age, gestational age, and birth weight comprising, respectively, 15, 14 and 10 predictors. The parsimonious MinLinMo models perform comparably to established prediction models requiring hundreds of predictors.
Affiliation(s)
- Jon Bohlin
  - Department of Method Development and Analytics, Section for modeling and bioinformatics, Norwegian Institute of Public Health, Oslo, Norway
  - Centre for Fertility and Health, Norwegian Institute of Public Health, Oslo, Norway
- Siri E Håberg
  - Centre for Fertility and Health, Norwegian Institute of Public Health, Oslo, Norway
  - Department of Global Public Health and Primary Care, University of Bergen, Bergen, Norway
- Per Magnus
  - Centre for Fertility and Health, Norwegian Institute of Public Health, Oslo, Norway
- Håkon K Gjessing
  - Centre for Fertility and Health, Norwegian Institute of Public Health, Oslo, Norway
  - Department of Global Public Health and Primary Care, University of Bergen, Bergen, Norway
2. Haftorn KL, Romanowska J, Lee Y, Page CM, Magnus PM, Håberg SE, Bohlin J, Jugessur A, Denault WRP. Stability selection enhances feature selection and enables accurate prediction of gestational age using only five DNA methylation sites. Clin Epigenetics 2023; 15:114. [PMID: 37443060 PMCID: PMC10339624 DOI: 10.1186/s13148-023-01528-3] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 04/28/2023] [Accepted: 06/29/2023] [Indexed: 07/15/2023]
Abstract
BACKGROUND: DNA methylation (DNAm) is robustly associated with chronological age in children and adults, and with gestational age (GA) in newborns. This property has enabled the development of several epigenetic clocks that can accurately predict chronological age and GA. However, the reason for the lack of overlap in predictive CpGs across different epigenetic clocks remains elusive. Our main aim was therefore to identify and characterize CpGs that are stably predictive of GA. RESULTS: We applied a statistical approach called 'stability selection' to DNAm data from 2138 newborns in the Norwegian Mother, Father, and Child Cohort study. Stability selection combines subsampling with variable selection to restrict the number of false discoveries in the set of selected variables. Twenty-four CpGs were identified as being stably predictive of GA. Intriguingly, only up to 10% of the CpGs in previous GA clocks were found to be stably selected. Based on these results, we used generalized additive model regression to develop a new GA clock consisting of only five CpGs, which showed predictive performance similar to that of previous GA clocks (R² = 0.674, median absolute deviation = 4.4 days). These CpGs were in or near genes and regulatory regions involved in immune responses, metabolism, and developmental processes. Furthermore, accounting for nonlinear associations improved prediction performance in preterm newborns. CONCLUSION: We present a methodological framework for feature selection that is broadly applicable to any trait that can be predicted from DNAm data. We demonstrate its utility by identifying CpGs that are highly predictive of GA and present a new and highly performant GA clock based on only five CpGs that is more amenable to a clinical setting.
Affiliation(s)
- Kristine L Haftorn
  - Centre for Fertility and Health, Norwegian Institute of Public Health, Oslo, Norway
  - Institute of Health and Society, University of Oslo, Oslo, Norway
- Julia Romanowska
  - Centre for Fertility and Health, Norwegian Institute of Public Health, Oslo, Norway
  - Department of Global Public Health and Primary Care, University of Bergen, 5020, Bergen, Norway
- Yunsung Lee
  - Centre for Fertility and Health, Norwegian Institute of Public Health, Oslo, Norway
- Christian M Page
  - Centre for Fertility and Health, Norwegian Institute of Public Health, Oslo, Norway
  - Division for Mental and Physical Health, Department of Physical Health and Aging, Norwegian Institute of Public Health, Oslo, Norway
- Per M Magnus
  - Centre for Fertility and Health, Norwegian Institute of Public Health, Oslo, Norway
- Siri E Håberg
  - Centre for Fertility and Health, Norwegian Institute of Public Health, Oslo, Norway
- Jon Bohlin
  - Centre for Fertility and Health, Norwegian Institute of Public Health, Oslo, Norway
  - Division for Infection Control and Environmental Health, Department of Infectious Disease Epidemiology and Modelling, Norwegian Institute of Public Health, Oslo, Norway
- Astanand Jugessur
  - Centre for Fertility and Health, Norwegian Institute of Public Health, Oslo, Norway
  - Department of Global Public Health and Primary Care, University of Bergen, 5020, Bergen, Norway
- William R P Denault
  - Centre for Fertility and Health, Norwegian Institute of Public Health, Oslo, Norway
  - Department of Human Genetics, University of Chicago, Chicago, IL, 60637, USA
3. Miao R, Dang Q, Cai J, Huang HH, Xie SL, Liang Y. Sparse principal component analysis based on genome network for correcting cell type heterogeneity in epigenome-wide association studies. Med Biol Eng Comput 2022; 60:2601-2618. [DOI: 10.1007/s11517-022-02599-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/09/2021] [Accepted: 04/30/2022] [Indexed: 10/17/2022]