1
|
Using Pre-training and Interaction Modeling for ancestry-specific disease prediction in UK Biobank. ARXIV 2024:arXiv:2404.17626v2. [PMID: 38764589 PMCID: PMC11100914] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/21/2024]
Abstract
Recent genome-wide association studies (GWAS) have uncovered the genetic basis of complex traits, but show an under-representation of individuals of non-European descent, underscoring a critical gap in genetic research. Here, we assess whether we can improve disease prediction across diverse ancestries using multiomic data. We evaluate the performance of Group-LASSO INTERaction-NET (glinternet) and pretrained lasso in disease prediction focusing on diverse ancestries in the UK Biobank. Models were trained on data from White British and other ancestries and validated across a cohort of over 96,000 individuals for 8 diseases. Out of 96 models trained, we report 16 with statistically significant incremental predictive performance in terms of ROC-AUC scores (p-value < 0.05), found for diabetes, arthritis, gall stones, cystitis, asthma and osteoarthritis. For the interaction and pretrained models that outperformed the baseline, the PRS score was the primary driver behind prediction. Our findings indicate that both interaction terms and pre-training can enhance prediction accuracy, but only for a limited set of diseases and with moderate improvements in accuracy.
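To make the "incremental predictive performance in terms of ROC-AUC" concrete, the following is a minimal sketch of how such a difference could be computed. The function name `roc_auc` and the toy labels and scores are assumptions made for illustration; they are not the study's data or code, and the study's significance test is not reproduced here.

```python
def roc_auc(labels, scores):
    """ROC-AUC as the probability that a random case outranks a random control.

    Ties between a case and a control count as half a win.
    """
    cases = [s for y, s in zip(labels, scores) if y == 1]
    controls = [s for y, s in zip(labels, scores) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for c in cases:
        for k in controls:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(cases) * len(controls))


# Toy example (hypothetical values): the same individuals scored by a
# baseline model and by a model with added terms.
labels = [0, 0, 1, 1, 0, 1]
baseline_scores = [0.2, 0.6, 0.5, 0.7, 0.3, 0.4]
augmented_scores = [0.1, 0.4, 0.6, 0.8, 0.3, 0.5]

# Incremental performance is the AUC gained over the baseline.
incremental_auc = roc_auc(labels, augmented_scores) - roc_auc(labels, baseline_scores)
print(f"incremental ROC-AUC: {incremental_auc:.3f}")
```

The pairwise loop is quadratic in sample size; for cohorts of the size described above, a rank-based computation would be the practical choice.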
|
2
|
Beyond guilty by association at scale: searching for causal variants on the basis of genome-wide summary statistics. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.02.28.582621. [PMID: 38464202 PMCID: PMC10925326 DOI: 10.1101/2024.02.28.582621] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/12/2024]
Abstract
Understanding the causal genetic architecture of complex phenotypes is essential for future research into disease mechanisms and potential therapies. Here, we present a novel framework for genome-wide detection of sets of variants that carry non-redundant information on the phenotypes and are therefore more likely to be causal in a biological sense. Crucially, our framework requires only summary statistics obtained from standard genome-wide marginal association testing. The described approach, implemented in open-source software, is also computationally efficient, requiring less than 15 minutes on a single CPU to perform genome-wide analysis. Through extensive genome-wide simulation studies, we show that the method can substantially outperform usual two-stage marginal association testing and fine-mapping procedures in precision and recall. In applications to a meta-analysis of ten large-scale genetic studies of Alzheimer's disease (AD), we identified 82 loci associated with AD, including 37 additional loci missed by the conventional GWAS pipeline. The identified putative causal variants achieve state-of-the-art agreement with massively parallel reporter assays and CRISPR-Cas9 experiments. Additionally, we applied the method to a retrospective analysis of 67 large-scale GWAS summary statistics since 2013 for a variety of phenotypes. Results reveal the method's capacity to robustly discover additional loci for polygenic traits and pinpoint potential causal variants underpinning each locus beyond the conventional GWAS pipeline, contributing to a deeper understanding of complex genetic architectures in post-GWAS analyses.
|
3
|
Temporal dynamics of the multi-omic response to endurance exercise training. Nature 2024; 629:174-183. [PMID: 38693412 PMCID: PMC11062907 DOI: 10.1038/s41586-023-06877-w] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 09/21/2022] [Accepted: 11/16/2023] [Indexed: 05/03/2024]
Abstract
Regular exercise promotes whole-body health and prevents disease, but the underlying molecular mechanisms are incompletely understood [1-3]. Here, the Molecular Transducers of Physical Activity Consortium [4] profiled the temporal transcriptome, proteome, metabolome, lipidome, phosphoproteome, acetylproteome, ubiquitylproteome, epigenome and immunome in whole blood, plasma and 18 solid tissues in male and female Rattus norvegicus over eight weeks of endurance exercise training. The resulting data compendium encompasses 9,466 assays across 19 tissues, 25 molecular platforms and 4 training time points. Thousands of shared and tissue-specific molecular alterations were identified, with sex differences found in multiple tissues. Temporal multi-omic and multi-tissue analyses revealed expansive biological insights into the adaptive responses to endurance training, including widespread regulation of immune, metabolic, stress response and mitochondrial pathways. Many changes were relevant to human health, including non-alcoholic fatty liver disease, inflammatory bowel disease, cardiovascular health and tissue injury and recovery. The data and analyses presented in this study will serve as valuable resources for understanding and exploring the multi-tissue molecular effects of endurance training and are provided in a public repository (https://motrpac-data.org/).
|
4
|
A modified Michaelis-Menten equation estimates growth from birth to 3 years in healthy babies in the USA. BMC Med Res Methodol 2024; 24:27. [PMID: 38302887 PMCID: PMC10832211 DOI: 10.1186/s12874-024-02145-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/24/2023] [Accepted: 01/08/2024] [Indexed: 02/03/2024] Open
Abstract
BACKGROUND Standard pediatric growth curves cannot be used to impute missing height or weight measurements in individual children. The Michaelis-Menten equation, used for characterizing substrate-enzyme saturation curves, has been shown to model growth in many organisms including nonhuman vertebrates. We investigated whether this equation could be used to interpolate missing growth data in children in the first three years of life and compared this interpolation to several common interpolation methods and pediatric growth models. METHODS We developed a modified Michaelis-Menten equation and compared expected to actual growth, first in a local birth cohort (N = 97) then in a large, outpatient, pediatric sample (N = 14,695). RESULTS The modified Michaelis-Menten equation showed excellent fit for both infant weight (median RMSE: boys: 0.22 kg [IQR:0.19; 90% < 0.43]; girls: 0.20 kg [IQR:0.17; 90% < 0.39]) and height (median RMSE: boys: 0.93 cm [IQR:0.53; 90% < 1.0]; girls: 0.91 cm [IQR:0.50;90% < 1.0]). Growth data were modeled accurately with as few as four values from routine well-baby visits in year 1 and seven values in years 1-3; birth weight or length was essential for best fit. Interpolation with this equation had comparable (for weight) or lower (for height) mean RMSE compared to the best performing alternative models. CONCLUSIONS A modified Michaelis-Menten equation accurately describes growth in healthy babies aged 0-36 months, allowing interpolation of missing weight and height values in individual longitudinal measurement series. The growth pattern in healthy babies in resource-rich environments mirrors an enzymatic saturation curve.
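The abstract does not state the authors' modification, so the sketch below shows only the classical Michaelis-Menten saturation curve and the RMSE fit metric the abstract reports. The parameter values (`v_max = 16.0` kg, `k_m = 12.0` months) are hypothetical and chosen purely for illustration.

```python
import math


def michaelis_menten(x, v_max, k_m):
    """Classical saturation curve: 0 at x = 0, approaching v_max as x grows.

    k_m is the value of x at which the curve reaches half of v_max.
    """
    if x < 0 or k_m <= 0:
        raise ValueError("x must be non-negative and k_m positive")
    return v_max * x / (k_m + x)


def rmse(observed, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("sequences must be non-empty and of equal length")
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed))


# Hypothetical parameters: asymptote of 16 kg, half-saturation at 12 months.
ages_months = [2.0, 6.0, 12.0, 24.0, 36.0]
curve = [michaelis_menten(a, v_max=16.0, k_m=12.0) for a in ages_months]
print([round(w, 2) for w in curve])
```

Note that the classical curve passes through zero at `x = 0`, so it cannot represent a non-zero birth weight or length by itself; the abstract's remark that birth measurements were essential for best fit is consistent with the equation needing a modification, but the exact form should be taken from the paper.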
|
5
|
Defining Usual Oral Temperature Ranges in Outpatients Using an Unsupervised Learning Algorithm. JAMA Intern Med 2023; 183:1128-1135. [PMID: 37669046 PMCID: PMC10481327 DOI: 10.1001/jamainternmed.2023.4291] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/22/2023] [Accepted: 07/05/2023] [Indexed: 09/06/2023]
Abstract
Importance Although oral temperature is commonly assessed in medical examinations, the range of usual or "normal" temperature is poorly defined. Objective To determine normal oral temperature ranges by age, sex, height, weight, and time of day. Design, Setting, and Participants This cross-sectional study used clinical visit information from the divisions of Internal Medicine and Family Medicine in a single large medical care system. All adult outpatient encounters that included temperature measurements from April 28, 2008, through June 4, 2017, were eligible for inclusion. The LIMIT (Laboratory Information Mining for Individualized Thresholds) filtering algorithm was applied to iteratively remove encounters with primary diagnoses overrepresented in the tails of the temperature distribution, leaving only those diagnoses unrelated to temperature. Mixed-effects modeling was applied to the remaining temperature measurements to identify independent factors associated with normal oral temperature and to generate individualized normal temperature ranges. Data were analyzed from July 5, 2017, to June 23, 2023. Exposures Primary diagnoses and medications, age, sex, height, weight, time of day, and month, abstracted from each outpatient encounter. Main Outcomes and Measures Normal temperature ranges by age, sex, height, weight, and time of day. Results Of 618 306 patient encounters, 35.92% were removed by LIMIT because they included diagnoses or medications that fell disproportionately in the tails of the temperature distribution. The encounters removed due to overrepresentation in the upper tail were primarily linked to infectious diseases (76.81% of all removed encounters); type 2 diabetes was the only diagnosis removed for overrepresentation in the lower tail (15.71% of all removed encounters). The 396 195 encounters included in the analysis set consisted of 126 705 patients (57.35% women; mean [SD] age, 52.7 [15.9] years). 
Prior to running LIMIT, the mean (SD) overall oral temperature was 36.71 °C (0.43 °C); following LIMIT, the mean (SD) temperature was 36.64 °C (0.35 °C). Using mixed-effects modeling, age, sex, height, weight, and time of day accounted for 6.86% (overall) and up to 25.52% (per patient) of the observed variability in temperature. Mean normal oral temperature did not reach 37 °C for any subgroup; the upper 99th percentile ranged from 36.81 °C (a tall man with underweight aged 80 years at 8:00 am) to 37.88 °C (a short woman with obesity aged 20 years at 2:00 pm). Conclusions and Relevance The findings of this cross-sectional study suggest that normal oral temperature varies in an expected manner based on sex, age, height, weight, and time of day, allowing individualized normal temperature ranges to be established. The clinical significance of a value outside of the usual range is an area for future study.
|
6
|
A modified Michaelis-Menten equation estimates growth from birth to 3 years in healthy babies in the US. RESEARCH SQUARE 2023:rs.3.rs-2375831. [PMID: 36711501 PMCID: PMC9882604 DOI: 10.21203/rs.3.rs-2375831/v1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/15/2023]
Abstract
Background and Objectives Standard pediatric growth curves cannot be used to impute missing height or weight measurements in individual children. The Michaelis-Menten equation, used for characterizing substrate-enzyme saturation curves, has been shown to model growth in many organisms including nonhuman vertebrates. We investigated whether this equation could be used to interpolate missing growth data in children in the first three years of life. Methods We developed a modified Michaelis-Menten equation and compared expected to actual growth, first in a local birth cohort (N = 97) then in a large, outpatient, pediatric sample (N = 14,695). Results The modified Michaelis-Menten equation showed excellent fit for both infant weight (median RMSE: boys: 0.22 kg [IQR: 0.19; 90% < 0.43]; girls: 0.20 kg [IQR: 0.17; 90% < 0.39]) and height (median RMSE: boys: 0.93 cm [IQR: 0.53; 90% < 1.0]; girls: 0.91 cm [IQR: 0.50; 90% < 1.0]). Growth data were modeled accurately with as few as four values from routine well-baby visits in year 1 and seven values in years 1-3; birth weight or length was essential for best fit. Conclusions A modified Michaelis-Menten equation accurately describes growth in healthy babies aged 0-36 months, allowing interpolation of missing weight and height values in individual longitudinal measurement series. The growth pattern in healthy babies in resource-rich environments mirrors an enzymatic saturation curve.
|
7
|
A modified Michaelis-Menten equation estimates growth from birth to 3 years in healthy babies in the US. RESEARCH SQUARE 2023:rs.3.rs-2375831. [PMID: 36711501 PMCID: PMC9882604 DOI: 10.21203/rs.3.rs-2375831/v2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/13/2024]
|
8
|
Abstract
Canonical correlation analysis (CCA) is a technique for measuring the association between two multivariate data matrices. A regularized modification of canonical correlation analysis (RCCA) which imposes an ℓ2 penalty on the CCA coefficients is widely used in applications with high-dimensional data. One limitation of such regularization is that it ignores any data structure, treating all the features equally, which can be ill-suited for some applications. In this article we introduce several approaches to regularizing CCA that take the underlying data structure into account. In particular, the proposed group regularized canonical correlation analysis (GRCCA) is useful when the variables are correlated in groups. We illustrate some computational strategies to avoid excessive computations with regularized CCA in high dimensions. We demonstrate the application of these methods in our motivating application from neuroscience, as well as in a small simulation example.
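For orientation, the ℓ2-regularized CCA problem mentioned above is commonly written as follows. This is the usual textbook formulation, not taken from the abstract; the symbols (sample covariance blocks $\widehat{\Sigma}_{XX}$, $\widehat{\Sigma}_{YY}$, $\widehat{\Sigma}_{XY}$, coefficient vectors $\alpha$, $\beta$, and penalties $\lambda_1$, $\lambda_2$) are notation assumed here.

```latex
\max_{\alpha,\,\beta}\; \alpha^{\top}\widehat{\Sigma}_{XY}\,\beta
\quad\text{subject to}\quad
\alpha^{\top}\bigl(\widehat{\Sigma}_{XX} + \lambda_1 I\bigr)\alpha = 1,
\qquad
\beta^{\top}\bigl(\widehat{\Sigma}_{YY} + \lambda_2 I\bigr)\beta = 1
```

Setting $\lambda_1 = \lambda_2 = 0$ recovers ordinary CCA. The identity matrices penalize every coefficient equally, which is the "treating all the features equally" limitation that the group-structured variants are meant to address.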
|
9
|
Cross-validation: what does it estimate and how well does it do it? J Am Stat Assoc 2023. [DOI: 10.1080/01621459.2023.2197686] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/05/2023]
|
10
|
Abstract
The lasso and elastic net are popular regularized regression models for supervised learning. Friedman, Hastie, and Tibshirani (2010) introduced a computationally efficient algorithm for computing the elastic net regularization path for ordinary least squares regression, logistic regression and multinomial logistic regression, while Simon, Friedman, Hastie, and Tibshirani (2011) extended this work to Cox models for right-censored data. We further extend the reach of the elastic net-regularized regression to all generalized linear model families, Cox models with (start, stop] data and strata, and a simplified version of the relaxed lasso. We also discuss convenient utility functions for measuring the performance of these fitted models.
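As a reminder of what is being regularized, the elastic net objective for a generalized linear model can be written in the standard form below. The notation ($N$ observations, observation weights $w_i$, per-observation loss $\ell$ such as the negative log-likelihood, penalty strength $\lambda$, mixing parameter $\alpha$) is assumed here and does not appear in the abstract.

```latex
\min_{\beta_0,\,\beta}\;
\frac{1}{N}\sum_{i=1}^{N} w_i\,\ell\!\left(y_i,\; \beta_0 + x_i^{\top}\beta\right)
\;+\;
\lambda\left[\frac{1-\alpha}{2}\,\lVert\beta\rVert_2^2 + \alpha\,\lVert\beta\rVert_1\right]
```

With $\alpha = 1$ the penalty reduces to the lasso, and with $\alpha = 0$ to ridge regression; the regularization path is the set of solutions traced out as $\lambda$ decreases.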
|
11
|
A tissue atlas of ulcerative colitis revealing evidence of sex-dependent differences in disease-driving inflammatory cell types and resistance to TNF inhibitor therapy. SCIENCE ADVANCES 2023; 9:eadd1166. [PMID: 36662860 PMCID: PMC9858501 DOI: 10.1126/sciadv.add1166] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 05/21/2022] [Accepted: 12/16/2022] [Indexed: 06/01/2023]
Abstract
Although literature suggests that resistance to TNF inhibitor (TNFi) therapy in patients with ulcerative colitis (UC) is partially linked to immune cell populations in the inflamed region, there is still substantial uncertainty underlying the relevant spatial context. Here, we used the highly multiplexed immunofluorescence imaging technology CODEX to create a publicly browsable tissue atlas of inflammation in 42 tissue regions from 29 patients with UC and 5 healthy individuals. We analyzed 52 biomarkers on 1,710,973 spatially resolved single cells to determine cell types, cell-cell contacts, and cellular neighborhoods. We observed that cellular functional states are associated with cellular neighborhoods. We further observed that a subset of inflammatory cell types and cellular neighborhoods are present in patients with UC with TNFi treatment, potentially indicating resistant niches. Last, we explored applying convolutional neural networks (CNNs) to our dataset with respect to patient clinical variables. We note concerns and offer guidelines for reporting CNN-based predictions in similar datasets.
|
12
|
Abstract
In some supervised learning settings, the practitioner might have additional information on the features used for prediction. We propose a new method which leverages this additional information for better prediction. The method, which we call the feature-weighted elastic net ("fwelnet"), uses these "features of features" to adapt the relative penalties on the feature coefficients in the elastic net penalty. In our simulations, fwelnet outperforms the lasso in terms of test mean squared error and usually gives an improvement in true positive rate or false positive rate for feature selection. We also apply this method to early prediction of preeclampsia, where fwelnet outperforms the lasso in terms of 10-fold cross-validated area under the curve (0.86 vs. 0.80). We also provide a connection between fwelnet and the group lasso and suggest how fwelnet might be used for multi-task learning.
|
13
|
Generalized Matrix Factorization: efficient algorithms for fitting generalized linear latent variable models to large data arrays. JOURNAL OF MACHINE LEARNING RESEARCH : JMLR 2022; 23:291. [PMID: 37102181 PMCID: PMC10129058] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/03/2023]
Abstract
Unmeasured or latent variables are often the cause of correlations between multivariate measurements, which are studied in a variety of fields such as psychology, ecology, and medicine. For Gaussian measurements, there are classical tools such as factor analysis or principal component analysis with a well-established theory and fast algorithms. Generalized Linear Latent Variable models (GLLVMs) generalize such factor models to non-Gaussian responses. However, current algorithms for estimating model parameters in GLLVMs require intensive computation and do not scale to large datasets with thousands of observational units or responses. In this article, we propose a new approach for fitting GLLVMs to high-dimensional datasets, based on approximating the model using penalized quasi-likelihood and then using a Newton method and Fisher scoring to learn the model parameters. Computationally, our method is noticeably faster and more stable, enabling GLLVM fits to much larger matrices than previously possible. We apply our method on a dataset of 48,000 observational units with over 2,000 observed species in each unit and find that most of the variability can be explained with a handful of factors. We publish an easy-to-use implementation of our proposed fitting algorithm.
|
14
|
LARGE-SCALE MULTIVARIATE SPARSE REGRESSION WITH APPLICATIONS TO UK BIOBANK. Ann Appl Stat 2022; 16:1891-1918. [PMID: 36091495 PMCID: PMC9454085 DOI: 10.1214/21-aoas1575] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/19/2022]
Abstract
In high-dimensional regression problems, often a relatively small subset of the features are relevant for predicting the outcome, and methods that impose sparsity on the solution are popular. When multiple correlated outcomes are available (multitask), reduced rank regression is an effective way to borrow strength and capture latent structures that underlie the data. Our proposal is motivated by the UK Biobank population-based cohort study, where we are faced with large-scale, ultrahigh-dimensional features, and have access to a large number of outcomes (phenotypes): lifestyle measures, biomarkers, and disease outcomes. We are hence led to fit sparse reduced-rank regression models, using computational strategies that allow us to scale to problems of this size. We use a scheme that alternates between solving the sparse regression problem and solving the reduced rank decomposition. For the sparse regression component we propose a scalable iterative algorithm based on adaptive screening that leverages the sparsity assumption and enables us to focus on solving much smaller subproblems. The full solution is reconstructed and tested via an optimality condition to make sure it is a valid solution for the original problem. We further extend the method to cope with practical issues, such as the inclusion of confounding variables and imputation of missing values among the phenotypes. Experiments on both synthetic data and the UK Biobank data demonstrate the effectiveness of the method and the algorithm. We present the multiSnpnet package, available at http://github.com/junyangq/multiSnpnet, which works on top of PLINK2 files and which we anticipate will be a valuable tool for generating polygenic risk scores from human genetic studies.
|
16
|
Abstract
Interpolators (estimators that achieve zero training error) have attracted growing attention in machine learning, mainly because state-of-the-art neural networks appear to be models of this type. In this paper, we study minimum $\ell_2$ norm ("ridgeless") interpolation least squares regression, focusing on the high-dimensional regime in which the number of unknown parameters $p$ is of the same order as the number of samples $n$. We consider two different models for the feature distribution: a linear model, where the feature vectors $x_i \in \mathbb{R}^p$ are obtained by applying a linear transform to a vector of i.i.d. entries, $x_i = \Sigma^{1/2} z_i$ (with $z_i \in \mathbb{R}^p$); and a nonlinear model, where the feature vectors are obtained by passing the input through a random one-layer neural network, $x_i = \varphi(W z_i)$ (with $z_i \in \mathbb{R}^d$, $W \in \mathbb{R}^{p \times d}$ a matrix of i.i.d. entries, and $\varphi$ an activation function acting componentwise on $W z_i$). We recover, in a precise quantitative way, several phenomena that have been observed in large-scale neural networks and kernel machines, including the "double descent" behavior of the prediction risk, and the potential benefits of overparametrization.
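The minimum-norm least squares estimator described above can be stated explicitly. The following is the standard definition, written with a design matrix $X \in \mathbb{R}^{n \times p}$ and response $y \in \mathbb{R}^n$; this notation is assumed here, since the abstract does not introduce it.

```latex
\hat{\beta}
= \arg\min_{b}\Bigl\{ \lVert b \rVert_2 \;:\; b \text{ minimizes } \lVert y - Xb \rVert_2^2 \Bigr\}
= (X^{\top}X)^{+}X^{\top}y
= \lim_{\lambda \to 0^{+}} \bigl(X^{\top}X + n\lambda I\bigr)^{-1}X^{\top}y
```

Here $(\cdot)^{+}$ denotes the Moore-Penrose pseudoinverse. The last expression is the limit of ridge regression as the penalty vanishes, which is why the estimator is called "ridgeless"; when $p > n$ and $X$ has full row rank, $\hat{\beta}$ fits the training data exactly.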
|
17
|
Significant sparse polygenic risk scores across 813 traits in UK Biobank. PLoS Genet 2022; 18:e1010105. [PMID: 35324888 PMCID: PMC8946745 DOI: 10.1371/journal.pgen.1010105] [Citation(s) in RCA: 26] [Impact Index Per Article: 13.0] [Received: 09/08/2021] [Accepted: 02/15/2022] [Indexed: 01/05/2023] Open
Abstract
We present a systematic assessment of polygenic risk score (PRS) prediction across more than 1,500 traits using genetic and phenotype data in the UK Biobank. We report 813 sparse PRS models with significant ($p < 2.5 \times 10^{-5}$) incremental predictive performance when compared against the covariate-only model that considers age, sex, types of genotyping arrays, and the principal component loadings of genotypes. We report a significant correlation between the number of genetic variants selected in the sparse PRS model and the incremental predictive performance (Spearman's $\rho = 0.61$, $p = 2.2 \times 10^{-59}$ for quantitative traits, $\rho = 0.21$, $p = 9.6 \times 10^{-4}$ for binary traits). The sparse PRS model trained on European individuals showed limited transferability when evaluated on non-European individuals in the UK Biobank. We provide the PRS model weights on the Global Biobank Engine (https://biobankengine.stanford.edu/prs).
|
19
|
Scalable logistic regression with crossed random effects. Electron J Stat 2022. [DOI: 10.1214/22-ejs2047] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/19/2022]
|
20
|
Author Correction: An inflammatory aging clock (iAge) based on deep learning tracks multimorbidity, immunosenescence, frailty and cardiovascular aging. NATURE AGING 2021; 1:748. [PMID: 37117770 DOI: 10.1038/s43587-021-00102-x] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 04/30/2023]
|
21
|
An inflammatory aging clock (iAge) based on deep learning tracks multimorbidity, immunosenescence, frailty and cardiovascular aging. NATURE AGING 2021; 1:598-615. [PMID: 34888528 PMCID: PMC8654267 DOI: 10.1038/s43587-021-00082-y] [Citation(s) in RCA: 153] [Impact Index Per Article: 51.0] [Indexed: 12/29/2022]
Abstract
While many diseases of aging have been linked to the immunological system, immune metrics capable of identifying the most at-risk individuals are lacking. From the blood immunome of 1,001 individuals aged 8-96 years, we developed a deep-learning method based on patterns of systemic age-related inflammation. The resulting inflammatory clock of aging (iAge) tracked with multimorbidity, immunosenescence, frailty and cardiovascular aging, and is also associated with exceptional longevity in centenarians. The strongest contributor to iAge was the chemokine CXCL9, which was involved in cardiac aging, adverse cardiac remodeling and poor vascular function. Furthermore, aging endothelial cells in humans and mice show loss of function, cellular senescence and hallmark phenotypes of arterial stiffness, all of which are reversed by silencing CXCL9. In conclusion, we identify a key role of CXCL9 in age-related chronic inflammation and derive a metric for multimorbidity that can be utilized for the early detection of age-related clinical phenotypes.
|
22
|
Corrigendum to: Fast Lasso method for large-scale and ultrahigh-dimensional Cox model with applications to UK Biobank. Biostatistics 2021; 23:683. [PMID: 34269393 DOI: 10.1093/biostatistics/kxab019] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/12/2022] Open
|
23
|
Fast Numerical Optimization for Genome Sequencing Data in Population Biobanks. Bioinformatics 2021; 37:4148-4155. [PMID: 34146108 PMCID: PMC9206591 DOI: 10.1093/bioinformatics/btab452] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 02/17/2021] [Revised: 06/08/2021] [Accepted: 06/15/2021] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION Large-scale and high-dimensional genome sequencing data poses computational challenges. General purpose optimization tools are usually not optimal in terms of computational and memory performance for genetic data. RESULTS We develop two efficient solvers for optimization problems arising from large-scale regularized regressions on millions of genetic variants sequenced from hundreds of thousands of individuals. These genetic variants are encoded by the values in the set {0, 1, 2, NA}. We take advantage of this fact and use two bits to represent each entry in a genetic matrix, which reduces the memory requirement by a factor of 32 compared to a double precision floating point representation. Using this representation, we implemented an iteratively reweighted least squares algorithm to solve Lasso regressions on genetic matrices, which we name snpnet-2.0. When the dataset contains many rare variants, the predictors can be encoded in a sparse matrix. We utilize the sparsity in the predictor matrix to further reduce the memory requirement and improve computational speed. Our sparse genetic matrix implementation uses both the compact 2-bit representation and a simplified version of compressed sparse block format so that matrix-vector multiplications can be effectively parallelized on multiple CPU cores. To demonstrate the effectiveness of this representation, we implement an accelerated proximal gradient method to solve group Lasso on these sparse genetic matrices. This solver is named sparse-snpnet, and will also be included as part of the snpnet R package. Our implementation is able to solve Lasso and group Lasso, linear, logistic and Cox regression problems on sparse genetic matrices that contain 1,000,000 variants and almost 100,000 individuals within 10 minutes and using less than 32 GB of memory. AVAILABILITY https://github.com/rivas-lab/snpnet/tree/compact.
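The factor-of-32 saving follows from storing each genotype in 2 bits instead of a 64-bit double. The sketch below illustrates the idea in plain Python. The function names and the choice of code `3` for a missing value are assumptions made for this example; they are not the snpnet implementation and do not follow the PLINK file encoding.

```python
import struct

NA_CODE = 3  # assumed 2-bit code for a missing genotype (illustrative choice)


def pack_genotypes(genotypes):
    """Pack genotypes in {0, 1, 2, None} into bytes, four genotypes per byte."""
    packed = bytearray((len(genotypes) + 3) // 4)
    for i, g in enumerate(genotypes):
        if g is None:
            code = NA_CODE
        elif g in (0, 1, 2):
            code = g
        else:
            raise ValueError(f"invalid genotype at index {i}: {g!r}")
        # Each genotype occupies 2 bits; position within the byte is i % 4.
        packed[i // 4] |= code << (2 * (i % 4))
    return bytes(packed)


def unpack_genotypes(packed, n):
    """Recover the first n genotypes from a packed byte string."""
    out = []
    for i in range(n):
        code = (packed[i // 4] >> (2 * (i % 4))) & 0b11
        out.append(None if code == NA_CODE else code)
    return out


genotypes = [0, 1, 2, None, 2, 2, 0, 1]
packed = pack_genotypes(genotypes)
double_bytes = len(genotypes) * struct.calcsize("d")  # 8 bytes per double
print(len(packed), double_bytes, double_bytes / len(packed))
```

Eight genotypes fit in two bytes, against 64 bytes as doubles. A production solver would operate on the packed words directly rather than unpacking, which this sketch does not attempt.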
|
24
|
Wearable sensors enable personalized predictions of clinical laboratory measurements. Nat Med 2021; 27:1105-1112. [PMID: 34031607 DOI: 10.1038/s41591-021-01339-0] [Citation(s) in RCA: 77] [Impact Index Per Article: 25.7] [Received: 09/17/2018] [Accepted: 04/06/2021] [Indexed: 01/01/2023]
Abstract
Vital signs, including heart rate and body temperature, are useful in detecting or monitoring medical conditions, but are typically measured in the clinic and require follow-up laboratory testing for more definitive diagnoses. Here we examined whether vital signs as measured by consumer wearable devices (that is, continuously monitored heart rate, body temperature, electrodermal activity and movement) can predict clinical laboratory test results using machine learning models, including random forest and Lasso models. Our results demonstrate that vital sign data collected from wearables give a more consistent and precise depiction of resting heart rate than do measurements taken in the clinic. Vital sign data collected from wearables can also predict several clinical laboratory measurements with lower prediction error than predictions made using clinically obtained vital sign measurements. The length of time over which vital signs are monitored and the proximity of the monitoring period to the date of prediction play a critical role in the performance of the machine learning models. These results demonstrate the value of commercial wearable devices for continuous and longitudinal assessment of physiological measurements that today can be measured only with clinical laboratory tests.
|
25
|
Assessment of heterogeneous treatment effect estimation accuracy via matching. Stat Med 2021; 40:3990-4013. [PMID: 33915600 DOI: 10.1002/sim.9010] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/21/2020] [Revised: 04/02/2021] [Accepted: 04/12/2021] [Indexed: 11/08/2022]
Abstract
We study the assessment of the accuracy of heterogeneous treatment effect (HTE) estimation, where the HTE is not directly observable so standard computation of prediction errors is not applicable. To tackle the difficulty, we propose an assessment approach by constructing pseudo-observations of the HTE based on matching. Our contributions are three-fold: first, we introduce a novel matching distance derived from proximity scores in random forests; second, we formulate the matching problem as an average minimum-cost flow problem and provide an efficient algorithm; third, we propose a match-then-split principle for the assessment with cross-validation. We demonstrate the efficacy of the assessment approach using simulations and a real dataset.
|
26
|
Survival Analysis on Rare Events Using Group-Regularized Multi-Response Cox Regression. Bioinformatics 2021; 37:4437-4443. [PMID: 33560296 PMCID: PMC8652035 DOI: 10.1093/bioinformatics/btab095] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/23/2020] [Revised: 01/27/2021] [Accepted: 02/05/2021] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION The prediction performance of the Cox proportional hazards model suffers when there are only a few uncensored events in the training data. RESULTS We propose a Sparse-Group regularized Cox regression method to improve the prediction performance on large-scale and high-dimensional survival data with few observed events. Our approach is applicable when there are one or more other survival responses that (1) have a large number of observed events and (2) share a common set of associated predictors with the rare-event response. This scenario is common in the UK Biobank (Sudlow et al., 2015) dataset, where records for a large number of common and less prevalent diseases of the same set of individuals are available. By analyzing these responses together, we hope to achieve higher prediction performance than when they are analyzed individually. To make this approach practical for large-scale data, we developed an accelerated proximal gradient optimization algorithm as well as a screening procedure inspired by Qian et al. (2020). AVAILABILITY https://github.com/rivas-lab/multisnpnet-Cox. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
|
27
|
Polygenic risk modeling with latent trait-related genetic components. Eur J Hum Genet 2021; 29:1071-1081. [PMID: 33558700 DOI: 10.1038/s41431-021-00813-0] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/29/2020] [Revised: 12/26/2020] [Accepted: 01/14/2021] [Indexed: 02/06/2023] Open
Abstract
Polygenic risk models have led to significant advances in understanding complex diseases and their clinical presentation. While polygenic risk scores (PRS) can effectively predict outcomes, they do not generally account for disease subtypes or pathways which underlie within-trait diversity. Here, we introduce a latent factor model of genetic risk based on components from Decomposition of Genetic Associations (DeGAs), which we call the DeGAs polygenic risk score (dPRS). We compute DeGAs using genetic associations for 977 traits and find that dPRS performs comparably to standard PRS while offering greater interpretability. We show how to decompose an individual's genetic risk for a trait across DeGAs components, with examples for body mass index (BMI) and myocardial infarction (heart attack) in 337,151 white British individuals in the UK Biobank, with replication in a further set of 25,486 non-British white individuals. We find that BMI polygenic risk factorizes into components related to fat-free mass, fat mass, and overall health indicators like physical activity. Most individuals with high dPRS for BMI have strong contributions from both a fat-mass component and a fat-free mass component, whereas a few "outlier" individuals have strong contributions from only one of the two components. Overall, our method enables fine-scale interpretation of the drivers of genetic risk for complex traits.
|
28
|
Discussion of “Prediction, Estimation, and Attribution” by Bradley Efron. Int Stat Rev 2020. [DOI: 10.1111/insr.12414] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/01/2022]
|
29
|
Rejoinder: Best Subset, Forward Stepwise or Lasso? Analysis and Recommendations Based on Extensive Comparisons. Stat Sci 2020. [DOI: 10.1214/20-sts733rej] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/19/2022]
|
30
|
Best Subset, Forward Stepwise or Lasso? Analysis and Recommendations Based on Extensive Comparisons. Stat Sci 2020. [DOI: 10.1214/19-sts733] [Citation(s) in RCA: 18] [Impact Index Per Article: 4.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/19/2022]
|
31
|
A fast and scalable framework for large-scale and ultrahigh-dimensional sparse regression with application to the UK Biobank. PLoS Genet 2020; 16:e1009141. [PMID: 33095761 PMCID: PMC7641476 DOI: 10.1371/journal.pgen.1009141] [Citation(s) in RCA: 51] [Impact Index Per Article: 12.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/21/2020] [Revised: 11/04/2020] [Accepted: 09/04/2020] [Indexed: 11/18/2022] Open
Abstract
The UK Biobank is a very large, prospective population-based cohort study across the United Kingdom. It provides unprecedented opportunities for researchers to investigate the relationship between genotypic information and phenotypes of interest. Multiple regression methods, compared with genome-wide association studies (GWAS), have already been shown to greatly improve the prediction performance for a variety of phenotypes. In high-dimensional settings, the lasso, since its first proposal in statistics, has proved to be an effective method for simultaneous variable selection and estimation. However, the large scale and ultrahigh dimension seen in the UK Biobank pose new challenges for applying the lasso method, as many existing algorithms and their implementations are not scalable to large applications. In this paper, we propose a computational framework called batch screening iterative lasso (BASIL) that can take advantage of any existing lasso solver and easily build a scalable solution for very large data, including those that are larger than the memory size. We introduce snpnet, an R package that implements the proposed algorithm on top of glmnet and optimizes for single nucleotide polymorphism (SNP) datasets. It currently supports the ℓ1-penalized linear model, logistic regression, and Cox model, and also extends to the elastic net with ℓ1/ℓ2 penalty. We demonstrate results on the UK Biobank dataset, where we achieve competitive predictive performance for all four phenotypes considered (height, body mass index, asthma, high cholesterol) using only a small fraction of the variants compared with other established polygenic risk score methods.
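The screen-fit-check idea behind a batch screening approach can be sketched in a few lines of NumPy. This is a deliberately simplified illustration, not the BASIL algorithm or the snpnet code: it handles a single penalty value rather than a path, uses a plain coordinate-descent solver in place of glmnet, and every name here (`lasso_cd`, `screened_lasso`, `batch_size`) is hypothetical.

```python
import numpy as np


def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)


def lasso_cd(X, y, lam, beta0, n_sweeps=500, tol=1e-10):
    """Coordinate descent for (1/2n)||y - X b||^2 + lam * ||b||_1."""
    n, p = X.shape
    beta = beta0.copy()
    col_sq = (X ** 2).sum(axis=0) / n
    r = y - X @ beta  # residual, kept up to date as coefficients change
    for _ in range(n_sweeps):
        max_change = 0.0
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            old = beta[j]
            rho = X[:, j] @ r / n + col_sq[j] * old
            new = soft_threshold(rho, lam) / col_sq[j]
            if new != old:
                r -= X[:, j] * (new - old)
                beta[j] = new
                max_change = max(max_change, abs(new - old))
        if max_change < tol:
            break
    return beta


def screened_lasso(X, y, lam, batch_size=10, max_rounds=50):
    """Solve the lasso by repeatedly screening, fitting a subproblem, and checking KKT."""
    n, p = X.shape
    beta = np.zeros(p)
    working = np.zeros(p, dtype=bool)
    for _ in range(max_rounds):
        score = np.abs(X.T @ (y - X @ beta)) / n
        # KKT check: a variable left out of the fit must have score <= lam.
        violators = (~working) & (score > lam * (1 + 1e-8))
        if not violators.any():
            break
        # Screening: admit the highest-scoring variables not yet in the working set.
        candidates = np.where(~working)[0]
        top = candidates[np.argsort(score[candidates])[::-1][:batch_size]]
        working[top] = True
        idx = np.where(working)[0]
        beta[idx] = lasso_cd(X[:, idx], y, lam, beta[idx])  # warm-started subproblem
    return beta
```

Only the columns in the working set are passed to the solver, so each subproblem stays small, while the KKT check over all variables is what justifies stopping. If `max_rounds` is exhausted, the returned coefficients are not guaranteed to satisfy that check.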
|
32
|
Integration of mechanistic immunological knowledge into a machine learning pipeline improves predictions. NAT MACH INTELL 2020; 2:619-628. [PMID: 33294774 PMCID: PMC7720904 DOI: 10.1038/s42256-020-00232-8] [Citation(s) in RCA: 36] [Impact Index Per Article: 9.0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/13/2020] [Accepted: 08/26/2020] [Indexed: 12/17/2022]
Abstract
The dense network of interconnected cellular signalling responses that are quantifiable in peripheral immune cells provides a wealth of actionable immunological insights. Although high-throughput single-cell profiling techniques, including polychromatic flow and mass cytometry, have matured to a point that enables detailed immune profiling of patients in numerous clinical settings, the limited cohort size and high dimensionality of data increase the possibility of false-positive discoveries and model overfitting. We introduce a generalizable machine learning platform, the immunological Elastic-Net (iEN), which incorporates immunological knowledge directly into the predictive models. Importantly, the algorithm maintains the exploratory nature of the high-dimensional dataset, allowing for the inclusion of immune features with strong predictive capabilities even if not consistent with prior knowledge. In three independent studies our method demonstrates improved predictions for clinically relevant outcomes from mass cytometry data generated from whole blood, as well as a large simulated dataset. The iEN is available under an open-source licence.
|
33
|
Abstract
Breakthroughs in artificial intelligence (AI) hold enormous potential, as AI can automate complex tasks and even surpass human performance. In their study, McKinney et al. showed the high potential of AI for breast cancer screening. However, the lack of methodological detail and algorithm code undermines its scientific value. Here, we identify obstacles hindering transparent and reproducible AI research as faced by McKinney et al., and provide solutions to these obstacles with implications for the broader field.
|
34
|
Fast Lasso method for large-scale and ultrahigh-dimensional Cox model with applications to UK Biobank. Biostatistics 2020; 23:522-540. [PMID: 32989444 DOI: 10.1093/biostatistics/kxaa038] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/11/2020] [Revised: 08/15/2020] [Accepted: 08/18/2020] [Indexed: 11/13/2022] Open
Abstract
We develop a scalable and highly efficient algorithm to fit a Cox proportional hazards model by maximizing the $L^1$-regularized (Lasso) partial likelihood function, based on the Batch Screening Iterative Lasso (BASIL) method developed in Qian and others (2019). Our algorithm is particularly suitable for large-scale and high-dimensional data that do not fit in memory. The output of our algorithm is the full Lasso path, that is, the parameter estimates at all predefined regularization parameters, as well as their validation accuracy measured using the concordance index (C-index) or the validation deviance. To demonstrate the effectiveness of our algorithm, we analyze a large genotype-survival time dataset across 306 disease outcomes from the UK Biobank (Sudlow and others, 2015). We provide a publicly available implementation of the proposed approach for genetics data on top of the PLINK2 package and name it snpnet-Cox.
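For readers unfamiliar with the concordance index used as a validation metric here, the following is a minimal sketch of Harrell's C-index for right-censored data. It is an illustration only: the abstract does not say how snpnet-Cox handles ties, so skipping pairs with tied times and counting tied risk scores as half-concordant are assumptions, and the function name is hypothetical.

```python
import numpy as np


def concordance_index(time, event, risk):
    """Fraction of comparable pairs in which the higher risk score fails first.

    A pair (i, j) is comparable when subject i has an observed event and
    subject j is still under observation at a strictly later time.
    """
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    risk = np.asarray(risk, dtype=float)
    concordant = 0.0
    comparable = 0
    for i in range(time.size):
        if not event[i]:
            continue  # a censored subject cannot be the earlier member of a pair
        for j in range(time.size):
            if time[j] > time[i]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5  # assumed convention for tied scores
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

A value of 1.0 means the risk scores order every comparable pair correctly, 0.5 corresponds to uninformative scores, and 0.0 means the ordering is fully reversed. This quadratic-time loop is meant for clarity and would be too slow at biobank scale.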
|
35
|
|
36
|
Projected geographic disparities in healthcare worker absenteeism from COVID-19 school closures and the economic feasibility of child care subsidies: a simulation study. BMC Med 2020; 18:218. [PMID: 32664927 PMCID: PMC7360472 DOI: 10.1186/s12916-020-01692-w] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 04/08/2020] [Accepted: 07/01/2020] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND School closures have been enacted as a measure of mitigation during the ongoing coronavirus disease 2019 (COVID-19) pandemic. It has been shown that school closures could cause absenteeism among healthcare workers with dependent children, but there remains a need for spatially granular analyses of the relationship between school closures and healthcare worker absenteeism to inform local community preparedness. METHODS We provide national- and county-level simulations of school closures and unmet child care needs across the USA. We develop individual simulations using county-level demographic and occupational data, and model school closure effectiveness with age-structured compartmental models. We perform multivariate quasi-Poisson ecological regressions to find associations between unmet child care needs and COVID-19 vulnerability factors. RESULTS At the national level, we estimate the projected rate of unmet child care needs for healthcare worker households to range from 7.4 to 8.7%, and the effectiveness of school closures as a 7.6% and 8.4% reduction in hospital and intensive care unit (ICU) beds, respectively, at peak demand when varying across initial reproduction number estimates by state. At the county level, we find substantial variations of projected unmet child care needs and school closure effects, 9.5% (interquartile range (IQR) 8.2-10.9%) of healthcare worker households and 5.2% (IQR 4.1-6.5%) and 6.8% (IQR 4.8-8.8%) reduction in hospital and ICU beds, respectively, at peak demand. We find significant positive associations between estimated levels of unmet child care needs and diabetes prevalence, county rurality, and race (p<0.05). We estimate costs of absenteeism and child care and observe from our models that an estimated 76.3 to 96.8% of counties would find it less expensive to provide child care to all healthcare workers with children than to bear the costs of healthcare worker absenteeism during school closures.
CONCLUSIONS School closures are projected to reduce peak ICU and hospital demand, but could disrupt healthcare systems through absenteeism, especially in counties that are already particularly vulnerable to COVID-19. Child care subsidies could help circumvent the ostensible trade-off between school closures and healthcare worker absenteeism.
|
37
|
Perioperative analgesic administration during the 2018 parenteral opioid shortage in the United States - A retrospective analysis. J Clin Anesth 2020; 66:109892. [PMID: 32502773 DOI: 10.1016/j.jclinane.2020.109892] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/12/2020] [Revised: 04/15/2020] [Accepted: 05/19/2020] [Indexed: 10/24/2022]
|
38
|
Molecular Transducers of Physical Activity Consortium (MoTrPAC): Mapping the Dynamic Responses to Exercise. Cell 2020; 181:1464-1474. [DOI: 10.1016/j.cell.2020.06.004] [Citation(s) in RCA: 52] [Impact Index Per Article: 13.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/05/2020] [Revised: 05/19/2020] [Accepted: 06/01/2020] [Indexed: 12/31/2022]
|
39
|
Projected geographic disparities in healthcare worker absenteeism from COVID-19 school closures and the economic feasibility of child care subsidies: a simulation study. MEDRXIV : THE PREPRINT SERVER FOR HEALTH SCIENCES 2020:2020.03.19.20039404. [PMID: 32511455 PMCID: PMC7239083 DOI: 10.1101/2020.03.19.20039404] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 01/24/2023]
Abstract
Background School closures have been enacted as a measure of mitigation during the ongoing COVID-19 pandemic. It has been shown that school closures could cause absenteeism amongst healthcare workers with dependent children, but there remains a need for spatially granular analyses of the relationship between school closures and healthcare worker absenteeism to inform local community preparedness. Methods We provide national- and county-level simulations of school closures and unmet child care needs across the United States. We develop individual simulations using county-level demographic and occupational data, and model school closure effectiveness with age-structured compartmental models. We perform multivariate quasi-Poisson ecological regressions to find associations between unmet child care needs and COVID-19 vulnerability factors. Results At the national level, we estimate the projected rate of unmet child care needs for healthcare worker households to range from 7.5% to 8.6%, and the effectiveness of school closures to range from a 3.2% (R0 = 4) to a 7.2% (R0 = 2) reduction in ICU beds at peak demand. At the county level, we find substantial variations of projected unmet child care needs and school closure effects, ranging from 1.9% to 18.3% of healthcare worker households and from a 5.7% to an 8.8% reduction in ICU beds at peak demand (R0 = 2). We find significant positive associations between estimated levels of unmet child care needs and diabetes prevalence, county rurality, and race (p < 0.05). We estimate costs of absenteeism and child care and observe from our models that an estimated 71.1% to 98.8% of counties would find it less expensive to provide child care to all healthcare workers with children than to bear the costs of healthcare worker absenteeism during school closures.
Conclusions School closures are projected to reduce peak ICU bed demand, but could disrupt healthcare systems through absenteeism, especially in counties that are already particularly vulnerable to COVID-19. Child care subsidies could help circumvent the ostensible tradeoff between school closures and healthcare worker absenteeism.
|
40
|
Discussion of “Prediction, Estimation, and Attribution” by Bradley Efron. J Am Stat Assoc 2020. [DOI: 10.1080/01621459.2020.1762617] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/24/2022]
|
41
|
Decreasing human body temperature in the United States since the industrial revolution. eLife 2020; 9:49555. [PMID: 31908267 PMCID: PMC6946399 DOI: 10.7554/elife.49555] [Citation(s) in RCA: 71] [Impact Index Per Article: 17.8] [Reference Citation Analysis] [Abstract] [Key Words] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/21/2019] [Accepted: 12/01/2019] [Indexed: 01/21/2023] Open
Abstract
In the US, the normal oral temperature of adults is, on average, lower than the canonical 37°C established in the 19th century. We postulated that body temperature has decreased over time. Using measurements from three cohorts, namely the Union Army Veterans of the Civil War (N = 23,710; measurement years 1860–1940), the National Health and Nutrition Examination Survey I (N = 15,301; 1971–1975), and the Stanford Translational Research Integrated Database Environment (N = 150,280; 2007–2017), we determined that mean body temperature in men and women, after adjusting for age, height, weight and, in some models, date and time of day, has decreased monotonically by 0.03°C per birth decade. A similar decline within the Union Army cohort as between cohorts makes measurement error an unlikely explanation. This substantive and continuing shift in body temperature, a marker for metabolic rate, provides a framework for understanding changes in human health and longevity over 157 years.
|
42
|
Abstract
PURPOSE The preoperative distinction between uterine leiomyoma (LM) and leiomyosarcoma (LMS) is difficult, which may result in dissemination of an unexpected malignancy during surgery for a presumed benign lesion. An assay based on circulating tumor DNA (ctDNA) could help in the preoperative distinction between LM and LMS. This study addresses the feasibility of applying the two most frequently used approaches for detection of ctDNA: profiling of copy number alterations (CNAs) and point mutations in the plasma of patients with LM. PATIENTS AND METHODS By shallow whole-genome sequencing, we prospectively examined whether LM-derived ctDNA could be detected in plasma specimens of 12 patients. Plasma levels of lactate dehydrogenase, a marker suggested for the distinction between LM and LMS by prior studies, were also determined. We also profiled 36 LM tumor specimens by exome sequencing to develop a panel for targeted detection of point mutations in ctDNA of patients with LM. RESULTS We identified tumor-derived CNAs in the plasma DNA of 50% (six of 12) of patients with LM. The lactate dehydrogenase levels did not allow for an accurate distinction between patients with LM and patients with LMS. We identified only two recurrently mutated genes in LM tumors (MED12 and ACLY). CONCLUSION Our results show that LMs do shed DNA into the circulation, which provides an opportunity for the development of ctDNA-based testing to distinguish LM from LMS. Although we could not design an LM-specific panel for ctDNA profiling, we propose that the detection of CNAs or point mutations in selected tumor suppressor genes in ctDNA may favor a diagnosis of LMS, since these genes are not affected in LM.
|
43
|
A clinico-genomic analysis of soft tissue sarcoma patients reveals CDKN2A deletion as a biomarker for poor prognosis. Clin Sarcoma Res 2019; 9:12. [PMID: 31528332 PMCID: PMC6739971 DOI: 10.1186/s13569-019-0122-5] [Citation(s) in RCA: 42] [Impact Index Per Article: 8.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/29/2019] [Accepted: 08/28/2019] [Indexed: 01/19/2023] Open
Abstract
Background Sarcomas are a rare, heterogeneous group of tumors with variable tendencies for aggressive behavior. Molecular markers for prognosis are needed to risk stratify patients and identify those who might benefit from more intensive therapeutic strategies. Patients and methods We analyzed somatic tumor genomic profiles and clinical outcomes of 152 soft tissue (STS) and bone sarcoma (BS) patients sequenced at Stanford Cancer Institute as well as 206 STS patients from The Cancer Genome Atlas. Genomic profiles of 7733 STS from the Foundation Medicine database were used to assess the frequency of CDKN2A alterations in histological subtypes of sarcoma. Results Compared to all other tumor types, sarcomas were found to carry the highest relative percentage of gene amplifications/deletions/fusions and the lowest average mutation count. The most commonly altered genes in STS were TP53 (47%), CDKN2A (22%), RB1 (22%), NF1 (11%), and ATRX (11%). When all genomic alterations were tested for prognostic significance in the specific Stanford cohort of localized STS, only CDKN2A alterations correlated significantly with prognosis, with a hazard ratio (HR) of 2.83 for overall survival (p = 0.017). These findings were validated in the TCGA dataset where CDKN2A altered patients had significantly worse overall survival with a HR of 2.7 (p = 0.002). Analysis of 7733 STS patients from Foundation One showed high prevalence of CDKN2A alterations in malignant peripheral nerve sheath tumors, myxofibrosarcomas, and undifferentiated pleomorphic sarcomas. Conclusion Our clinico-genomic profiling of STS shows that CDKN2A deletion was the most prevalent DNA copy number aberration and was associated with poor prognosis.
|
44
|
CAUSAL INTERPRETATIONS OF BLACK-BOX MODELS. JOURNAL OF BUSINESS & ECONOMIC STATISTICS : A PUBLICATION OF THE AMERICAN STATISTICAL ASSOCIATION 2019; 2019:10.1080/07350015.2019.1624293. [PMID: 33132490 PMCID: PMC7597863 DOI: 10.1080/07350015.2019.1624293] [Citation(s) in RCA: 79] [Impact Index Per Article: 15.8] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/22/2023]
Abstract
The fields of machine learning and causal inference have developed many concepts, tools, and theory that are potentially useful for each other. Through exploring the possibility of extracting causal interpretations from black-box machine-trained models, we briefly review the languages and concepts in causal inference that may be interesting to machine learning researchers. We start with the curious observation that Friedman's partial dependence plot has exactly the same formula as Pearl's back-door adjustment and discuss three requirements to make causal interpretations: a model with good predictive performance, some domain knowledge in the form of a causal diagram and suitable visualization tools. We provide several illustrative examples and find some interesting and potentially causal relations using visualization tools for black-box models.
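The correspondence noted in this abstract can be written out explicitly. The notation below is assumed for illustration and does not come from the abstract: $f$ is the fitted model, $X_S$ the features of interest, and $X_C$ the remaining features.

$$
\bar f_S(x_S) \;=\; \mathbb{E}_{X_C}\!\left[f(x_S, X_C)\right] \;=\; \int f(x_S, x_C)\, dP(x_C)
$$

$$
\mathbb{E}\!\left[Y \mid do(X_S = x_S)\right] \;=\; \int \mathbb{E}\!\left[Y \mid X_S = x_S,\, X_C = x_C\right] dP(x_C)
$$

The first line is the partial dependence function and the second is the back-door adjustment formula. The two agree when $f(x_S, x_C)$ is a good estimate of $\mathbb{E}[Y \mid X_S = x_S, X_C = x_C]$ and $X_C$ satisfies the back-door criterion relative to $(X_S, Y)$ in the assumed causal diagram, which corresponds to the requirements of predictive performance and domain knowledge listed in the abstract.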
|
45
|
Association of cardiovascular events and lipoprotein particle size: Development of a risk score based on functional data analysis. PLoS One 2019; 14:e0213172. [PMID: 30845215 PMCID: PMC6405139 DOI: 10.1371/journal.pone.0213172] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/02/2018] [Accepted: 02/17/2019] [Indexed: 11/20/2022] Open
Abstract
BACKGROUND Functional data is data represented by functions (curves or surfaces of a low-dimensional index). Functional data often arise when measurements are collected over time or across locations. In the field of medicine, plasma lipoprotein particles can be quantified according to particle diameter by ion mobility. GOAL We wanted to evaluate the utility of functional analysis for assessing the association of plasma lipoprotein size distribution with cardiovascular disease after adjustment for established risk factors including standard lipids. METHODS We developed a model to predict risk of cardiovascular disease among participants in a case-cohort study of the Malmö Prevention Project. We used a linear model with 311 coefficients, corresponding to measures of lipoprotein mass at each of 311 diameters, and assumed these coefficients varied smoothly along the diameter index. The smooth function was represented as an expansion of natural cubic splines where the smoothness parameter was chosen by assessment of a series of nested splines. Cox proportional hazards models of time to a first cardiovascular disease event were used to estimate the smooth coefficient function among a training set consisting of one half of the participants. The resulting model was used to calculate a functional risk score for the remaining half of the participants (test set) and its association with events was assessed in Cox models that adjusted for traditional cardiovascular risk factors. RESULTS In the test set, participants with a functional risk score in the highest quartile were found to be at increased risk of cardiovascular events compared with the lowest quartile (Hazard ratio = 1.34; 95% Confidence Interval: 1.05 to 1.70) after adjustment for established risk factors. 
CONCLUSION In an independent test set of Malmö Prevention Project participants, the functional risk score was found to be associated with cardiovascular events after adjustment for traditional risk factors including standard lipids.
|
46
|
|
47
|
|
48
|
|
49
|
Saturating Splines and Feature Selection. JOURNAL OF MACHINE LEARNING RESEARCH : JMLR 2018; 18:197. [PMID: 31007630 PMCID: PMC6474379] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Figures] [Subscribe] [Scholar Register] [Indexed: 06/09/2023]
Abstract
We extend the adaptive regression spline model by incorporating saturation, the natural requirement that a function extend as a constant outside a certain range. We fit saturating splines to data via a convex optimization problem over a space of measures, which we solve using an efficient algorithm based on the conditional gradient method. Unlike many existing approaches, our algorithm solves the original infinite-dimensional (for splines of degree at least two) optimization problem without pre-specified knot locations. We then adapt our algorithm to fit generalized additive models with saturating splines as coordinate functions and show that the saturation requirement allows our model to simultaneously perform feature selection and nonlinear function fitting. Finally, we briefly sketch how the method can be extended to higher order splines and to different requirements on the extension outside the data range.
|
50
|
Gene expression profiling of low-grade endometrial stromal sarcoma indicates fusion protein-mediated activation of the Wnt signaling pathway. Gynecol Oncol 2018; 149:388-393. [PMID: 29544705 DOI: 10.1016/j.ygyno.2018.03.007] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/22/2018] [Revised: 03/03/2018] [Accepted: 03/07/2018] [Indexed: 12/12/2022]
Abstract
OBJECTIVE Low-grade endometrial stromal sarcomas (LGESS) harbor chromosomal translocations that affect proteins associated with the chromatin-remodeling Polycomb Repressive Complex 2 (PRC2), including SUZ12, PHF1 and EPC1. Roughly half of LGESS also demonstrate nuclear accumulation of β-catenin, which is a hallmark of Wnt signaling activation. However, the targets affected by the fusion proteins and the role of Wnt signaling in the pathogenesis of these tumors remain largely unknown. METHODS Here we report the results of a meta-analysis of three independent gene expression profiling studies on LGESS and immunohistochemical evaluation of nuclear expression of β-catenin and Lef1 in 112 uterine sarcoma specimens obtained from 20 LGESS and 89 LMS patients. RESULTS Our results demonstrate that 143 out of 310 genes overexpressed in LGESS are known to be directly regulated by SUZ12. In addition, our gene expression meta-analysis shows activation of multiple genes implicated in Wnt signaling. We further emphasize the role of the Wnt signaling pathway by demonstrating concordant nuclear expression of β-catenin and Lef1 in 7/16 LGESS. CONCLUSIONS Based on our findings, we suggest that LGESS-specific fusion proteins disrupt the repressive function of the PRC2 complex, similar to the mechanism seen in synovial sarcoma, where the SS18-SSX fusion proteins disrupt the mSWI/SNF (BAF) chromatin remodeling complex. We propose that these fusion proteins in LGESS contribute to overexpression of Wnt ligands with subsequent activation of the Wnt signaling pathway and formation of an active β-catenin/Lef1 transcriptional complex. These observations could lead to novel therapeutic approaches that focus on the Wnt pathway in LGESS.
|