1
An omics-based machine learning approach to predict diabetes progression: a RHAPSODY study. Diabetologia 2024; 67:885-894. [PMID: 38374450 PMCID: PMC10954972 DOI: 10.1007/s00125-024-06105-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/03/2023] [Accepted: 01/05/2024] [Indexed: 02/21/2024]
Abstract
AIMS/HYPOTHESIS People with type 2 diabetes are heterogeneous in their disease trajectory, with some progressing more quickly to insulin initiation than others. Although classical biomarkers such as age, HbA1c and diabetes duration are associated with glycaemic progression, it is unclear how well such variables predict insulin initiation or requirement and whether newly identified markers have added predictive value.
METHODS In two prospective cohort studies as part of IMI-RHAPSODY, we investigated whether clinical variables and three types of molecular markers (metabolites, lipids, proteins) can predict time to insulin requirement using different machine learning approaches (lasso, ridge, GRridge, random forest). Clinical variables included age, sex, HbA1c, HDL-cholesterol and C-peptide. Models were run with unpenalised clinical variables (i.e. always included in the model without weights) or penalised clinical variables, or without clinical variables. Model development was performed in one cohort and the model was applied in a second cohort. Model performance was evaluated using Harrell's C statistic.
RESULTS Of the 585 individuals from the Hoorn Diabetes Care System (DCS) cohort, 69 required insulin during follow-up (1.0-11.4 years); of the 571 individuals in the Genetics of Diabetes Audit and Research in Tayside Scotland (GoDARTS) cohort, 175 required insulin during follow-up (0.3-11.8 years). Overall, the clinical variables and proteins were selected in the different models most often, followed by the metabolites. The most frequently selected clinical variables were HbA1c (18 of the 36 models, 50%), age (15 models, 41.2%) and C-peptide (15 models, 41.2%). Base models (age, sex, BMI, HbA1c) including only clinical variables performed moderately in both the DCS discovery cohort (C statistic 0.71 [95% CI 0.64, 0.79]) and the GoDARTS replication cohort (C 0.71 [95% CI 0.69, 0.75]). A more extensive model including HDL-cholesterol and C-peptide performed better in both cohorts (DCS, C 0.74 [95% CI 0.67, 0.81]; GoDARTS, C 0.73 [95% CI 0.69, 0.77]). Two proteins, lactadherin and proto-oncogene tyrosine-protein kinase receptor, were most consistently selected and slightly improved model performance.
CONCLUSIONS/INTERPRETATION Using machine learning approaches, we show that insulin requirement risk can be modestly well predicted by predominantly clinical variables. Inclusion of molecular markers improves the prognostic performance beyond that of clinical variables by up to 5%. Such prognostic models could be useful for identifying people with diabetes at high risk of progressing quickly to treatment intensification.
DATA AVAILABILITY Summary statistics of lipidomic, proteomic and metabolomic data are available from a Shiny dashboard at https://rhapdata-app.vital-it.ch .
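The C statistic used above to evaluate the time-to-insulin models can be illustrated with a minimal Python sketch. This is not the study's code: the function name `harrell_c` and the toy data are assumptions made for illustration, and the sketch simplifies by treating pairs with tied times as not comparable.

```python
def harrell_c(times, events, risk_scores):
    """Harrell's C: share of comparable pairs that the risk score orders correctly."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when subject i has an observed event
            # strictly before subject j's event or censoring time.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0   # earlier event had the higher predicted risk
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5   # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data, invented for illustration: follow-up in years, 1 = insulin required.
times = [2.0, 4.0, 6.0, 8.0]
events = [1, 1, 0, 1]
risk_scores = [0.9, 0.6, 0.7, 0.1]
print(harrell_c(times, events, risk_scores))  # 4 of the 5 comparable pairs are concordant
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, which is the scale on which the reported values of 0.71 to 0.74 should be read.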
2
Penalized regression with multiple sources of prior effects. Bioinformatics 2023; 39:btad680. [PMID: 37951587 PMCID: PMC10699841 DOI: 10.1093/bioinformatics/btad680] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/17/2023] [Revised: 10/19/2023] [Accepted: 11/08/2023] [Indexed: 11/14/2023]
Abstract
MOTIVATION In many high-dimensional prediction or classification tasks, complementary data on the features are available, e.g. prior biological knowledge on (epi)genetic markers. Here we consider tasks with numerical prior information that provide an insight into the importance (weight) and the direction (sign) of the feature effects, e.g. regression coefficients from previous studies. RESULTS We propose an approach for integrating multiple sources of such prior information into penalized regression. If suitable co-data are available, this improves the predictive performance, as shown by simulation and application. AVAILABILITY AND IMPLEMENTATION The proposed method is implemented in the R package transreg (https://github.com/lcsb-bds/transreg, https://cran.r-project.org/package=transreg).
3
ecpc: an R-package for generic co-data models for high-dimensional prediction. BMC Bioinformatics 2023; 24:172. [PMID: 37101151 PMCID: PMC10134536 DOI: 10.1186/s12859-023-05289-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/13/2022] [Accepted: 04/12/2023] [Indexed: 04/28/2023]
Abstract
BACKGROUND High-dimensional prediction considers data with more variables than samples. Generic research goals are to find the best predictor or to select variables. Results may be improved by exploiting prior information in the form of co-data, providing complementary data not on the samples, but on the variables. We consider adaptive ridge penalised generalised linear and Cox models, in which the variable-specific ridge penalties are adapted to the co-data to give a priori more weight to more important variables. The R-package ecpc originally accommodated various and possibly multiple co-data sources, including categorical co-data, i.e. groups of variables, and continuous co-data. Continuous co-data, however, were handled by adaptive discretisation, potentially inefficiently modelling and losing information. As continuous co-data such as external p values or correlations often arise in practice, more generic co-data models are needed. RESULTS Here, we present an extension to the method and software for generic co-data models, particularly for continuous co-data. At the basis lies a classical linear regression model, regressing prior variance weights on the co-data. Co-data variables are then estimated with empirical Bayes moment estimation. After placing the estimation procedure in the classical regression framework, extension to generalised additive and shape constrained co-data models is straightforward. Besides, we show how ridge penalties may be transformed to elastic net penalties. In simulation studies we first compare various co-data models for continuous co-data from the extension to the original method. Secondly, we compare variable selection performance to other variable selection methods. The extension is faster than the original method and shows improved prediction and variable selection performance for non-linear co-data relations. Moreover, we demonstrate use of the package in several genomics examples throughout the paper. 
CONCLUSIONS The R-package ecpc accommodates linear, generalised additive and shape constrained additive co-data models for the purpose of improved high-dimensional prediction and variable selection. The extended version of the package as presented here (version number 3.1.1 and higher) is available on ( https://cran.r-project.org/web/packages/ecpc/ ).
4
Radiological correlates of episodes of acute decline in the leukodystrophy vanishing white matter. Neuroradiology 2023; 65:855-863. [PMID: 36574026 DOI: 10.1007/s00234-022-03097-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/12/2022] [Accepted: 11/25/2022] [Indexed: 12/28/2022]
Abstract
PURPOSE Patients with vanishing white matter (VWM) experience unremitting chronic neurological decline and stress-provoked episodes of rapid, partially reversible decline. Cerebral white matter abnormalities are progressive, without improvement, and are therefore unlikely to be related to the episodes. We determined which radiological findings are related to episodic decline. METHODS MRI scans of VWM patients were retrospectively analyzed. Patients were grouped into A (never episodes) and B (episodes). Signal abnormalities outside the cerebral white matter were rated as absent, mild, or severe. A sum score was developed with abnormalities only seen in group B. The temporal relationship between signal abnormalities and episodes was determined by subdividing scans into those made before, less than 3 months after, and more than 3 months after onset of an episode. RESULTS Five hundred forty-three examinations of 298 patients were analyzed. Mild and severe signal abnormalities in the caudate nucleus, putamen, globus pallidus, thalamus, midbrain, medulla oblongata, and severe signal abnormalities in the pons were only seen in group B. The sum score, constructed with these abnormalities, depended on the timing of the scan (χ2(2, 400) = 22.8; p < .001): it was least often abnormal before, most often abnormal with the highest value shortly after, and lower longer than 3 months after an episode. CONCLUSION In VWM, signal abnormalities in brainstem, thalamus, and basal ganglia are related to episodic decline and can improve. Knowledge of the natural MRI history in VWM is important for clinical interpretation of MRI findings and crucial in therapy trials.
5
Magnetic resonance imaging based radiomics prediction of Human Papillomavirus infection status and overall survival in oropharyngeal squamous cell carcinoma. Oral Oncol 2023; 137:106307. [PMID: 36657208 DOI: 10.1016/j.oraloncology.2023.106307] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/22/2022] [Revised: 12/28/2022] [Accepted: 01/08/2023] [Indexed: 01/19/2023]
Abstract
OBJECTIVES Human papillomavirus- (HPV) positive oropharyngeal squamous cell carcinoma (OPSCC) differs biologically and clinically from HPV-negative OPSCC and has a better prognosis. This study aims to analyze the value of magnetic resonance imaging (MRI)-based radiomics in predicting HPV status in OPSCC and aims to develop a prognostic model in OPSCC including HPV status and MRI-based radiomics. MATERIALS AND METHODS Manual delineation of 249 primary OPSCCs (91 HPV-positive and 159 HPV-negative) on pretreatment native T1-weighted MRIs was performed and used to extract 498 radiomic features per delineation. A logistic regression (LR) and random forest (RF) model were developed using univariate feature selection. Additionally, factor analysis was performed, and the derived factors were combined with clinical data in a predictive model to assess the performance on predicting HPV status. Additionally, factors were combined with clinical parameters in a multivariable survival regression analysis. RESULTS Both feature-based LR and RF models performed with an AUC of 0.79 in prediction of HPV status. Fourteen of the twenty most significant features were similar in both models, mainly concerning tumor sphericity, intensity variation, compactness, and tumor diameter. The model combining clinical data and radiomic factors (AUC = 0.89) outperformed the radiomics-only model in predicting OPSCC HPV status. Overall survival prediction was most accurate using the combination of clinical parameters and radiomic factors (C-index = 0.72). CONCLUSION Predictive models based on MR-radiomic features were able to predict HPV status with sufficient performance, supporting the role of MRI-based radiomics as potential imaging biomarker. Survival prediction improved by combining clinical features with MRI-based radiomics.
6
Fast Marginal Likelihood Estimation of Penalties for Group-Adaptive Elastic Net. J Comput Graph Stat 2022; 32:950-960. [PMID: 38013849 PMCID: PMC10511031 DOI: 10.1080/10618600.2022.2128809] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/29/2021] [Accepted: 09/12/2022] [Indexed: 10/10/2022]
Abstract
Elastic net penalization is widely used in high-dimensional prediction and variable selection settings. Auxiliary information on the variables, for example, groups of variables, is often available. Group-adaptive elastic net penalization exploits this information to potentially improve performance by estimating group penalties, thereby penalizing important groups of variables less than other groups. Estimating these group penalties is, however, hard due to the high dimension of the data. Existing methods are computationally expensive or not generic in the type of response. Here we present a fast method for estimation of group-adaptive elastic net penalties for generalized linear models. We first derive a low-dimensional representation of the Taylor approximation of the marginal likelihood for group-adaptive ridge penalties, to efficiently estimate these penalties. Then we show by using asymptotic normality of the linear predictors that this marginal likelihood approximates that of elastic net models. The ridge group penalties are then transformed to elastic net group penalties by matching the ridge prior variance to the elastic net prior variance as function of the group penalties. The method allows for overlapping groups and unpenalized variables, and is easily extended to other penalties. For a model-based simulation study and two cancer genomics applications we demonstrate a substantially decreased computation time and improved or matching performance compared to other methods. Supplementary materials for this article are available online.
7
Semi-supervised empirical Bayes group-regularized factor regression. Biom J 2022; 64:1289-1306. [PMID: 35730912 PMCID: PMC9796498 DOI: 10.1002/bimj.202100105] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/29/2021] [Revised: 03/16/2022] [Accepted: 03/20/2022] [Indexed: 01/01/2023]
Abstract
The features in a high-dimensional biomedical prediction problem are often well described by low-dimensional latent variables (or factors). We use this to include unlabeled features and additional information on the features when building a prediction model. Such additional feature information is often available in biomedical applications. Examples are annotation of genes, metabolites, or p-values from a previous study. We employ a Bayesian factor regression model that jointly models the features and the outcome using Gaussian latent variables. We fit the model using a computationally efficient variational Bayes method, which scales to high dimensions. We use the extra information to set up a prior model for the features in terms of hyperparameters, which are then estimated through empirical Bayes. The method is demonstrated in simulations and two applications. One application considers influenza vaccine efficacy prediction based on microarray data. The second application predicts oral cancer metastasis from RNAseq data.
8
Estimation of predictive performance in high-dimensional data settings using learning curves. Comput Stat Data Anal 2022. [DOI: 10.1016/j.csda.2022.107622] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/27/2022]
9
Estimation of variance components, heritability and the ridge penalty in high-dimensional generalized linear models. COMMUN STAT-SIMUL C 2022. [DOI: 10.1080/03610918.2019.1646760] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 10/26/2022]
10
Predicting patient response with models trained on cell lines and patient-derived xenografts by nonlinear transfer learning. Proc Natl Acad Sci U S A 2021; 118:e2106682118. [PMID: 34873056 PMCID: PMC8670522 DOI: 10.1073/pnas.2106682118] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Accepted: 10/18/2021] [Indexed: 12/13/2022]
Abstract
Preclinical models have been the workhorse of cancer research, producing massive amounts of drug response data. Unfortunately, translating response biomarkers derived from these datasets to human tumors has proven to be particularly challenging. To address this challenge, we developed TRANSACT, a computational framework that builds a consensus space to capture biological processes common to preclinical models and human tumors and exploits this space to construct drug response predictors that robustly transfer from preclinical models to human tumors. TRANSACT performs favorably compared to four competing approaches, including two deep learning approaches, on a set of 23 drug prediction challenges on The Cancer Genome Atlas and 226 metastatic tumors from the Hartwig Medical Foundation. We demonstrate that response predictions deliver a robust performance for a number of therapies of high clinical importance: platinum-based chemotherapies, gemcitabine, and paclitaxel. In contrast to other approaches, we demonstrate the interpretability of the TRANSACT predictors by correctly identifying known biomarkers of targeted therapies, and we propose potential mechanisms that mediate the resistance to two chemotherapeutic agents.
11
Flexible co-data learning for high-dimensional prediction. Stat Med 2021; 40:5910-5925. [PMID: 34438466 PMCID: PMC9292202 DOI: 10.1002/sim.9162] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 10/09/2020] [Revised: 05/18/2021] [Accepted: 07/29/2021] [Indexed: 02/06/2023]
Abstract
Clinical research often focuses on complex traits in which many variables play a role in mechanisms driving, or curing, diseases. Clinical prediction is hard when data are high-dimensional, but additional information, like domain knowledge and previously published studies, may be helpful to improve predictions. Such complementary data, or co-data, provide information on the covariates, such as genomic location or P-values from external studies. We use multiple and various co-data to define possibly overlapping or hierarchically structured groups of covariates. These are then used to estimate adaptive multi-group ridge penalties for generalized linear and Cox models. Available group-adaptive methods primarily target settings with few groups, and therefore likely overfit for non-informative, correlated or many groups, and do not account for known structure on the group level. To handle these issues, our method combines empirical Bayes estimation of the hyperparameters with an extra level of flexible shrinkage. This renders a uniquely flexible framework as any type of shrinkage can be used on the group level. We describe various types of co-data and propose suitable forms of hypershrinkage. The method is very versatile, as it allows for integration and weighting of multiple co-data sets, inclusion of unpenalized covariates and posterior variable selection. For three cancer genomics applications we demonstrate improvements compared to other models in terms of performance, variable selection stability and validation.
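To make the idea of group-specific ridge penalties concrete, the following Python sketch solves a linear ridge problem in which each covariate inherits the penalty of its co-data group. It is only an illustration under strong assumptions: the penalties are fixed by hand rather than estimated by empirical Bayes with hypershrinkage as in the abstract, only the linear model is covered (not generalized linear or Cox models), and the function name `group_ridge`, the group labels and the simulated data are hypothetical.

```python
import numpy as np


def group_ridge(X, y, groups, group_penalties):
    """Ridge estimate with one penalty per group of covariates.

    Minimises ||y - X b||^2 + sum_j penalty[group of j] * b_j^2 in closed form.
    """
    penalties = np.array([group_penalties[g] for g in groups], dtype=float)
    return np.linalg.solve(X.T @ X + np.diag(penalties), X.T @ y)


# Simulated data, invented for illustration: three informative and three noise covariates.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 6))
beta_true = np.array([2.0, -1.5, 1.0, 0.0, 0.0, 0.0])
y = X @ beta_true + rng.normal(scale=0.5, size=40)

groups = ["informative"] * 3 + ["noise"] * 3
# Hand-picked penalties (an assumption): the group believed to be important is penalised less.
beta_hat = group_ridge(X, y, groups, {"informative": 0.1, "noise": 100.0})
print(np.round(beta_hat, 2))
```

A small penalty leaves the coefficients of a group close to their least-squares values, while a large penalty shrinks them towards zero; the method described above learns these penalties from the co-data instead of fixing them.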
12
Bayesian log-normal deconvolution for enhanced in silico microdissection of bulk gene expression data. Nat Commun 2021; 12:6106. [PMID: 34671028 PMCID: PMC8528834 DOI: 10.1038/s41467-021-26328-2] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 12/07/2020] [Accepted: 09/27/2021] [Indexed: 01/29/2023]
Abstract
Deconvolution of bulk gene expression profiles into the cellular components is pivotal to portraying tissue's complex cellular make-up, such as the tumor microenvironment. However, the inherently variable nature of gene expression requires a comprehensive statistical model and reliable prior knowledge of individual cell types that can be obtained from single-cell RNA sequencing. We introduce BLADE (Bayesian Log-normAl Deconvolution), a unified Bayesian framework to estimate both cellular composition and gene expression profiles for each cell type. Unlike previous comprehensive statistical approaches, BLADE can handle > 20 types of cells due to the efficient variational inference. Throughout an intensive evaluation with > 700 simulated and real datasets, BLADE demonstrated enhanced robustness against gene expression variability and better completeness than conventional methods, in particular, to reconstruct gene expression profiles of each cell type. In summary, BLADE is a powerful tool to unravel heterogeneous cellular activity in complex biological systems from standard bulk gene expression data.
13
Abstract
High levels of methylated DNA in urine represent an emerging biomarker for non-small cell lung cancer (NSCLC) detection and are the subject of ongoing research. This study aimed to investigate the circadian variation of urinary cell-free DNA (cfDNA) abundance and methylation levels of cancer-associated genes in NSCLC patients. In this prospective study of 23 metastatic NSCLC patients with active disease, patients were asked to collect six urine samples during the morning, afternoon, and evening of two subsequent days. Urinary cfDNA concentrations and methylation levels of CDO1, SOX17, and TAC1 were measured at each time point. Circadian variation and between- and within-subject variability were assessed using linear mixed models. Variability was estimated using the Intraclass Correlation Coefficient (ICC), representing reproducibility. No clear circadian patterns could be recognized for cfDNA concentrations or methylation levels across the different sampling time points. Significantly lower cfDNA concentrations were found in males (p=0.034). For cfDNA levels, the between- and within-subject variability were comparable, rendering an ICC of 0.49. For the methylation markers, ICCs varied considerably, ranging from 0.14 to 0.74. Test reproducibility could be improved by collecting multiple samples per patient. In conclusion, there is no preferred collection time for NSCLC detection in urine using methylation markers, but single measurements should be interpreted carefully, and serial sampling may increase test performance. This study contributes to the limited understanding of cfDNA dynamics in urine and the continued interest in urine-based liquid biopsies for cancer diagnostics.
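The ICC reported above is the share of total variance that lies between subjects, ICC = var_between / (var_between + var_within), so comparable between- and within-subject variability gives a value near 0.5. The study estimated the variance components with linear mixed models; the Python sketch below swaps in the simpler one-way ANOVA estimator for a balanced design, and the function name `icc_oneway` and the toy numbers are assumptions for illustration only.

```python
def icc_oneway(measurements):
    """One-way ANOVA ICC for a balanced design: one row per subject, k repeats each."""
    n = len(measurements)
    k = len(measurements[0])
    if n < 2 or k < 2 or any(len(row) != k for row in measurements):
        raise ValueError("need a balanced design with at least 2 subjects and 2 repeats")
    subject_means = [sum(row) / k for row in measurements]
    grand_mean = sum(subject_means) / n
    # Mean squares between subjects and within subjects.
    ms_between = k * sum((m - grand_mean) ** 2 for m in subject_means) / (n - 1)
    ms_within = sum(
        (x - m) ** 2 for row, m in zip(measurements, subject_means) for x in row
    ) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


# Toy data, invented for illustration: three subjects, two samples each.
samples = [[1.0, 3.0], [5.0, 7.0], [9.0, 11.0]]
print(icc_oneway(samples))  # (32 - 2) / (32 + 2), about 0.88
```

With a low ICC, averaging several samples per subject reduces the within-subject contribution, which is the rationale for the serial sampling suggested in the abstract.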
14
A panel of DNA methylation markers for the classification of consensus molecular subtypes 2 and 3 in patients with colorectal cancer. Mol Oncol 2021; 15:3348-3362. [PMID: 34510716 PMCID: PMC8637568 DOI: 10.1002/1878-0261.13098] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 02/09/2021] [Revised: 08/04/2021] [Accepted: 09/09/2021] [Indexed: 12/25/2022]
Abstract
Consensus molecular subtypes (CMSs) can guide precision treatment of colorectal cancer (CRC). We aim to identify methylation markers to distinguish between CMS2 and CMS3 in patients with CRC, for which an easy test is currently lacking. To this aim, fresh‐frozen tumor tissue of 239 patients with stage I‐III CRC was analyzed. Methylation profiles were obtained using the Infinium HumanMethylation450 BeadChip. We performed adaptive group‐regularized logistic ridge regression with post hoc group‐weighted elastic net marker selection to build prediction models for classification of CMS2 and CMS3. The Cancer Genome Atlas (TCGA) data were used for validation. Group regularization of the probes was done based on their location either relative to a CpG island or relative to a gene present in the CMS classifier, resulting in two different prediction models and subsequently different marker panels. For both panels, even when using only five markers, accuracies were > 90% in our cohort and in the TCGA validation set. Our methylation marker panel accurately distinguishes between CMS2 and CMS3. This enables development of a targeted assay to provide a robust and clinically relevant classification tool for CRC patients.
15
Predictive and interpretable models via the stacked elastic net. Bioinformatics 2021; 37:2012-2016. [PMID: 32437519 PMCID: PMC8336997 DOI: 10.1093/bioinformatics/btaa535] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 12/10/2019] [Revised: 04/30/2020] [Accepted: 05/18/2020] [Indexed: 12/18/2022]
Abstract
Motivation Machine learning in the biomedical sciences should ideally provide predictive and interpretable models. When predicting outcomes from clinical or molecular features, applied researchers often want to know which features have effects, whether these effects are positive or negative and how strong these effects are. Regression analysis includes this information in the coefficients but typically renders less predictive models than more advanced machine learning techniques.
Results Here, we propose an interpretable meta-learning approach for high-dimensional regression. The elastic net provides a compromise between estimating weak effects for many features and strong effects for some features. It has a mixing parameter to weight between ridge and lasso regularization. Instead of selecting one weighting by tuning, we combine multiple weightings by stacking. We do this in a way that increases predictivity without sacrificing interpretability.
Availability and implementation The R package starnet is available on GitHub (https://github.com/rauschenberger/starnet) and CRAN (https://CRAN.R-project.org/package=starnet).
16
P-79 Molecular Characterization of Locally Relapsed Head and Neck Cancer after Concomitant Chemoradiotherapy. Oral Oncol 2021. [DOI: 10.1016/s1368-8375(21)00366-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/21/2022]
17
A panel of DNA methylation markers for the classification of consensus molecular subtypes 2 and 3 in patients with colorectal cancer. J Clin Oncol 2021. [DOI: 10.1200/jco.2021.39.15_suppl.3545] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/20/2022]
Abstract
3545 Background: Consensus molecular subtypes (CMSs) can guide precision treatment of colorectal cancer (CRC). Currently available assays can identify CMS1 and CMS4 cases well, while a dedicated test to distinguish CMS2 and CMS3 is lacking. This study aimed to identify a panel of methylation markers to distinguish between CMS2 and CMS3 in patients with CRC. Methods: Fresh-frozen tumor tissue of 239 patients with stage I-III CRC was included. CMS classification was performed on RNA-seq data using the single-sample-prediction parameter from the “CMSclassifier” package. Methylation profiles were obtained using the Infinium HumanMethylation450 BeadChip. We performed adaptive group-regularised logistic ridge regression with post hoc group-weighted elastic net marker selection to build prediction models for classification of CMS2 and CMS3 based on 15, 10 or 5 markers. Data from TCGA were used for validation. Results: Overall methylation profiles differed between CMS2 and CMS3. Group regularisation of the probes was done based on their location either relative to a CpG island or relative to a gene present in the CMS classifier, resulting in two different prediction models and subsequently different marker panels. For both panels, even when using only 5 markers, sensitivity, specificity, and accuracy were > 90%. Validation showed comparable performances. Conclusions: Our highly sensitive and specific methylation marker panel can be used to distinguish CMS2 and CMS3. This enables development of a qPCR DNA methylation assay in patients with CRC to provide a specific and non-invasive classification tool.
18
19
Drug sensitivity prediction with normal inverse Gaussian shrinkage informed by external data. Biom J 2020; 63:289-304. [PMID: 33155717 PMCID: PMC7891636 DOI: 10.1002/bimj.201900371] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 11/30/2019] [Revised: 04/30/2020] [Accepted: 06/03/2020] [Indexed: 11/09/2022]
Abstract
In precision medicine, a common problem is drug sensitivity prediction from cancer tissue cell lines. These types of problems entail modelling multivariate drug responses on high-dimensional molecular feature sets in typically >1000 cell lines. The dimensions of the problem require specialised models and estimation methods. In addition, external information on both the drugs and the features is often available. We propose to model the drug responses through a linear regression with shrinkage enforced through a normal inverse Gaussian prior. We let the prior depend on the external information, and estimate the model and external information dependence in an empirical-variational Bayes framework. We demonstrate the usefulness of this model in both a simulated setting and in the publicly available Genomics of Drug Sensitivity in Cancer data.
20
PRECISE: a domain adaptation approach to transfer predictors of drug response from pre-clinical models to tumors. Bioinformatics 2020; 35:i510-i519. [PMID: 31510654 PMCID: PMC6612899 DOI: 10.1093/bioinformatics/btz372] [Citation(s) in RCA: 34] [Impact Index Per Article: 8.5] [Indexed: 11/25/2022]
Abstract
Motivation Cell lines and patient-derived xenografts (PDXs) have been used extensively to understand the molecular underpinnings of cancer. While core biological processes are typically conserved, these models also show important differences compared to human tumors, hampering the translation of findings from pre-clinical models to the human setting. In particular, employing drug response predictors generated on data derived from pre-clinical models to predict patient response remains a challenging task. As very large drug response datasets have been collected for pre-clinical models, and patient drug response data are often lacking, there is an urgent need for methods that efficiently transfer drug response predictors from pre-clinical models to the human setting. Results We show that cell lines and PDXs share common characteristics and processes with human tumors. We quantify this similarity and show that a regression model cannot simply be trained on cell lines or PDXs and then applied on tumors. We developed PRECISE, a novel methodology based on domain adaptation that captures the common information shared amongst pre-clinical models and human tumors in a consensus representation. Employing this representation, we train predictors of drug response on pre-clinical data and apply these predictors to stratify human tumors. We show that the resulting domain-invariant predictors show a small reduction in predictive performance in the pre-clinical domain but, importantly, reliably recover known associations between independent biomarkers and their companion drugs on human tumors. Availability and implementation PRECISE and the scripts for running our experiments are available on our GitHub page (https://github.com/NKI-CCB/PRECISE). Supplementary information Supplementary data are available at Bioinformatics online.
|
21
|
Proteins in stool as biomarkers for non-invasive detection of colorectal adenomas with high risk of progression. J Pathol 2020; 250:288-298. [PMID: 31784980 PMCID: PMC7065084 DOI: 10.1002/path.5369] [Citation(s) in RCA: 30] [Impact Index Per Article: 7.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/23/2019] [Revised: 10/07/2019] [Accepted: 11/28/2019] [Indexed: 12/15/2022]
Abstract
Screening to detect colorectal cancer (CRC) in an early or premalignant state is an effective method to reduce CRC mortality rates. Current stool-based screening tests, e.g. fecal immunochemical test (FIT), have a suboptimal sensitivity for colorectal adenomas and difficulty distinguishing adenomas at high risk of progressing to cancer from those at lower risk. We aimed to identify stool protein biomarker panels that can be used for the early detection of high-risk adenomas and CRC. Proteomics data (LC-MS/MS) were collected on stool samples from adenoma (n = 71) and CRC patients (n = 81) as well as controls (n = 129). Colorectal adenoma tissue samples were characterized by low-coverage whole-genome sequencing to determine their risk of progression based on specific DNA copy number changes. Proteomics data were used for logistic regression modeling to establish protein biomarker panels. In total, 15 of the adenomas (15.8%) were defined as high risk of progressing to cancer. A protein panel, consisting of haptoglobin (Hp), LAMP1, SYNE2, and ANXA6, was identified for the detection of high-risk adenomas (sensitivity of 53% at specificity of 95%). Two panels, one consisting of Hp and LRG1 and one of Hp, LRG1, RBP4, and FN1, were identified for high-risk adenomas and CRCs detection (sensitivity of 66% and 62%, respectively, at specificity of 95%). Validation of Hp as a biomarker for high-risk adenomas and CRCs was performed using an antibody-based assay in FIT samples from a subset of individuals from the discovery series (n = 158) and an independent validation series (n = 795). Hp protein was significantly more abundant in high-risk adenoma FIT samples compared to controls in the discovery (p = 0.036) and the validation series (p = 9e-5). We conclude that Hp, LAMP1, SYNE2, LRG1, RBP4, FN1, and ANXA6 may be of value as stool biomarkers for early detection of high-risk adenomas and CRCs. © 2019 Authors. 
Journal of Pathology published by John Wiley & Sons Ltd on behalf of Pathological Society of Great Britain and Ireland.
|
22
|
Molecular Characterization of Locally Relapsed Head and Neck Cancer after Concomitant Chemoradiotherapy. Clin Cancer Res 2019; 25:7256-7265. [DOI: 10.1158/1078-0432.ccr-19-0628] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/22/2019] [Revised: 04/23/2019] [Accepted: 08/02/2019] [Indexed: 11/16/2022]
|
23
|
P2-269: GERIATRIC DEPRESSION SCALE ITEM-LEVEL ANALYSIS IN RELATION TO IN VIVO CORTICAL AMYLOID AND CEREBRAL REGIONAL TAU IN CLINICALLY NORMAL OLDER ADULTS: FINDINGS FROM THE HARVARD AGING BRAIN STUDY. Alzheimers Dement 2019. [DOI: 10.1016/j.jalz.2019.06.2676] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/27/2022]
|
24
|
IC-P-037: GERIATRIC DEPRESSION SCALE ITEM-LEVEL ANALYSIS IN RELATION TO IN VIVO CORTICAL AMYLOID AND CEREBRAL REGIONAL TAU IN CLINICALLY NORMAL OLDER ADULTS: FINDINGS FROM THE HARVARD AGING BRAIN STUDY. Alzheimers Dement 2019. [DOI: 10.1016/j.jalz.2019.06.4199] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/15/2022]
|
25
|
RNA-Seq in 296 phased trios provides a high-resolution map of genomic imprinting. BMC Biol 2019; 17:50. [PMID: 31234833 PMCID: PMC6589892 DOI: 10.1186/s12915-019-0674-0] [Citation(s) in RCA: 21] [Impact Index Per Article: 4.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/09/2018] [Accepted: 06/07/2019] [Indexed: 01/21/2023] Open
Abstract
Background Identification of imprinted genes, demonstrating a consistent preference towards the paternal or maternal allelic expression, is important for the understanding of gene expression regulation during embryonic development and of the molecular basis of developmental disorders with a parent-of-origin effect. Combining allelic analysis of RNA-Seq data with phased genotypes in family trios provides a powerful method to detect parent-of-origin biases in gene expression. Results We report findings in 296 family trios from two large studies: 165 lymphoblastoid cell lines from the 1000 Genomes Project and 131 blood samples from the Genome of the Netherlands (GoNL) participants. Based on parental haplotypes, we identified > 2.8 million transcribed heterozygous SNVs phased for parental origin and developed a robust statistical framework for measuring allelic expression. We identified a total of 45 imprinted genes and one imprinted unannotated transcript, including multiple imprinted transcripts showing incomplete parental expression bias that were located adjacent to strongly imprinted genes. For example, PXDC1, a gene which lies adjacent to the paternally expressed gene FAM50B, shows a 2:1 paternal expression bias. Other imprinted genes had promoter regions that coincide with sites of parentally biased DNA methylation identified in the blood from uniparental disomy (UPD) samples, thus providing independent validation of our results. Using the stranded nature of the RNA-Seq data in lymphoblastoid cell lines, we identified multiple loci with overlapping sense/antisense transcripts, of which one is expressed paternally and the other maternally. Using a sliding window approach, we searched for imprinted expression across the entire genome, identifying a novel imprinted putative lncRNA in 13q21.2. Overall, we identified 7 transcripts showing parental bias in gene expression which were not reported in 4 other recent RNA-Seq studies of imprinting.
Conclusions Our methods and data provide a robust and high-resolution map of imprinted gene expression in the human genome. Electronic supplementary material The online version of this article (10.1186/s12915-019-0674-0) contains supplementary material, which is available to authorized users.
|
26
|
Learning from a lot: Empirical Bayes for high-dimensional model-based prediction. Scand Stat Theory Appl 2019; 46:2-25. [PMID: 31007342 PMCID: PMC6472625 DOI: 10.1111/sjos.12335] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/09/2017] [Revised: 01/24/2018] [Accepted: 03/22/2018] [Indexed: 12/21/2022]
Abstract
Empirical Bayes is a versatile approach to "learn from a lot" in two ways: first, from a large number of variables and, second, from a potentially large amount of prior information, for example, stored in public repositories. We review applications of a variety of empirical Bayes methods to several well-known model-based prediction methods, including penalized regression, linear discriminant analysis, and Bayesian models with sparse or dense priors. We discuss "formal" empirical Bayes methods that maximize the marginal likelihood but also more informal approaches based on other data summaries. We contrast empirical Bayes to cross-validation and full Bayes and discuss hybrid approaches. To study the relation between the quality of an empirical Bayes estimator and p, the number of variables, we consider a simple empirical Bayes estimator in a linear model setting. We argue that empirical Bayes is particularly useful when the prior contains multiple parameters, which model a priori information on variables termed "co-data". In particular, we present two novel examples that allow for co-data: first, a Bayesian spike-and-slab setting that facilitates inclusion of multiple co-data sources and types and, second, a hybrid empirical Bayes-full Bayes ridge regression approach for estimation of the posterior predictive interval.
|
27
|
Genome-wide microRNA analysis of HPV-positive self-samples yields novel triage markers for early detection of cervical cancer. Int J Cancer 2018; 144:372-379. [PMID: 30192375 PMCID: PMC6518875 DOI: 10.1002/ijc.31855] [Citation(s) in RCA: 22] [Impact Index Per Article: 3.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/17/2018] [Revised: 07/20/2018] [Accepted: 07/30/2018] [Indexed: 02/06/2023]
Abstract
Offering self‐sampling for HPV testing improves the effectiveness of current cervical screening programs by increasing population coverage. Molecular markers directly applicable on self‐samples are needed to stratify HPV‐positive women at risk of cervical cancer (so‐called triage) and to avoid over‐referral and overtreatment. Deregulated microRNAs (miRNAs) have been implicated in the development of cervical cancer, and represent potential triage markers. However, it is unknown whether deregulated miRNA expression is reflected in self‐samples. Our study is the first to establish genome‐wide miRNA profiles in HPV‐positive self‐samples to identify miRNAs that can predict the presence of CIN3 and cervical cancer in self‐samples. Small RNA sequencing (sRNA‐Seq) was conducted to determine genome‐wide miRNA expression profiles in 74 HPV‐positive self‐samples of women with and without cervical precancer (CIN3). The optimal miRNA marker panel for CIN3 detection was determined by GRridge, a penalized method on logistic regression. Six miRNAs were validated by qPCR in 191 independent HPV‐positive self‐samples. Classification of sRNA‐Seq data yielded a 9‐miRNA marker panel with a combined area under the curve (AUC) of 0.89 for CIN3 detection. Validation by qPCR resulted in a combined AUC of 0.78 for CIN3+ detection. Our study shows that deregulated miRNA expression associated with CIN3 and cervical cancer development can be detected by sRNA‐Seq in HPV‐positive self‐samples. Validation by qPCR indicates that miRNA expression analysis offers a promising novel molecular triage strategy for CIN3 and cervical cancer detection applicable to self‐samples. What's new? MicroRNAs (miRNAs) are suspected of playing a role in cervical cancer development. They are also potential markers for the identification of human papillomavirus (HPV)‐infected women who are at risk of cervical cancer. 
Here, using small RNA sequencing of HPV‐positive self‐samples from women with and without cervical precancer (CIN3), the authors identify a miRNA signature consisting of multiple miRNAs that is strongly predictive of CIN3. Validation of this signature by qPCR revealed a good clinical performance for CIN3+ detection. The findings suggest that miRNA analysis is an effective means of CIN3+ prediction in HPV‐positive self‐samples obtained for cervical cancer screening.
|
28
|
Combination of a six microRNA expression profile with four clinicopathological factors for response prediction of systemic treatment in patients with advanced colorectal cancer. PLoS One 2018; 13:e0201809. [PMID: 30075027 PMCID: PMC6075783 DOI: 10.1371/journal.pone.0201809] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/05/2018] [Accepted: 07/22/2018] [Indexed: 12/19/2022] Open
Abstract
Background First line chemotherapy is effective in 75 to 80% of patients with metastatic colorectal cancer (mCRC). We studied whether microRNA (miR) expression profiles can predict treatment outcome for first line fluoropyrimidine containing systemic therapy in patients with mCRC. Methods MiR expression levels were determined by next generation sequencing from snap frozen tumor samples of 88 patients with mCRC. Predictive miRs were selected with penalized logistic regression and posterior forward selection. The prediction co-efficients of the miRs were re-estimated and validated by real-time quantitative PCR in an independent cohort of 81 patients with mCRC. Results Expression levels of miR-17-5p, miR-20a-5p, miR-30a-5p, miR-92a-3p, miR-92b-3p and miR-98-5p in combination with age, tumor differentiation, adjuvant therapy and type of systemic treatment, were predictive for clinical benefit in the training cohort with an AUC of 0.78. In the validation cohort the addition of the six miR signature to the four clinicopathological factors demonstrated a significantly increased AUC for predicting treatment response versus those with stable disease (SD) from 0.79 to 0.90. The increase for predicting treatment response versus progressive disease (PD) and for patients with SD versus those with PD was not significant in the validation cohort. MiR-17-5p, miR-20a-5p and miR-92a-3p were significantly upregulated in patients with treatment response in both the training and validation cohorts. Conclusion A six miR expression signature was identified that predicted treatment response to fluoropyrimidine containing first line systemic treatment in patients with mCRC when combined with four clinicopathological factors. Independent validation demonstrated added predictive value of this miR-signature for predicting treatment response versus SD. However, added predictive value for separating patients with PD could not be validated. The clinical relevance of the identified miRs for predicting treatment response has to be further explored.
|
29
|
A Strategy to Find Suitable Reference Genes for miRNA Quantitative PCR Analysis and Its Application to Cervical Specimens. J Mol Diagn 2018; 19:625-637. [PMID: 28826607 DOI: 10.1016/j.jmoldx.2017.04.010] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/26/2017] [Revised: 04/11/2017] [Accepted: 04/27/2017] [Indexed: 12/21/2022] Open
Abstract
miRNAs represent an emerging class of promising biomarkers for cancer diagnostics. To perform reliable miRNA expression analysis using quantitative PCR, adequate data normalization is essential to remove nonbiological, technical variations. Ideal reference genes should be biologically stable and reduce technical variability of miRNA expression analysis. Herein is a new strategy for the identification and evaluation of reference genes that can be applied for miRNA-based diagnostic tests without entailing excessive additional experiments. We analyzed the expression of 11 carefully selected candidate reference genes in different types of cervical specimens [ie, tissues, scrapes, and self-collected cervicovaginal specimens (self-samples)]. To identify the biologically most stable reference genes, three commonly used algorithms (GeNorm, NormFinder, and BestKeeper) were combined. Signal-to-noise ratios and P values between control and disease groups were calculated to validate the reduction in technical variability on expression analysis of two marker miRNAs. miR-423 was identified as a suitable reference gene for all sample types, to be used in combination with RNU24 in cervical tissues, RNU43 in scrapes, and miR-30b in self-samples. These findings demonstrate that the choice of reference genes may differ between different types of specimens, even when originating from the same anatomical source. More important, it is shown that adequate normalization increases the signal-to-noise ratio, which is not observed when normalizing to commonly used reference genes.
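As an illustration of the normalization step described in this abstract, the following minimal sketch computes a delta-Cq value for a marker miRNA against a pair of reference genes. It assumes that the reference level is the arithmetic mean of the reference Cq values, which the abstract does not state; the function name and all Cq numbers are hypothetical.

```python
from statistics import mean


def delta_cq(target_cq, reference_cqs):
    """Normalize a target Cq by subtracting the mean Cq of the reference genes.

    Assumption: references are combined by their arithmetic mean.
    """
    if not reference_cqs:
        raise ValueError("need at least one reference Cq value")
    return target_cq - mean(reference_cqs)


# Made-up Cq values for one cervical tissue sample (illustration only).
sample = {"marker_miRNA": 27.5, "miR-423": 24.0, "RNU24": 22.0}
normalized = delta_cq(sample["marker_miRNA"], [sample["miR-423"], sample["RNU24"]])
print(normalized)
```

For scrapes or self-samples, the second reference would be swapped for RNU43 or miR-30b respectively, following the pairing reported in the abstract.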
|
30
|
Identification and Validation of a 3-Gene Methylation Classifier for HPV-Based Cervical Screening on Self-Samples. Clin Cancer Res 2018; 24:3456-3464. [PMID: 29632006 DOI: 10.1158/1078-0432.ccr-17-3615] [Citation(s) in RCA: 47] [Impact Index Per Article: 7.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/04/2017] [Revised: 02/22/2018] [Accepted: 04/02/2018] [Indexed: 01/09/2023]
Abstract
Purpose: Offering self-sampling of cervico-vaginal material for high-risk human papillomavirus (hrHPV) testing is an effective method to increase the coverage in cervical screening programs. Molecular triage directly on hrHPV-positive self-samples for colposcopy referral opens the way to full molecular cervical screening. Here, we set out to identify a DNA methylation classifier for detection of cervical precancer (CIN3) and cancer, applicable to lavage and brush self-samples. Experimental Design: We determined genome-wide DNA methylation profiles of 72 hrHPV-positive self-samples, using the Infinium Methylation 450K Array. The selected DNA methylation markers were evaluated by multiplex quantitative methylation-specific PCR (qMSP) in both hrHPV-positive lavage (n = 245) and brush (n = 246) self-samples from screening cohorts. Subsequently, logistic regression analysis was performed to build a DNA methylation classifier for CIN3 detection applicable to self-samples of both devices. For validation, an independent set of hrHPV-positive lavage (n = 199) and brush (n = 287) self-samples was analyzed. Results: Genome-wide DNA methylation profiling revealed 12 DNA methylation markers for CIN3 detection. Multiplex qMSP analysis of these markers in large series of lavage and brush self-samples yielded a 3-gene methylation classifier (ASCL1, LHX8, and ST6GALNAC5). This classifier showed a very good clinical performance for CIN3 detection in both lavage (AUC = 0.88; sensitivity = 74%; specificity = 79%) and brush (AUC = 0.90; sensitivity = 88%; specificity = 81%) self-samples in the validation set. Importantly, all self-samples from women with cervical cancer scored DNA methylation-positive. Conclusions: By genome-wide DNA methylation profiling on self-samples, we identified a highly effective 3-gene methylation classifier for direct triage on hrHPV-positive self-samples, which is superior to currently available methods. Clin Cancer Res; 24(14); 3456-64. ©2018 AACR.
|
31
|
Genomic profiling of stage II and III colon cancers reveals APC mutations to be associated with survival in stage III colon cancer patients. Oncotarget 2018; 7:73876-73887. [PMID: 27729614 PMCID: PMC5342020 DOI: 10.18632/oncotarget.12510] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/06/2016] [Accepted: 10/01/2016] [Indexed: 01/05/2023] Open
Abstract
Tumor profiling of DNA alterations, i.e. gene point mutations, somatic copy number aberrations (CNAs) and structural variants (SVs), improves insight into the molecular pathology of cancer and clinical outcome. Here, associations between genomic aberrations and disease recurrence in stage II and III colon cancers were investigated. A series of 114 stage II and III microsatellite stable colon cancer samples were analyzed by high-resolution array-comparative genomic hybridization (array-CGH) to detect CNAs and CNA-associated chromosomal breakpoints (SVs). For 60 of these samples mutation status of APC, TP53, KRAS, PIK3CA, FBXW7, SMAD4, BRAF and NRAS was determined using targeted massive parallel sequencing. Loss of chromosome 18q12.1-18q12.2 occurred more frequently in tumors that relapsed than in relapse-free tumors (p < 0.001; FDR = 0.13). In total, 267 genes were recurrently affected by SVs (FDR < 0.1). CNAs and SVs were not associated with disease-free survival (DFS). Mutations in APC and TP53 were associated with increased CNAs. APC mutations were associated with poor prognosis in (5-fluorouracil treated) stage III colon cancers (p = 0.005; HR = 4.1), an effect that was further enhanced by mutations in MAPK pathway (KRAS, NRAS, BRAF) genes. We conclude that among multiple genomic alterations in CRC, strongest associations with clinical outcome were observed for common mutations in APC.
|
32
|
Improved high-dimensional prediction with Random Forests by the use of co-data. BMC Bioinformatics 2017; 18:584. [PMID: 29281963 PMCID: PMC5745983 DOI: 10.1186/s12859-017-1993-1] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/03/2017] [Accepted: 12/06/2017] [Indexed: 12/13/2022] Open
Abstract
Background Prediction in high dimensional settings is difficult due to the large number of variables relative to the sample size. We demonstrate how auxiliary ‘co-data’ can be used to improve the performance of a Random Forest in such a setting. Results Co-data are incorporated in the Random Forest by replacing the uniform sampling probabilities that are used to draw candidate variables by co-data moderated sampling probabilities. Co-data here are defined as any type of information that is available on the variables of the primary data, but does not use its response labels. These moderated sampling probabilities are, inspired by empirical Bayes, learned from the data at hand. We demonstrate the co-data moderated Random Forest (CoRF) with two examples. In the first example we aim to predict the presence of a lymph node metastasis with gene expression data. We demonstrate how a set of external p-values, a gene signature, and the correlation between gene expression and DNA copy number can improve the predictive performance. In the second example we demonstrate how the prediction of cervical (pre-)cancer with methylation data can be improved by including the location of the probe relative to the known CpG islands, the number of CpG sites targeted by a probe, and a set of p-values from a related study. Conclusion The proposed method is able to utilize auxiliary co-data to improve the performance of a Random Forest. Electronic supplementary material The online version of this article (doi:10.1186/s12859-017-1993-1) contains supplementary material, which is available to authorized users.
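The core idea of this abstract, replacing uniform sampling of candidate variables by co-data moderated sampling, can be sketched as follows. This is a minimal illustration and not the CoRF implementation: the weights are simply given here, whereas the abstract states that they are learned from the data; the function name `draw_candidates` and all numbers are hypothetical.

```python
import random


def draw_candidates(weights, mtry, rng):
    """Draw mtry distinct variable indices, each draw proportional to weights."""
    remaining = list(range(len(weights)))
    chosen = []
    for _ in range(mtry):
        current = [weights[j] for j in remaining]
        pick = rng.choices(remaining, weights=current, k=1)[0]
        chosen.append(pick)
        remaining.remove(pick)  # sample without replacement
    return chosen


rng = random.Random(0)
# Hypothetical moderated weights: variables 0-1 are favoured by the co-data,
# variables 4-5 are never offered as split candidates.
moderated = [4.0, 4.0, 1.0, 1.0, 0.0, 0.0]
candidates = draw_candidates(moderated, mtry=3, rng=rng)
print(candidates)
```

Setting all weights equal recovers the uniform candidate sampling of a standard Random Forest, so the moderated version can be read as a generalization rather than a different algorithm.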
|
33
|
Blood-based metabolic signatures in Alzheimer's disease. ALZHEIMER'S & DEMENTIA: DIAGNOSIS, ASSESSMENT & DISEASE MONITORING 2017; 8:196-207. [PMID: 28951883 PMCID: PMC5607205 DOI: 10.1016/j.dadm.2017.07.006] [Citation(s) in RCA: 43] [Impact Index Per Article: 6.1] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 12/24/2022]
Abstract
Introduction Identification of blood-based metabolic changes might provide early and easy-to-obtain biomarkers. Methods We included 127 Alzheimer's disease (AD) patients and 121 control subjects with cerebrospinal fluid biomarker-confirmed diagnosis (cutoff tau/amyloid β peptide 42: 0.52). Mass spectrometry platforms determined the concentrations of 53 amine compounds, 22 organic acid compounds, 120 lipid compounds, and 40 oxidative stress compounds. Multiple signatures were assessed: differential expression (nested linear models), classification (logistic regression), and regulatory (network extraction). Results Twenty-six metabolites were differentially expressed. Metabolites improved the classification performance of clinical variables from 74% to 79%. Network models identified five hubs of metabolic dysregulation: tyrosine, glycylglycine, glutamine, lysophosphatic acid C18:2, and platelet-activating factor C16:0. The metabolite network for apolipoprotein E (APOE) ε4 negative AD patients was less cohesive compared with the network for APOE ε4 positive AD patients. Discussion Multiple signatures point to various promising peripheral markers for further validation. The network differences in AD patients according to APOE genotype may reflect different pathways to AD. Multiple metabolic signatures point to peripheral AD markers for future validation. AD may be described by changes in the metabolism of amines and oxidative stressors. APOE ε4-driven AD and non-APOE ε4-driven AD represent different biochemical pathways. Network analyses of metabolomics data enable the study of metabolic changes in AD.
|
34
|
Methods for significance testing of categorical covariates in logistic regression models after multiple imputation: power and applicability analysis. BMC Med Res Methodol 2017; 17:129. [PMID: 28830466 PMCID: PMC5568368 DOI: 10.1186/s12874-017-0404-7] [Citation(s) in RCA: 50] [Impact Index Per Article: 7.1] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/03/2017] [Accepted: 08/02/2017] [Indexed: 11/20/2022] Open
Abstract
Background Multiple imputation is a recommended method to handle missing data. For significance testing after multiple imputation, Rubin’s Rules (RR) are easily applied to pool parameter estimates. In a logistic regression model, to consider whether a categorical covariate with more than two levels significantly contributes to the model, different methods are available, for example pooling chi-square tests with multiple degrees of freedom, pooling likelihood ratio test statistics, and pooling based on the covariance matrix of the regression model. These methods are more complex than RR and are not available in all mainstream statistical software packages. In addition, they do not always obtain optimal power levels. We argue that the median of the p-values from the overall significance tests from the analyses on the imputed datasets can be used as an alternative pooling rule for categorical variables. The aim of the current study is to compare different methods to test a categorical variable for significance after multiple imputation on applicability and power. Methods In a large simulation study, we demonstrated the control of the type I error and power levels of different pooling methods for categorical variables. Results This simulation study showed that for non-significant categorical covariates the type I error is controlled and the statistical power of the median pooling rule was at least equal to current multiple parameter tests. An empirical data example showed similar results. Conclusions It can therefore be concluded that using the median of the p-values from the imputed data analyses is an attractive and easy to use alternative method for significance testing of categorical variables. Electronic supplementary material The online version of this article (doi:10.1186/s12874-017-0404-7) contains supplementary material, which is available to authorized users.
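The median pooling rule described in this abstract is simple enough to sketch in a few lines. The example below assumes that the overall significance test for the categorical covariate has already been run on each imputed dataset; the function name `median_p_rule` and the p-values are hypothetical and serve only as illustration.

```python
from statistics import median


def median_p_rule(p_values):
    """Pool overall-test p-values from the imputed datasets by their median."""
    if not p_values:
        raise ValueError("need at least one p-value")
    if any(not 0.0 <= p <= 1.0 for p in p_values):
        raise ValueError("p-values must lie in [0, 1]")
    return median(p_values)


# Made-up p-values for one categorical covariate across five imputed datasets.
p_imputed = [0.03, 0.08, 0.04, 0.06, 0.02]
pooled_p = median_p_rule(p_imputed)
print(pooled_p)
```

Unlike the multi-parameter pooling methods listed in the abstract, this rule needs only the per-dataset p-values, which is what makes it usable in software that lacks those methods.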
|
35
|
Prognostic modeling of oral cancer by gene profiles and clinicopathological co-variables. Oncotarget 2017; 8:59312-59323. [PMID: 28938638 PMCID: PMC5601734 DOI: 10.18632/oncotarget.19576] [Citation(s) in RCA: 21] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/21/2017] [Accepted: 06/12/2017] [Indexed: 12/17/2022] Open
Abstract
Accurate staging and outcome prediction is a major problem in clinical management of oral cancer patients, hampering high precision treatment and adjuvant therapy planning. Here, we have built and validated multivariable models that integrate gene signatures with clinical and pathological variables to improve staging and survival prediction of patients with oral squamous cell carcinoma (OSCC). Gene expression profiles from 249 human papillomavirus (HPV)-negative OSCCs were explored to identify a 22-gene lymph node metastasis signature (LNMsig) and a 40-gene overall survival signature (OSsig). To facilitate future clinical implementation and increase performance, these signatures were transferred to quantitative polymerase chain reaction (qPCR) assays and validated in an independent cohort of 125 HPV-negative tumors. When applied in the clinically relevant subgroup of early-stage (cT1-2N0) OSCC, the LNMsig could prevent overtreatment in two-thirds of the patients. Additionally, the integration of RT-qPCR gene signatures with clinical and pathological variables provided accurate prognostic models for oral cancer, strongly outperforming TNM. Finally, the OSsig gene signature identified a subpopulation of patients, currently considered at low-risk for disease-related survival, who showed an unexpected poor prognosis. These well-validated models will assist in personalizing primary treatment with respect to neck dissection and adjuvant therapies.
|
36
|
GeneBreak: detection of recurrent DNA copy number aberration-associated chromosomal breakpoints within genes. F1000Res 2017; 5:2340. [PMID: 28713543 PMCID: PMC5500957 DOI: 10.12688/f1000research.9259.2] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Accepted: 07/05/2017] [Indexed: 01/23/2023] Open
Abstract
Development of cancer is driven by somatic alterations, including numerical and structural chromosomal aberrations. Currently, several computational methods are available and are widely applied to detect numerical copy number aberrations (CNAs) of chromosomal segments in tumor genomes. However, there is a lack of computational methods that systematically detect structural chromosomal aberrations by virtue of the genomic location of CNA-associated chromosomal breaks and identify genes that appear non-randomly affected by chromosomal breakpoints across (large) series of tumor samples. ‘GeneBreak’ is developed to systematically identify genes recurrently affected by the genomic location of chromosomal CNA-associated breaks by a genome-wide approach, which can be applied to DNA copy number data obtained by array-Comparative Genomic Hybridization (CGH) or by (low-pass) whole genome sequencing (WGS). First, ‘GeneBreak’ collects the genomic locations of chromosomal CNA-associated breaks that were previously pinpointed by the segmentation algorithm that was applied to obtain CNA profiles. Next, a tailored annotation approach for breakpoint-to-gene mapping is implemented. Finally, dedicated cohort-based statistics is incorporated with correction for covariates that influence the probability to be a breakpoint gene. In addition, multiple testing correction is integrated to reveal recurrent breakpoint events. This easy-to-use algorithm, ‘GeneBreak’, is implemented in R (www.cran.r-project.org) and is available from Bioconductor (www.bioconductor.org/packages/release/bioc/html/GeneBreak.html).
|
37
|
Better diagnostic signatures from RNAseq data through use of auxiliary co-data. Bioinformatics 2017; 33:1572-1574. [PMID: 28073760 DOI: 10.1093/bioinformatics/btw837] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/13/2016] [Indexed: 02/06/2023] Open
Abstract
SUMMARY Our aim is to improve omics based prediction and feature selection using multiple sources of auxiliary information: co-data. Adaptive group regularized ridge regression (GRridge) was proposed to achieve this by estimating additional group-based penalty parameters through an empirical Bayes method at a low computational cost. We illustrate the GRridge method and software on RNA sequencing datasets. The method boosts the performance of an ordinary ridge regression and outperforms other classifiers. Post-hoc feature selection maintains the predictive ability of the classifier with far fewer markers. AVAILABILITY AND IMPLEMENTATION GRridge is an R package that includes a vignette. It is freely available at https://bioconductor.org/packages/GRridge/. All information and R scripts used in this study, including those on retrieval and processing of the co-data, are available from http://github.com/markvdwiel/GRridgeCodata. CONTACT mark.vdwiel@vumc.nl. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
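The group-based penalty parameters mentioned in this summary can be written out as a penalized likelihood. The notation below is our own and is a sketch rather than the formulation in the paper: it assumes a single partition of the variables into non-overlapping co-data groups, with a global penalty and one multiplier per group.

```latex
% Assumed notation (not taken from the abstract):
%   \ell(\beta; y, X)  log-likelihood of the regression model
%   \lambda            global ridge penalty
%   \lambda_g          penalty multiplier of co-data group g (estimated by empirical Bayes)
%   \mathcal{G}_g      index set of the variables in group g, g = 1, ..., G
\hat{\beta}
  = \arg\max_{\beta}
    \left\{
      \ell(\beta; y, X)
      - \lambda \sum_{g=1}^{G} \lambda_g \sum_{j \in \mathcal{G}_g} \beta_j^{2}
    \right\}
```

With every multiplier equal to 1 this reduces to ordinary ridge regression, which matches the summary's framing of GRridge as a boost to ordinary ridge; a group with a multiplier below 1 is penalized less and so can contribute more to the prediction.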
|
38
|
An empirical Bayes approach to network recovery using external knowledge. Biom J 2017; 59:932-947. [PMID: 28393396 DOI: 10.1002/bimj.201600090] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Received: 04/16/2016] [Revised: 11/22/2016] [Accepted: 12/04/2016] [Indexed: 11/12/2022]
Abstract
Reconstruction of a high-dimensional network may benefit substantially from the inclusion of prior knowledge on the network topology. In the case of gene interaction networks, such knowledge may come, for instance, from pathway repositories like KEGG, or be inferred from data of a pilot study. The Bayesian framework provides a natural means of including such prior knowledge. Based on a Bayesian Simultaneous Equation Model, we develop an appealing Empirical Bayes (EB) procedure that automatically assesses the agreement of the used prior knowledge with the data at hand. We use a variational Bayes method to approximate the posterior densities and compare its accuracy with that of a Gibbs sampling strategy. Our method is computationally fast, and can outperform known competitors. In a simulation study, we show that accurate prior data can greatly improve the reconstruction of the network, but need not harm the reconstruction if wrong. We demonstrate the benefits of the method in an analysis of gene expression data from GEO. In particular, the edges of the recovered network have superior reproducibility (compared to that of competitors) over resampled versions of the data.
|
39
|
Abstract
Reconstructing a gene network from high-throughput molecular data is an important but challenging task, as the number of parameters to estimate is easily much larger than the sample size. A conventional remedy is to regularize or penalize the model likelihood. In network models, this is often done locally in the neighbourhood of each node or gene. However, estimation of the many regularization parameters is often difficult and can result in large statistical uncertainties. In this paper we propose to combine local regularization with global shrinkage of the regularization parameters to borrow strength between genes and improve inference. We employ a simple Bayesian model with non-sparse, conjugate priors to facilitate the use of fast variational approximations to posteriors. We discuss empirical Bayes estimation of hyper-parameters of the priors, and propose a novel approach to rank-based posterior thresholding. Using extensive model- and data-based simulations, we demonstrate that the proposed inference strategy outperforms popular (sparse) methods, yields more stable edges, and is more reproducible. The proposed method, termed ShrinkNet, is then applied to Glioblastoma to investigate the interactions between genes associated with patient survival.
|
40
|
Abstract A07: Detection of structural variants and recurrent breakpoint genes in colorectal adenoma-to-carcinoma progression. Cancer Res 2017. [DOI: 10.1158/1538-7445.crc16-a07] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/16/2022]
Abstract
Background and Aim: The transition of pre-malignant colorectal adenomas (CRAs) into malignant colorectal carcinomas (CRCs) is accompanied by accumulation of somatic genetic alterations. The contribution of small nucleotide variants (SNVs, such as gene point mutations) and numerical DNA copy number aberrations (CNAs, such as gains and losses of large chromosomal segments) in tumor genomes has been studied extensively. In contrast, studies that systematically detect recurrent structural variants (SVs, such as chromosomal breaks) across large series of clinically well-defined samples are scarce. We recently developed “GeneBreak”, a computational method to identify genes that are recurrently affected by the genomic location of CNA-associated chromosomal breaks, and demonstrated high prevalence and clinical relevance of recurrent breakpoint genes in advanced CRCs (PLoS One 2015;10(9):e0138141). Using a similar approach, the present study aimed to determine what recurrent breakpoint genes contribute to colorectal adenoma-to-carcinoma progression.
Methods: Differences in somatic CNA-associated gene breakpoint frequencies in 466 CRC and 118 CRA samples were examined using high-resolution array-comparative genomic hybridization (aCGH) profiles. Pearson's Chi-square statistic was applied to determine differences in gene breakpoint frequencies between CRC and CRA samples.
Results: In total 21 recurrent breakpoint genes were more frequently affected in CRCs compared to CRAs (p<0.05). MACROD2 was affected by chromosomal breakpoints in 40% of CRCs while not being affected in CRAs. The frequencies of SVs in the other 20 recurrent breakpoint genes ranged from 5% to 29% in CRCs, and from 0% to 2% in CRAs.
Conclusion: We identified 21 recurrent breakpoint genes that are frequently affected by SVs in CRCs but not in CRAs. Additional studies are needed to further elaborate the biological and clinical role of these 21 candidate driver genes for colorectal tumor progression.
Citation Format: Evert van den Broek, Maurits J.J. Dijkstra, Quirinus J.M. Voorham, Beatriz Carvalho, Mark A. van de Wiel, Sanne Abeln, Gerrit A. Meijer, Remond J.A. Fijneman. Detection of structural variants and recurrent breakpoint genes in colorectal adenoma-to-carcinoma progression. [abstract]. In: Proceedings of the AACR Special Conference on Colorectal Cancer: From Initiation to Outcomes; 2016 Sep 17-20; Tampa, FL. Philadelphia (PA): AACR; Cancer Res 2017;77(3 Suppl):Abstract nr A07.
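The frequency comparison named in the Methods of this abstract (Pearson's chi-square statistic on gene breakpoint frequencies in CRC versus CRA samples) can be sketched for a single gene as follows. This is a minimal illustration, not the authors' analysis: the function name `chi_square_2x2` is hypothetical, no continuity correction or multiple-testing correction is applied, and the count of 186 affected CRCs is an approximation back-calculated from the reported 40% of 466 samples, since the abstract gives only percentages.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson's chi-square (no continuity correction) for the table [[a, b], [c, d]].

    Returns (statistic, p_value) with 1 degree of freedom.
    """
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    observed = ((a, b), (c, d))
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed[i][j] - expected) ** 2 / expected
    # For 1 degree of freedom, the chi-square upper tail equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Assumed counts for MACROD2: about 40% of 466 CRCs affected, 0 of 118 CRAs affected.
crc_hit, crc_total = 186, 466
cra_hit, cra_total = 0, 118

stat, p = chi_square_2x2(crc_hit, crc_total - crc_hit,
                         cra_hit, cra_total - cra_hit)
print(f"chi-square = {stat:.1f}, p = {p:.2e}")
```

In a genome-wide setting the same test would be repeated per gene, which is why a correction for multiple testing would normally follow; that step is left out of this sketch.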
|
41
|
Erratum to: Assessment of predictive performance in incomplete data by combining internal validation and multiple imputation. BMC Med Res Methodol 2016; 16:170. [PMID: 27919231 PMCID: PMC5139144 DOI: 10.1186/s12874-016-0271-7] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Indexed: 11/10/2022]
|
42
|
Abstract 4928: MiR expression profiles can predict response to systemic treatment in patients with advanced colorectal cancer. Cancer Res 2016. [DOI: 10.1158/1538-7445.am2016-4928] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/16/2022]
Abstract
Background and aim: Patients with advanced colorectal cancer (mCRC) are commonly treated with systemic fluoropyrimidine-based regimens, which are ineffective in 20-25% of patients. Currently, criteria to predict which patients will respond to this treatment are lacking. The aim of this study is to identify patients who will respond to first-line fluoropyrimidine-based treatment using microRNA (miR) expression profiles, in order to avoid ineffective treatment.
Material and methods: Total RNA was isolated from 88 fresh frozen colorectal cancer tissue samples consisting of ≥ 70% tumor cells, collected from patients with mCRC. MiR expression profiles were generated by next generation sequencing using the Illumina HiSeq 2000 platform. Clinical and pathological data, including treatment response based on RECIST criteria, were collected for all patients. Class prediction and miR selection were performed using the GRridge package in R. Penalized selection and internal cross validation were used to select miRs predictive for treatment response.
Results: Next generation sequencing resulted in a mean of 10,087,107 (range 6,114,932 to 74,313,067) reads per sample, corresponding to 2567 unique mature miR sequences, including 457 novel candidate and 2110 known miR sequences (miRBase version 19). Penalized regression analysis on tumor-specific miRs identified an expression profile that was predictive for clinical benefit (defined as response and stable disease) from first-line treatment.
Conclusion: MiR profiling of CRC tissue samples makes it possible to predict response to first-line fluoropyrimidine-based treatment in patients with mCRC. We foresee that selection of treatment using miR expression profiling will avoid unnecessary treatment-related toxicity and improve outcome for patients with mCRC.
Citation Format: Dennis Poel, Maarten Neerincx, Daoud L.S. Sie, Nicole C.T. van Grieken, R Shankaraiah, F. S.W. van der Wolf, J. H.T.M van Waesberghe, J. D. Burggraaf, Paul P. Eijk, Bauke Ylstra, Cees Verhoef, Mark A. van de Wiel, Henk M.W. Verheul, Tineke E. Buffart. MiR expression profiles can predict response to systemic treatment in patients with advanced colorectal cancer. [abstract]. In: Proceedings of the 107th Annual Meeting of the American Association for Cancer Research; 2016 Apr 16-20; New Orleans, LA. Philadelphia (PA): AACR; Cancer Res 2016;76(14 Suppl):Abstract nr 4928.
|
43
|
Reduced genomic tumor heterogeneity after neoadjuvant chemotherapy is related to favorable outcome in patients with esophageal adenocarcinoma. Oncotarget 2016; 7:44084-44095. [PMID: 27286451 PMCID: PMC5190081 DOI: 10.18632/oncotarget.9857] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.9] [Received: 03/06/2016] [Accepted: 04/29/2016] [Indexed: 11/25/2022]
Abstract
Neoadjuvant chemo(radio)therapy followed by surgery is the standard of care for patients with locally advanced resectable esophageal adenocarcinoma (EAC). There is increasing evidence that drug resistance might be related to genomic heterogeneity. We investigated whether genomic tumor heterogeneity is different after cytotoxic chemotherapy and is associated with EAC patient survival. We used arrayCGH and a quantitative assessment of the whole genome DNA copy number aberration patterns ('DNA copy number entropy') to establish the level of genomic tumor heterogeneity in 80 EAC treated with neoadjuvant chemotherapy followed by surgery (CS group) or surgery alone (S group). The association between DNA copy number entropy, clinicopathological variables and survival was investigated. DNA copy number entropy was reduced after chemotherapy, even if there was no morphological evidence of response to therapy (p<0.001). Low DNA copy number entropy was associated with improved survival in the CS group (p=0.011) but not in the S group (p=0.396). Our results suggest that cytotoxic chemotherapy reduces DNA copy number entropy, which might be a more sensitive tumor response marker than changes in the morphological tumor phenotype. The use of DNA copy number entropy in clinical practice will require validation of our results in a prospective study.
|
44
|
P1‐201: Quality Indicators of Pre‐Analytical Variation in Cerebrospinal Fluid Detected with Aptamer Screening. Alzheimers Dement 2016. [DOI: 10.1016/j.jalz.2016.06.949] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/28/2022]
|
45
|
Testing for association between RNA-Seq and high-dimensional data. BMC Bioinformatics 2016; 17:118. [PMID: 26951498 PMCID: PMC4782413 DOI: 10.1186/s12859-016-0961-5] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 10/12/2015] [Accepted: 02/18/2016] [Indexed: 11/13/2022]
Abstract
Background Testing for association between RNA-Seq and other genomic data is challenging due to high variability of the former and high dimensionality of the latter. Results Using the negative binomial distribution and a random-effects model, we develop an omnibus test that overcomes both difficulties. It may be conceptualised as a test of overall significance in regression analysis, where the response variable is overdispersed and the number of explanatory variables exceeds the sample size. Conclusions The proposed test can detect genetic and epigenetic alterations that affect gene expression. It can examine complex regulatory mechanisms of gene expression. The R package globalSeq is available from Bioconductor. Electronic supplementary material The online version of this article (doi:10.1186/s12859-016-0961-5) contains supplementary material, which is available to authorized users.
|
46
|
Abstract 52: QDNAseq: A bioinformatics pipeline for DNA copy number analysis from shallow whole genome sequencing with noise levels near the probabilistic lower limit imposed by read counting. Clin Cancer Res 2016. [DOI: 10.1158/1557-3265.pmsclingen15-52] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/16/2022]
Abstract
The emerging field of precision medicine considers the patient's individual genetic make up, environment, and lifestyle to arrive at individualized treatment. For cancer patients, analysis of the tumour genome can reveal actionable alterations, including mutations (e.g., BRAF mutation in melanoma), translocations (e.g., t(9;22)(q34;q11) in CML) and copy number changes (e.g., HER2 amplification in breast cancer). Formalin-fixed paraffin embedded (FFPE) biopsy or surgical pathology samples are often the only materials available for analysis, highlighting the need for genomic analysis techniques that perform reliably with such materials. We present a complete pipeline for DNA copy number profiling by sequencing called quantitative DNA sequencing (QDNAseq), which offers a cost effective procedure for genome-wide DNA copy number analysis, facilitates correct biological interpretations from samples with a varying range of quality and is suitable for routine pathology diagnostics from fresh, frozen and FFPE materials.
QDNAseq is implemented in R and made available through Bioconductor and Galaxy. The pipeline requires as little as ∼0.1× genome coverage (∼6 million single-end 50nt reads per sample) and utilizes a depth of coverage (DOC) approach. A two-dimensional LOESS correction is applied for GC content and mappability, and data from the 1000 Genomes Project are used to compile a blacklist of problematic genome regions. Downstream analyses have been included in QDNAseq by adapting algorithms previously developed for segmentation (DNAcopy) and the calling of gains and losses (CGHcall).
QDNAseq is able to produce profiles with noise levels near the probabilistic lower limit imposed by read counting and yields a higher resolution than is available from microarrays. Some samples, however, have variances clearly above the probabilistic lower limit and are characterized by a wave pattern. This pattern originates in the sample, as it is reproducible, both in magnitude and in shape along the genome. Much of this wave bias can be reduced by regression with profiles from matched normal FFPE DNA samples using the algorithm implemented in NoWaves.
The QDNAseq procedure is highly cost effective and robust. We have processed over 1500 clinical samples obtained from the FFPE archives of more than 25 institutions from Europe and North America.
Citation Format: Daoud Sie, Ilari Scheinin, Stef van Lieshout, Martijn Cordes, Daniel Pinkel, Donna G. Albertson, Mark A. van de Wiel, Bauke Ylstra. QDNAseq: A bioinformatics pipeline for DNA copy number analysis from shallow whole genome sequencing with noise levels near the probabilistic lower limit imposed by read counting. [abstract]. In: Proceedings of the AACR Precision Medicine Series: Integrating Clinical Genomics and Cancer Therapy; Jun 13-16, 2015; Salt Lake City, UT. Philadelphia (PA): AACR; Clin Cancer Res 2016;22(1_Suppl):Abstract nr 52.
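The coverage figure quoted in this abstract (about 6 million single-end 50nt reads giving roughly 0.1× coverage) and the idea of a noise floor set by read counting can be checked with a few lines of arithmetic. This is a back-of-the-envelope sketch, not part of the QDNAseq pipeline: the genome size of 3.1 Gb and the 15 kb bin size are assumptions made for illustration, the function names are hypothetical, and the Poisson model ignores GC content, mappability and the wave pattern discussed above.

```python
import math

HUMAN_GENOME_BP = 3.1e9   # assumption: approximate haploid human genome size
BIN_SIZE_BP = 15_000      # assumption: illustrative bin width for read counting


def genome_coverage(n_reads, read_length_nt, genome_size_bp):
    """Average depth of coverage: total sequenced bases divided by genome size."""
    return n_reads * read_length_nt / genome_size_bp


def expected_reads_per_bin(n_reads, bin_size_bp, genome_size_bp):
    """Expected read count per bin if reads fall uniformly along the genome."""
    return n_reads * bin_size_bp / genome_size_bp


n_reads, read_length = 6_000_000, 50
coverage = genome_coverage(n_reads, read_length, HUMAN_GENOME_BP)
reads_per_bin = expected_reads_per_bin(n_reads, BIN_SIZE_BP, HUMAN_GENOME_BP)

# Under pure Poisson counting the variance equals the mean, so the relative
# standard deviation of a bin count is 1 / sqrt(mean). This is the lower bound
# on noise that read counting alone imposes in this simplified model.
relative_sd = 1.0 / math.sqrt(reads_per_bin)

print(f"coverage ~ {coverage:.3f}x")
print(f"expected reads per {BIN_SIZE_BP} bp bin ~ {reads_per_bin:.1f}")
print(f"Poisson relative SD per bin ~ {relative_sd:.3f}")
```

Larger bins raise the expected count per bin and so lower the counting noise, at the cost of genomic resolution.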
|
47
|
Better prediction by use of co-data: adaptive group-regularized ridge regression. Stat Med 2015; 35:368-81. [PMID: 26365903 DOI: 10.1002/sim.6732] [Citation(s) in RCA: 49] [Impact Index Per Article: 5.4] [Received: 11/17/2014] [Revised: 05/22/2015] [Accepted: 08/22/2015] [Indexed: 12/23/2022]
Abstract
For many high-dimensional studies, additional information on the variables, like (genomic) annotation or external p-values, is available. In the context of binary and continuous prediction, we develop a method for adaptive group-regularized (logistic) ridge regression, which makes structural use of such 'co-data'. Here, 'groups' refer to a partition of the variables according to the co-data. We derive empirical Bayes estimates of group-specific penalties, which possess several nice properties: (i) They are analytical. (ii) They adapt to the informativeness of the co-data for the data at hand. (iii) Only one global penalty parameter requires tuning by cross-validation. In addition, the method allows use of multiple types of co-data at little extra computational effort. We show that the group-specific penalties may lead to a larger distinction between 'near-zero' and relatively large regression parameters, which facilitates post hoc variable selection. The method, termed GRridge, is implemented in an easy-to-use R-package. It is demonstrated on two cancer genomics studies, which both concern the discrimination of precancerous cervical lesions from normal cervix tissues using methylation microarray data. For both examples, GRridge clearly improves the predictive performances of ordinary logistic ridge regression and the group lasso. In addition, we show that for the second study, the relatively good predictive performance is maintained when selecting only 42 variables.
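The core idea of group-specific penalties described in this abstract can be illustrated with a plain linear ridge regression in which each variable's penalty is set by its co-data group. This is a minimal sketch under strong assumptions and is not the GRridge package or its API: the function names `solve` and `group_ridge` are hypothetical, the penalties are fixed by hand rather than estimated by empirical Bayes, the response is continuous rather than binary, and the design matrix is an identity matrix chosen so that the shrinkage can be read off directly.

```python
def solve(A, b):
    """Solve the linear system A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def group_ridge(X, y, groups, penalties):
    """Minimise ||y - X b||^2 + sum_j penalties[groups[j]] * b_j^2.

    groups[j] is the co-data group of variable j; penalties maps group -> lambda.
    """
    p = len(X[0])
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
    Xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
    for j in range(p):
        XtX[j][j] += penalties[groups[j]]  # group-specific ridge penalty
    return solve(XtX, Xty)


# Toy data: orthonormal design, so each coefficient equals y_j / (1 + lambda_group).
X = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.0],
     [0.0, 0.0, 0.0, 1.0]]
y = [1.0, 1.0, 1.0, 1.0]
groups = [0, 0, 1, 1]            # e.g. group 0 = variables flagged as informative by co-data
penalties = {0: 0.1, 1: 10.0}    # hand-picked: weak shrinkage for group 0, strong for group 1

beta = group_ridge(X, y, groups, penalties)
print([round(b, 4) for b in beta])
```

With unequal penalties the weakly penalised group keeps coefficients near the unpenalised value while the strongly penalised group is pushed towards zero, which is the kind of separation the abstract says facilitates post hoc variable selection.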
|
48
|
Molecular imaging of aurora kinase A (AURKA) expression: Synthesis and preclinical evaluation of radiolabeled alisertib (MLN8237). Nucl Med Biol 2015; 43:63-72. [PMID: 26432753 DOI: 10.1016/j.nucmedbio.2015.08.007] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.9] [Received: 06/16/2015] [Revised: 08/25/2015] [Accepted: 08/31/2015] [Indexed: 11/29/2022]
Abstract
INTRODUCTION Survival of patients after resection of colorectal cancer liver metastasis (CRCLM) is 36%-58%. Positron emission tomography (PET) tracers, imaging the expression of prognostic biomarkers, may contribute to assign appropriate management to individual patients. Aurora kinase A (AURKA) expression is associated with survival of patients after CRCLM resection. METHODS We synthesized [(3)H]alisertib and [(11)C]alisertib, starting from [(3)H]methyl nosylate and [(11)C]methyl iodide, respectively. We measured in vitro uptake of [(3)H]alisertib in cancer cells with high (Caco2), moderate (A431, HCT116, SW480) and low (MKN45) AURKA expression, before and after siRNA-mediated AURKA downmodulation, as well as after inhibition of P-glycoprotein (P-gp) activity. We measured in vivo uptake and biodistribution of [(11)C]alisertib in nude mice, xenografted with A431, HCT116 or MKN45 cells, or P-gp knockout mice. RESULTS [(3)H]Alisertib was synthesized with an overall yield of 42% and [(11)C]alisertib with an overall yield of 23%±9% (radiochemical purity ≥99%). Uptake of [(3)H]alisertib in Caco2 cells was higher than in A431 cells (P=.02) and higher than in SW480, HCT116 and MKN45 cells (P<.01). Uptake in A431 cells was higher than in SW480, HCT116 and MKN45 cells (P<.01). Downmodulation of AURKA expression reduced [(3)H]alisertib uptake in Caco2 cells (P<.01). P-gp inhibition increased [(3)H]alisertib uptake in Caco2 (P<.01) and MKN45 (P<.01) cells. In vivo stability of [(11)C]alisertib 90 min post-injection was 94.7%±1.3% and tumor-to-background ratios were 2.3±0.8 (A431), 1.6±0.5 (HCT116) and 1.9±0.5 (MKN45). In brains of P-gp knockout mice [(11)C]alisertib uptake was increased compared to uptake in wild-type mice (P<.01). CONCLUSIONS Radiolabeled alisertib can be synthesized and may have potential for the imaging of AURKA, particularly when AURKA expression is high. However, the exact mechanisms underlying alisertib accumulation need further investigation.
ADVANCES IN KNOWLEDGE AND IMPLICATIONS FOR PATIENT CARE Radiolabeled alisertib may be used for non-invasively measuring AURKA protein expression and to stratify patients for treatment accordingly.
|
49
|
Comparison of deep sequencing miRNA expression analysis in primary colorectal cancer and paired metastases. J Clin Oncol 2015. [DOI: 10.1200/jco.2015.33.15_suppl.e14682] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/20/2022]
|
50
|
HCsnip: An R Package for Semi-supervised Snipping of the Hierarchical Clustering Tree. Cancer Inform 2015; 14:1-19. [PMID: 25861213 PMCID: PMC4372030 DOI: 10.4137/cin.s22080] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 11/21/2014] [Revised: 01/05/2015] [Accepted: 01/06/2015] [Indexed: 11/05/2022]
Abstract
Hierarchical clustering (HC) is one of the most frequently used methods in computational biology for the analysis of high-dimensional genomics data. Given a data set, HC outputs a binary tree whose leaves are the data points and whose internal nodes represent clusters of various sizes. Normally, a fixed-height cut on the HC tree is chosen, and each contiguous branch of data points below that height is considered a separate cluster. However, the fixed-height branch cut may not be ideal in situations where one expects a complicated tree structure with nested clusters. Furthermore, because related background information is not used in selecting the cutoff, the induced clusters are often difficult to interpret. This paper describes a novel procedure that aims to automatically extract meaningful clusters from the HC tree in a semi-supervised way. The procedure is implemented in the R package HCsnip, available from Bioconductor. Rather than cutting the HC tree at a fixed height, HCsnip probes the various ways of snipping, possibly at variable heights, to tease out hidden clusters ensconced deep down in the tree. The cluster extraction process utilizes, along with the data set from which the HC tree is derived, commonly available background information. Consequently, the extracted clusters are highly reproducible and robust against the various sources of variation that "haunt" high-dimensional genomics data. Since the clustering process is guided by the background information, clusters are easy to interpret. Unlike existing packages, no constraint is placed on the data type on which clustering is desired. In particular, the package accepts patient follow-up data for guiding the cluster extraction process. To our knowledge, HCsnip is the first package that is able to decompose the HC tree into clusters with piecewise snipping under the guidance of patient time-to-event information.
Our implementation of the semi-supervised HC tree snipping framework is generic, and can be combined with other algorithms that operate on detected clusters.
|