1
Li Z, Zhong W, Liao W, Zhao J, Yu M, He G. A Novel Clustering Method Based on Adjacent Grids Searching. Entropy (Basel) 2023; 25:1342. [PMID: 37761640 PMCID: PMC10528124 DOI: 10.3390/e25091342] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/03/2023] [Revised: 09/06/2023] [Accepted: 09/12/2023] [Indexed: 09/29/2023]
Abstract
Clustering is used to analyze the intrinsic structure of a dataset based on the similarity of datapoints. Its widespread use, from image segmentation to object recognition and information retrieval, requires great robustness in the clustering process. In this paper, a novel clustering method based on adjacent grid searching (CAGS) is proposed. The CAGS consists of two steps: a strategy based on adaptive grid-space construction and a clustering strategy based on adjacent grid searching. In the first step, a multidimensional grid space is constructed to provide a quantization structure of the input dataset. The noise and cluster halo are automatically distinguished according to grid density. Moreover, the adaptive grid generating process solves the common problem of grid clustering, in which the number of cells increases sharply with the dimension. In the second step, a two-stage traversal process is conducted to accomplish the cluster recognition. The cluster cores with arbitrary shapes can be found by concealing the halo points. As a result, the number of clusters will be easily identified by CAGS. Therefore, CAGS has the potential to be widely used for clustering datasets with different characteristics. We test the clustering performance of CAGS through six different types of datasets: dataset with noise, large-scale dataset, high-dimensional dataset, dataset with arbitrary shapes, dataset with large differences in density between classes, and dataset with high overlap between classes. Experimental results show that CAGS, which performed best on 10 out of 11 tests, outperforms the state-of-the-art clustering methods in all the above datasets.
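The two steps described in this abstract (quantize the data onto a grid, then grow clusters by searching adjacent cells) can be illustrated with a minimal sketch. This is not the authors' CAGS algorithm: it assumes a fixed, uniform cell size rather than an adaptive grid, uses a single density threshold to separate dense cells from noise, and does not model the halo. The names `grid_cluster`, `cell_size`, and `min_density` are hypothetical.

```python
from collections import defaultdict, deque
from itertools import product


def grid_cluster(points, cell_size, min_density):
    """Label points by flood-filling adjacent dense grid cells.

    Returns one label per point; -1 marks points that fall in sparse cells.
    """
    # Step 1: quantize each point to the integer index of its grid cell.
    cells = defaultdict(list)
    for i, p in enumerate(points):
        key = tuple(int(c // cell_size) for c in p)
        cells[key].append(i)

    # Cells with too few points are treated as noise (assumed rule).
    dense = {k for k, idx in cells.items() if len(idx) >= min_density}

    dim = len(points[0]) if points else 0
    # Offsets to every neighbouring cell (shared face, edge, or corner).
    offsets = [o for o in product((-1, 0, 1), repeat=dim) if any(o)]

    # Step 2: breadth-first search over adjacent dense cells.
    cell_label = {}
    next_label = 0
    for start in sorted(dense):
        if start in cell_label:
            continue
        cell_label[start] = next_label
        queue = deque([start])
        while queue:
            cur = queue.popleft()
            for off in offsets:
                nb = tuple(c + o for c, o in zip(cur, off))
                if nb in dense and nb not in cell_label:
                    cell_label[nb] = next_label
                    queue.append(nb)
        next_label += 1

    # Propagate each cell's label to the points it contains.
    labels = [-1] * len(points)
    for key, idx in cells.items():
        if key in cell_label:
            for i in idx:
                labels[i] = cell_label[key]
    return labels


points = [(0.1, 0.1), (0.2, 0.3), (0.4, 0.2), (1.1, 0.2), (1.3, 0.4),
          (5.0, 5.0), (5.1, 5.2), (5.3, 5.1), (9.0, 0.0)]
labels = grid_cluster(points, cell_size=1.0, min_density=2)
```

Only occupied cells are stored, so memory grows with the number of points rather than with the full grid. The neighbour enumeration still visits 3^d - 1 offsets per cell, which is one reason a naive grid search of this kind becomes impractical in high dimensions.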
Affiliation(s)
- Zhimeng Li
  - School of Control and Mechanical Engineering, Tianjin Chengjian University, Tianjin 300384, China
- Wen Zhong
  - School of Control and Mechanical Engineering, Tianjin Chengjian University, Tianjin 300384, China
- Weiwen Liao
  - School of Control and Mechanical Engineering, Tianjin Chengjian University, Tianjin 300384, China
- Jian Zhao
  - School of Control and Mechanical Engineering, Tianjin Chengjian University, Tianjin 300384, China
- Ming Yu
  - School of Computer and Information Engineering, Tianjin Chengjian University, Tianjin 300384, China
- Gaiyun He
  - School of Mechanical Engineering, Tianjin University, Tianjin 300072, China
2
Courbariaux M, De Santiago K, Dalmasso C, Danjou F, Bekadar S, Corvol JC, Martinez M, Szafranski M, Ambroise C. A Sparse Mixture-of-Experts Model With Screening of Genetic Associations to Guide Disease Subtyping. Front Genet 2022; 13:859462. [PMID: 35734430 PMCID: PMC9207464 DOI: 10.3389/fgene.2022.859462] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/21/2022] [Accepted: 04/21/2022] [Indexed: 11/27/2022]
Abstract
Motivation: Identifying new genetic associations in non-Mendelian complex diseases is an increasingly difficult challenge. These diseases sometimes appear to have a significant component of heritability requiring explanation, and this missing heritability may be due to the existence of subtypes involving different genetic factors. Taking genetic information into account in clinical trials might potentially have a role in guiding the process of subtyping a complex disease. Most methods dealing with multiple sources of information rely on data transformation, and in disease subtyping, the two main strategies used are 1) the clustering of clinical data followed by posterior genetic analysis and 2) the concomitant clustering of clinical and genetic variables. Both of these strategies have limitations that we propose to address. Contribution: This work proposes an original method for disease subtyping on the basis of both longitudinal clinical variables and high-dimensional genetic markers via a sparse mixture-of-regressions model. The added value of our approach lies in its interpretability in relation to two aspects. First, our model links both clinical and genetic data with regard to their initial nature (i.e., without transformation) and does not require post-processing where the original information is accessed a second time to interpret the subtypes. Second, it can address large-scale problems because of a variable selection step that is used to discard genetic variables that may not be relevant for subtyping. Results: The proposed method was validated on simulations. A dataset from a cohort of Parkinson's disease patients was also analyzed. Several subtypes of the disease and genetic variants that potentially have a role in this typology were identified. Software availability: The R code for the proposed method, named DiSuGen, and a tutorial are available for download (see the references).
Affiliation(s)
- Marie Courbariaux
  - Université Paris-Saclay, CNRS, Université d’Évry, Laboratoire de Mathématiques et Modélisation d’Évry, Évry-Courcouronnes, France
- Kylliann De Santiago
  - Université Paris-Saclay, CNRS, Université d’Évry, Laboratoire de Mathématiques et Modélisation d’Évry, Évry-Courcouronnes, France
- Cyril Dalmasso
  - Université Paris-Saclay, CNRS, Université d’Évry, Laboratoire de Mathématiques et Modélisation d’Évry, Évry-Courcouronnes, France
- Fabrice Danjou
  - Sorbonne Université, Paris Brain Institute–ICM, Inserm, CNRS, Assistance Publique Hôpitaux de Paris, Pitié-Salpêtrière Hospital, Department of Neurology, Paris, France
- Samir Bekadar
  - Sorbonne Université, Paris Brain Institute–ICM, Inserm, CNRS, Assistance Publique Hôpitaux de Paris, Pitié-Salpêtrière Hospital, Department of Neurology, Paris, France
- Jean-Christophe Corvol
  - Sorbonne Université, Paris Brain Institute–ICM, Inserm, CNRS, Assistance Publique Hôpitaux de Paris, Pitié-Salpêtrière Hospital, Department of Neurology, Paris, France
- Maria Martinez
  - Institut de Recherche en Santé Digestive, Inserm, CHU Purpan, Toulouse, France
- Marie Szafranski
  - Université Paris-Saclay, CNRS, Université d’Évry, Laboratoire de Mathématiques et Modélisation d’Évry, Évry-Courcouronnes, France
  - ENSIIE, Évry-Courcouronnes, France
- Christophe Ambroise
  - Université Paris-Saclay, CNRS, Université d’Évry, Laboratoire de Mathématiques et Modélisation d’Évry, Évry-Courcouronnes, France
3
Escribe C, Lu T, Keller-Baruch J, Forgetta V, Xiao B, Richards JB, Bhatnagar S, Oualkacha K, Greenwood CMT. Block coordinate descent algorithm improves variable selection and estimation in error-in-variables regression. Genet Epidemiol 2021; 45:874-890. [PMID: 34468045 PMCID: PMC9292988 DOI: 10.1002/gepi.22430] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 01/22/2021] [Revised: 07/19/2021] [Accepted: 08/12/2021] [Indexed: 11/13/2022]
Abstract
Medical research increasingly includes high‐dimensional regression modeling with a need for error‐in‐variables methods. The Convex Conditioned Lasso (CoCoLasso) utilizes a reformulated Lasso objective function and an error‐corrected cross‐validation to enable error‐in‐variables regression, but requires heavy computations. Here, we develop a Block coordinate Descent Convex Conditioned Lasso (BDCoCoLasso) algorithm for modeling high‐dimensional data that are only partially corrupted by measurement error. This algorithm separately optimizes the estimation of the uncorrupted and corrupted features in an iterative manner to reduce computational cost, with a specially calibrated formulation of cross‐validation error. Through simulations, we show that the BDCoCoLasso algorithm successfully copes with much larger feature sets than CoCoLasso, and as expected, outperforms the naïve Lasso with enhanced estimation accuracy and consistency, as the intensity and complexity of measurement errors increase. Also, a new smoothly clipped absolute deviation penalization option is added that may be appropriate for some data sets. We apply the BDCoCoLasso algorithm to data selected from the UK Biobank. We develop and showcase the utility of covariate‐adjusted genetic risk scores for body mass index, bone mineral density, and lifespan. We demonstrate that by leveraging more information than the naïve Lasso in partially corrupted data, the BDCoCoLasso may achieve higher prediction accuracy. These innovations, together with an R package, BDCoCoLasso, make error‐in‐variables adjustments more accessible for high‐dimensional data sets. We posit the BDCoCoLasso algorithm has the potential to be widely applied in various fields, including genomics‐facilitated personalized medicine research.
Affiliation(s)
- Célia Escribe
  - Lady Davis Institute for Medical Research, Jewish General Hospital, Montreal, Québec, Canada
  - Operations Research Center, Massachusetts Institute of Technology, Cambridge, MA, United States
- Tianyuan Lu
  - Lady Davis Institute for Medical Research, Jewish General Hospital, Montreal, Québec, Canada
  - Quantitative Life Sciences Program, McGill University, Montreal, Québec, Canada
- Julyan Keller-Baruch
  - Lady Davis Institute for Medical Research, Jewish General Hospital, Montreal, Québec, Canada
  - Department of Human Genetics, McGill University, Montreal, Québec, Canada
- Vincenzo Forgetta
  - Lady Davis Institute for Medical Research, Jewish General Hospital, Montreal, Québec, Canada
- Bowei Xiao
  - Lady Davis Institute for Medical Research, Jewish General Hospital, Montreal, Québec, Canada
  - Quantitative Life Sciences Program, McGill University, Montreal, Québec, Canada
- J Brent Richards
  - Lady Davis Institute for Medical Research, Jewish General Hospital, Montreal, Québec, Canada
  - Department of Human Genetics, McGill University, Montreal, Québec, Canada
  - Department of Epidemiology, Biostatistics and Occupational Health, McGill University, Montreal, Québec, Canada
  - Department of Twin Research and Genetic Epidemiology, King's College London, London, United Kingdom
- Sahir Bhatnagar
  - Department of Epidemiology, Biostatistics and Occupational Health, McGill University, Montreal, Québec, Canada
  - Department of Diagnostic Radiology, McGill University, Montreal, Québec, Canada
- Karim Oualkacha
  - Département de Mathématiques, Université du Québec à Montréal, Montreal, Québec, Canada
- Celia M T Greenwood
  - Lady Davis Institute for Medical Research, Jewish General Hospital, Montreal, Québec, Canada
  - Department of Human Genetics, McGill University, Montreal, Québec, Canada
  - Department of Epidemiology, Biostatistics and Occupational Health, McGill University, Montreal, Québec, Canada
  - Gerald Bronfman Department of Oncology, McGill University, Montreal, Québec, Canada
4
Lu AA, Chen Y, Gao X. Broad Coverage Precoding for 3D Massive MIMO with Huge Uniform Planar Arrays. Entropy (Basel) 2021; 23:e23070887. [PMID: 34356428 PMCID: PMC8304731 DOI: 10.3390/e23070887] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/12/2021] [Revised: 07/04/2021] [Accepted: 07/06/2021] [Indexed: 11/20/2022]
Abstract
In this paper, we propose a novel broad coverage precoder design for three-dimensional (3D) massive multi-input multi-output (MIMO) equipped with huge uniform planar arrays (UPAs). The desired two-dimensional (2D) angle power spectrum is assumed to be separable. We use the per-antenna constant power constraint and the semi-unitary constraint which are widely used in the literature. For normal broad coverage precoder design, the dimension of the optimization space is the product of the number of antennas at the base station (BS) and the number of transmit streams. With the proposed method, the design of the high-dimensional precoding matrices is reduced to that of a set of low-dimensional orthonormal vectors, and of a pair of low-dimensional vectors. The dimensions of the vectors in the set and the pair are the number of antennas per column and per row of the UPA, respectively. We then use optimization methods to generate the set of orthonormal vectors and the pair of vectors, respectively. Finally, simulation results show that the proposed broad coverage precoding matrices achieve nearly the same performance as the normal broad coverage precoder with much lower computational complexity.
Affiliation(s)
- An-An Lu
  - National Mobile Communications Research Laboratory (NCRL), Southeast University, Nanjing 210096, China
  - Purple Mountain Laboratories, Nanjing 211111, China
- Yan Chen
  - National Mobile Communications Research Laboratory (NCRL), Southeast University, Nanjing 210096, China
- Xiqi Gao
  - National Mobile Communications Research Laboratory (NCRL), Southeast University, Nanjing 210096, China
  - Purple Mountain Laboratories, Nanjing 211111, China
5
Su T, Wang Y, Liu Y, Branton WG, Asahchop E, Power C, Jiang B, Kong L, Tang N. Sparse Multicategory Generalized Distance Weighted Discrimination in Ultra-High Dimensions. Entropy (Basel) 2020; 22:E1257. [PMID: 33287025 PMCID: PMC7712546 DOI: 10.3390/e22111257] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 09/30/2020] [Revised: 10/26/2020] [Accepted: 11/02/2020] [Indexed: 11/21/2022]
Abstract
Distance weighted discrimination (DWD) is an appealing classification method that is capable of overcoming data piling problems in high-dimensional settings. Especially when various sparsity structures are assumed in these settings, variable selection in multicategory classification poses great challenges. In this paper, we propose a multicategory generalized DWD (MgDWD) method that maintains intrinsic variable group structures during selection using a sparse group lasso penalty. Theoretically, we derive minimizer uniqueness for the penalized MgDWD loss function and consistency properties for the proposed classifier. We further develop an efficient algorithm based on the proximal operator to solve the optimization problem. The performance of MgDWD is evaluated using finite sample simulations and miRNA data from an HIV study.
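The abstract mentions an algorithm built on the proximal operator of a sparse group lasso penalty. As a hedged illustration, the sketch below computes the proximal operator of the generic penalty lam1 * ||b||_1 + lam2 * sum_g ||b_g||_2, which is known to reduce to elementwise soft-thresholding followed by groupwise shrinkage. The function names, the flat coefficient layout, and the unweighted penalty are assumptions; the paper's exact penalty weights and grouping are not reproduced here.

```python
import math


def soft_threshold(x, t):
    """Shrink x toward zero by t (proximal operator of t * |x|)."""
    return math.copysign(max(abs(x) - t, 0.0), x)


def prox_sparse_group_lasso(beta, groups, lam1, lam2):
    """Proximal operator of lam1 * ||b||_1 + lam2 * sum_g ||b_g||_2.

    beta   : list of coefficients
    groups : list of index lists that partition range(len(beta))
    """
    out = [0.0] * len(beta)
    for g in groups:
        # Lasso part: soft-threshold each coefficient in the group.
        s = [soft_threshold(beta[j], lam1) for j in g]
        # Group part: shrink the whole block toward zero by lam2.
        norm = math.sqrt(sum(v * v for v in s))
        scale = max(1.0 - lam2 / norm, 0.0) if norm > 0 else 0.0
        for j, v in zip(g, s):
            out[j] = scale * v
    return out


beta = [4.0, -5.0, 0.5, -0.2]
groups = [[0, 1], [2, 3]]
shrunk = prox_sparse_group_lasso(beta, groups, lam1=1.0, lam2=2.5)
```

In this example the second group is zeroed out entirely by the elementwise step, while the first group survives both steps with reduced magnitude, which is the two-level sparsity the penalty is designed to produce.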
Affiliation(s)
- Tong Su
  - Key Lab of Statistical Modeling and Data Analysis of Yunnan Province, Yunnan University, Kunming 650091, China
- Yafei Wang
  - Department of Mathematical and Statistical Sciences, University of Alberta, Edmonton, AB T6G 2G1, Canada
- Yi Liu
  - Department of Mathematical and Statistical Sciences, University of Alberta, Edmonton, AB T6G 2G1, Canada
- William G. Branton
  - Department of Medicine (Neurology), University of Alberta, Edmonton, AB T6G 2G1, Canada
- Eugene Asahchop
  - Department of Medicine (Neurology), University of Alberta, Edmonton, AB T6G 2G1, Canada
- Christopher Power
  - Department of Medicine (Neurology), University of Alberta, Edmonton, AB T6G 2G1, Canada
- Bei Jiang
  - Department of Mathematical and Statistical Sciences, University of Alberta, Edmonton, AB T6G 2G1, Canada
- Linglong Kong
  - Department of Mathematical and Statistical Sciences, University of Alberta, Edmonton, AB T6G 2G1, Canada
- Niansheng Tang
  - Key Lab of Statistical Modeling and Data Analysis of Yunnan Province, Yunnan University, Kunming 650091, China
6
Mirkes EM, Allohibi J, Gorban A. Fractional Norms and Quasinorms Do Not Help to Overcome the Curse of Dimensionality. Entropy (Basel) 2020; 22:E1105. [PMID: 33286874 PMCID: PMC7597215 DOI: 10.3390/e22101105] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Received: 08/04/2020] [Revised: 09/22/2020] [Accepted: 09/27/2020] [Indexed: 11/25/2022]
Abstract
The curse of dimensionality causes the well-known and widely discussed problems for machine learning methods. There is a hypothesis that using the Manhattan distance and even fractional lp quasinorms (for p less than 1) can help to overcome the curse of dimensionality in classification problems. In this study, we systematically test this hypothesis. It is illustrated that fractional quasinorms have a greater relative contrast and coefficient of variation than the Euclidean norm l2, but it is shown that this difference decays with increasing space dimension. It has been demonstrated that the concentration of distances shows qualitatively the same behaviour for all tested norms and quasinorms. It is shown that a greater relative contrast does not mean a better classification quality. It was revealed that for different databases the best (worst) performance was achieved under different norms (quasinorms). A systematic comparison shows that the difference in the performance of kNN classifiers for lp at p = 0.5, 1, and 2 is statistically insignificant. Analysis of curse and blessing of dimensionality requires careful definition of data dimensionality that rarely coincides with the number of attributes. We systematically examined several intrinsic dimensions of the data.
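The relative contrast discussed in this abstract can be made concrete with a short sketch. It assumes the common definition RC_p = (D_max - D_min) / D_min, where D_max and D_min are the largest and smallest l_p distances from a query point to the data. The function names and the uniform random data in the demo are illustrative assumptions, not the authors' experimental setup.

```python
import random


def lp_dist(x, y, p):
    """l_p distance for p >= 1, or quasinorm distance for 0 < p < 1."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def relative_contrast(data, query, p):
    """(D_max - D_min) / D_min over distances from query to each data point.

    Assumes the query does not coincide with a data point (D_min > 0).
    """
    d = [lp_dist(x, query, p) for x in data]
    return (max(d) - min(d)) / min(d)


if __name__ == "__main__":
    rng = random.Random(0)
    for dim in (2, 20, 200):
        data = [[rng.random() for _ in range(dim)] for _ in range(500)]
        query = [rng.random() for _ in range(dim)]
        row = {p: round(relative_contrast(data, query, p), 3) for p in (0.5, 1, 2)}
        print(dim, row)
```

Running the demo for increasing `dim` is one way to observe how the gap between the contrasts at p = 0.5, 1, and 2 behaves as dimension grows. The abstract's further point, that higher contrast does not imply better classification, requires a classifier comparison that this sketch does not attempt.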
Affiliation(s)
- Evgeny M. Mirkes
  - School of Mathematics and Actuarial Science, University of Leicester, Leicester LE1 7HR, UK
  - Laboratory of Advanced Methods for High-Dimensional Data Analysis, Lobachevsky State University, 603105 Nizhny Novgorod, Russia
- Jeza Allohibi
  - School of Mathematics and Actuarial Science, University of Leicester, Leicester LE1 7HR, UK
  - Department of Mathematics, Taibah University, Janadah Bin Umayyah Road, Tayba, Medina 42353, Saudi Arabia
- Alexander Gorban
  - School of Mathematics and Actuarial Science, University of Leicester, Leicester LE1 7HR, UK
  - Laboratory of Advanced Methods for High-Dimensional Data Analysis, Lobachevsky State University, 603105 Nizhny Novgorod, Russia
7
Abstract
Data from a large number of covariates with known population totals are frequently observed in survey studies. These auxiliary variables contain valuable information that can be incorporated into estimation of the population total of a survey variable to improve the estimation precision. We consider the generalized regression estimator formulated under the model-assisted framework in which a regression model is utilized to make use of the available covariates while the estimator still has basic design-based properties. The generalized regression estimator has been shown to improve the efficiency of the design-based Horvitz-Thompson estimator when the number of covariates is fixed. In this study, we investigate the performance of the generalized regression estimator when the number of covariates p is allowed to diverge as the sample size n increases. We examine two approaches where the model parameter is estimated using the weighted least squares method when p < n and the LASSO method when the model parameter is sparse. We show that under an assisted model and certain conditions on the joint distribution of the covariates as well as the divergence rates of n and p, the generalized regression estimator is asymptotically more efficient than the Horvitz-Thompson estimator, and is robust against model misspecification. We also study the consistency of variance estimation for the generalized regression estimator. Our theoretical results are corroborated by simulation studies and an example.
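For reference, the two estimators compared in this abstract can be written in their standard textbook form. The notation is an assumption, not taken from the paper: s is the sample, π_i the inclusion probability of unit i, x_i the covariate vector with known population total t_x, and λ a LASSO tuning parameter.

```latex
% Horvitz-Thompson estimator of the population total of y
\hat{t}_{y,\mathrm{HT}} = \sum_{i \in s} \frac{y_i}{\pi_i}

% Generalized regression (GREG) estimator: HT plus a regression correction
\hat{t}_{y,\mathrm{GREG}} = \hat{t}_{y,\mathrm{HT}}
  + \bigl( t_x - \hat{t}_{x,\mathrm{HT}} \bigr)^{\top} \hat{\beta},
\qquad
\hat{t}_{x,\mathrm{HT}} = \sum_{i \in s} \frac{x_i}{\pi_i}

% Weighted least squares fit (usable when p < n)
\hat{\beta}_{\mathrm{WLS}} =
  \Bigl( \sum_{i \in s} \frac{x_i x_i^{\top}}{\pi_i} \Bigr)^{-1}
  \sum_{i \in s} \frac{x_i y_i}{\pi_i}

% LASSO fit (for a sparse model parameter)
\hat{\beta}_{\mathrm{LASSO}} = \arg\min_{\beta}
  \sum_{i \in s} \frac{(y_i - x_i^{\top}\beta)^2}{\pi_i}
  + \lambda \lVert \beta \rVert_1
```

The correction term vanishes when the sample-based covariate totals already match the known population totals, which is why the estimator retains its design-based character even if the working regression model is misspecified.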
Affiliation(s)
- Tram Ta
  - Department of Statistics, University of Wisconsin-Madison, Madison, WI 53706, U.S.A.
- Jun Shao
  - School of Statistics, East China Normal University, Shanghai 200241, China
  - Department of Statistics, University of Wisconsin-Madison, Madison, WI 53706, U.S.A.
- Quefeng Li
  - Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, U.S.A.
- Lei Wang
  - School of Statistics and Data Science & LPMC, Nankai University, Tianjin 300071, China
8
Causeur D, Sheu CF, Perthame E, Rufini F. A functional generalized F-test for signal detection with applications to event-related potentials significance analysis. Biometrics 2019; 76:246-256. [PMID: 31301147 DOI: 10.1111/biom.13118] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 10/18/2018] [Accepted: 07/02/2019] [Indexed: 11/28/2022]
Abstract
Motivated by the analysis of complex dependent functional data such as event-related brain potentials (ERP), this paper considers a time-varying coefficient multivariate regression model with fixed-time covariates for testing global hypotheses about population mean curves. Based on a reduced-rank modeling of the time correlation of the stochastic process of pointwise test statistics, a functional generalized F-test is proposed and its asymptotic null distribution is derived. Our analytical results show that the proposed test is more powerful than functional analysis of variance testing methods and competing signal detection procedures for dependent data. Simulation studies confirm such power gain for data with patterns of dependence similar to those observed in ERPs. The new testing procedure is illustrated with an analysis of the ERP data from a study of neural correlates of impulse control.
Affiliation(s)
- David Causeur
  - IRMAR UMR CNRS 6625, Agrocampus Ouest, Rennes Cedex, France
- Ching-Fan Sheu
  - Institute of Education, National Cheng Kung University, Tainan, Taiwan
- Emeline Perthame
  - Bioinformatique et Biostatistique, Bioinformatics and Biostatistics Hub C3BI, USR 3756 IP CNRS, Institut Pasteur, Paris, France
- Flavia Rufini
  - Department of Statistics and Computer Science, Agrocampus Ouest, Rennes Cedex, France
9
Abstract
Developing algorithms for solving high-dimensional partial differential equations (PDEs) has been an exceedingly difficult task for a long time, due to the notoriously difficult problem known as the "curse of dimensionality." This paper introduces a deep learning-based approach that can handle general high-dimensional parabolic PDEs. To this end, the PDEs are reformulated using backward stochastic differential equations and the gradient of the unknown solution is approximated by neural networks, very much in the spirit of deep reinforcement learning with the gradient acting as the policy function. Numerical results on examples including the nonlinear Black-Scholes equation, the Hamilton-Jacobi-Bellman equation, and the Allen-Cahn equation suggest that the proposed algorithm is quite effective in high dimensions, in terms of both accuracy and cost. This opens up possibilities in economics, finance, operational research, and physics, by considering all participating agents, assets, resources, or particles together at the same time, instead of making ad hoc assumptions on their interrelationships.
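The reformulation mentioned in this abstract can be sketched as follows for a semilinear parabolic PDE with a terminal condition. The symbols (μ, σ, f, g, ξ, W) are chosen here for illustration and should be read as assumptions about notation rather than a transcription of the paper.

```latex
% Semilinear parabolic PDE with terminal condition u(T, x) = g(x)
\frac{\partial u}{\partial t}
  + \tfrac{1}{2}\,\mathrm{Tr}\!\bigl(\sigma\sigma^{\top}\,\mathrm{Hess}_x u\bigr)
  + \nabla u \cdot \mu
  + f\bigl(t, x, u, \sigma^{\top}\nabla u\bigr) = 0

% Forward diffusion driven by Brownian motion W
X_t = \xi + \int_0^t \mu(s, X_s)\,ds + \int_0^t \sigma(s, X_s)\,dW_s

% By Ito's formula, the solution along the path satisfies the BSDE
u(t, X_t) = u(0, X_0)
  - \int_0^t f\bigl(s, X_s, u(s, X_s), \sigma^{\top}\nabla u(s, X_s)\bigr)\,ds
  + \int_0^t \bigl[\nabla u(s, X_s)\bigr]^{\top} \sigma(s, X_s)\,dW_s

% Euler-type time discretization on 0 = t_0 < t_1 < \dots < t_N = T
u(t_{n+1}, X_{t_{n+1}}) \approx u(t_n, X_{t_n})
  - f\bigl(t_n, X_{t_n}, u(t_n, X_{t_n}), \sigma^{\top}\nabla u(t_n, X_{t_n})\bigr)\,\Delta t_n
  + \bigl[\nabla u(t_n, X_{t_n})\bigr]^{\top} \sigma(t_n, X_{t_n})\,\Delta W_n
```

In a scheme of this kind, the gradient term at each time step is the quantity a neural network would approximate, and the networks can be trained by minimizing the expected squared mismatch between the simulated terminal value and g(X_T).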
Affiliation(s)
- Jiequn Han
  - Program in Applied and Computational Mathematics, Princeton University, Princeton, NJ 08544
- Arnulf Jentzen
  - Seminar for Applied Mathematics, Department of Mathematics, ETH Zürich, 8092 Zürich, Switzerland
- Weinan E
  - Program in Applied and Computational Mathematics, Princeton University, Princeton, NJ 08544
  - Department of Mathematics, Princeton University, Princeton, NJ 08544
  - Beijing Institute of Big Data Research, Beijing 100871, China
10
Jiang L, Amir A, Morton JT, Heller R, Arias-Castro E, Knight R. Discrete False-Discovery Rate Improves Identification of Differentially Abundant Microbes. mSystems 2017; 2:e00092-17. [PMID: 29181446 DOI: 10.1128/mSystems.00092-17] [Citation(s) in RCA: 54] [Impact Index Per Article: 7.7] [Received: 07/25/2017] [Accepted: 10/29/2017] [Indexed: 12/21/2022]
Abstract
Differential abundance testing is a critical task in microbiome studies that is complicated by the sparsity of data matrices. Here we adapt for microbiome studies a solution from the field of gene expression analysis to produce a new method, discrete false-discovery rate (DS-FDR), that greatly improves the power to detect differential taxa by exploiting the discreteness of the data. Additionally, DS-FDR is relatively robust to the number of noninformative features, and thus removes the problem of filtering taxonomy tables by an arbitrary abundance threshold. We show by using a combination of simulations and reanalysis of nine real-world microbiome data sets that this new method outperforms existing methods at the differential abundance testing task, producing a false-discovery rate that is up to threefold more accurate, and halves the number of samples required to find a given difference (thus increasing the efficiency of microbiome experiments considerably). We therefore expect DS-FDR to be widely applied in microbiome studies. IMPORTANCE: DS-FDR can achieve higher statistical power to detect significant findings in sparse and noisy microbiome data compared to the commonly used Benjamini-Hochberg procedure and other FDR-controlling procedures.
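For context, the Benjamini-Hochberg procedure that this abstract uses as its baseline can be sketched in a few lines. This is the standard step-up rule, not the DS-FDR method itself; the function name and the example p-values are illustrative assumptions.

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected.

    Step-up rule: find the largest rank k such that the k-th smallest
    p-value is at most k * alpha / m, then reject the k smallest p-values.
    """
    m = len(pvalues)
    # Indices of the p-values in ascending order.
    order = sorted(range(m), key=lambda i: pvalues[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank * alpha / m:
            k_max = rank
    rejected = [False] * m
    for i in order[:k_max]:
        rejected[i] = True
    return rejected


pvals = [0.039, 0.001, 0.205, 0.008, 0.041]
decisions = benjamini_hochberg(pvals, alpha=0.05)
```

Because the rule is step-up, a p-value that fails its own threshold is still rejected whenever a larger p-value passes the threshold at its rank.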
11
Wang X, Wang M. Adaptive group bridge estimation for high-dimensional partially linear models. J Inequal Appl 2017; 2017:158. [PMID: 28725135 PMCID: PMC5493733 DOI: 10.1186/s13660-017-1432-x] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 01/05/2017] [Accepted: 06/16/2017] [Indexed: 06/07/2023]
Abstract
This paper studies group selection for the partially linear model with a diverging number of parameters. We propose an adaptive group bridge method and study the consistency, convergence rate and asymptotic distribution of the global adaptive group bridge estimator under regularity conditions. Simulation studies and a real example show the finite sample performance of our method.
Affiliation(s)
- Xiuli Wang
  - School of Statistics, Qufu Normal University, Jingxuan West Road, Qufu, 273165 P.R. China
- Mingqiu Wang
  - School of Statistics, Qufu Normal University, Jingxuan West Road, Qufu, 273165 P.R. China
12
Yin X, Levy D, Willinger C, Adourian A, Larson MG. Multiple imputation and analysis for high-dimensional incomplete proteomics data. Stat Med 2015; 35:1315-26. [PMID: 26565662 PMCID: PMC4777663 DOI: 10.1002/sim.6800] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.4] [Received: 03/02/2015] [Revised: 08/12/2015] [Accepted: 10/19/2015] [Indexed: 12/11/2022]
Abstract
Multivariable analysis of proteomics data using standard statistical models is hindered by the presence of incomplete data. We faced this issue in a nested case–control study of 135 incident cases of myocardial infarction and 135 pair‐matched controls from the Framingham Heart Study Offspring cohort. Plasma protein markers (K = 861) were measured on the case–control pairs (N = 135), and the majority of proteins had missing expression values for a subset of samples. In the setting of many more variables than observations (K ≫ N), we explored and documented the feasibility of multiple imputation approaches along with subsequent analysis of the imputed data sets. Initially, we selected proteins with complete expression data (K = 261) and randomly masked some values as the basis of simulation to tune the imputation and analysis process. We randomly shuffled proteins into several bins, performed multiple imputation within each bin, and followed up with stepwise selection using conditional logistic regression within each bin. This process was repeated hundreds of times. We determined the optimal method of multiple imputation, number of proteins per bin, and number of random shuffles using several performance statistics. We then applied this method to 544 proteins with incomplete expression data (≤40% missing values), from which we identified a panel of seven proteins that were jointly associated with myocardial infarction. © 2015 The Authors. Statistics in Medicine published by John Wiley & Sons Ltd.
Affiliation(s)
- Xiaoyan Yin
  - The Framingham Heart Study, National Heart, Lung, and Blood Institute, Framingham, MA, U.S.A.
  - Department of Biostatistics, School of Public Health, Boston University, Boston, MA, U.S.A.
  - Department of Cardiology, Boston University, Boston, MA, U.S.A.
- Daniel Levy
  - The Framingham Heart Study, National Heart, Lung, and Blood Institute, Framingham, MA, U.S.A.
  - Population Sciences Branch, Division of Intramural Research, National Heart, Lung, and Blood Institute, Boston, MA, U.S.A.
- Christine Willinger
  - The Framingham Heart Study, National Heart, Lung, and Blood Institute, Framingham, MA, U.S.A.
  - Population Sciences Branch, Division of Intramural Research, National Heart, Lung, and Blood Institute, Boston, MA, U.S.A.
- Martin G Larson
  - The Framingham Heart Study, National Heart, Lung, and Blood Institute, Framingham, MA, U.S.A.
  - Department of Biostatistics, School of Public Health, Boston University, Boston, MA, U.S.A.
  - Department of Mathematics and Statistics, Boston University, Boston, MA, U.S.A.
13
Liu H, Wang L, Zhao T. Calibrated Multivariate Regression with Application to Neural Semantic Basis Discovery. J Mach Learn Res 2015; 16:1579-1606. [PMID: 28316509 PMCID: PMC5354374] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/06/2023]
Abstract
We propose a calibrated multivariate regression method named CMR for fitting high dimensional multivariate regression models. Compared with existing methods, CMR calibrates regularization for each regression task with respect to its noise level so that it simultaneously attains improved finite-sample performance and tuning insensitiveness. Theoretically, we provide sufficient conditions under which CMR achieves the optimal rate of convergence in parameter estimation. Computationally, we propose an efficient smoothed proximal gradient algorithm with a worst-case numerical rate of convergence O(1/ϵ), where ϵ is a pre-specified accuracy of the objective function value. We conduct thorough numerical simulations to illustrate that CMR consistently outperforms other high dimensional multivariate regression methods. We also apply CMR to solve a brain activity prediction problem and find that it is as competitive as a handcrafted model created by human experts. The R package camel implementing the proposed method is available on the Comprehensive R Archive Network http://cran.r-project.org/web/packages/camel/.
Affiliation(s)
- Han Liu
- Department of Operations Research and Financial Engineering, Princeton University, NJ 08544, USA
- Lie Wang
- Department of Mathematics, Massachusetts Institute of Technology, Cambridge MA 02139, USA
|
14
|
Lee W, Liu Y. Joint Estimation of Multiple Precision Matrices with Common Structures. J Mach Learn Res 2015; 16:1035-1062. [PMID: 26568704 PMCID: PMC4643293] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/05/2023]
Abstract
Estimation of inverse covariance matrices, known as precision matrices, is important in various areas of statistical analysis. In this article, we consider estimation of multiple precision matrices sharing some common structures. In this setting, estimating each precision matrix separately can be suboptimal as it ignores potential common structures. This article proposes a new approach to parameterize each precision matrix as a sum of common and unique components and estimate multiple precision matrices in a constrained ℓ1 minimization framework. We establish both estimation and selection consistency of the proposed estimator in the high dimensional setting. The proposed estimator achieves a faster convergence rate for the common structure in certain cases. Our numerical examples demonstrate that our new estimator can perform better than several existing methods in terms of the entropy loss and Frobenius loss. An application to a glioblastoma cancer data set reveals some interesting gene networks across multiple cancer subtypes.
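As a rough illustration of the parameterization mentioned in the abstract above (each precision matrix written as a common component plus a unique component), the following Python sketch builds two precision matrices from a shared part and group-specific parts, then inverts them to recover the covariance matrices. All names and numbers are hypothetical, and the sketch does not implement the authors' estimator; it only shows the decomposition and the precision/covariance relationship.

```python
def add_2x2(a, b):
    """Element-wise sum of two 2x2 matrices given as nested lists."""
    return [[a[i][j] + b[i][j] for j in range(2)] for i in range(2)]


def inverse_2x2(m):
    """Closed-form inverse of a 2x2 matrix."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]


# Hypothetical values, chosen only so that every sum is positive definite.
common = [[2.0, 0.5], [0.5, 2.0]]          # structure shared by all groups
unique = {
    "group_a": [[0.0, 0.3], [0.3, 0.0]],   # group-specific deviation
    "group_b": [[0.0, -0.3], [-0.3, 0.0]],
}

# Precision matrix of each group = common part + unique part.
precision = {g: add_2x2(common, u) for g, u in unique.items()}

# The covariance matrix is the inverse of the precision matrix.
covariance = {g: inverse_2x2(p) for g, p in precision.items()}
```

Estimating `common` once and only the deviations per group is what lets information be pooled across groups, which is the motivation the abstract gives for joint estimation.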
Affiliation(s)
- Wonyul Lee
- Department of Statistics and Operations Research, University of North Carolina, Chapel Hill, NC 27599-3260, USA
- Yufeng Liu
- Department of Statistics and Operations Research, Department of Genetics, Department of Biostatistics, Carolina Center for Genome Sciences, University of North Carolina, Chapel Hill, NC 27599-3260, USA
|
15
|
Murphy TE, McAvay G, Carriero NJ, Gross CP, Tinetti ME, Allore HG, Lin H. Deaths observed in Medicare beneficiaries: average attributable fraction and its longitudinal extension for many diseases. Stat Med 2012; 31:3313-9. [PMID: 22415597 PMCID: PMC3719164 DOI: 10.1002/sim.5337] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Received: 01/11/2012] [Accepted: 01/11/2012] [Indexed: 11/11/2022]
Abstract
Calculating the longitudinal extension of the average attributable fraction (LE-AAF) for many risk factors (RFs) requires a two-stage computational process using only those combinations of RFs observed in the dataset. We first screen candidate RFs in a Cox model and, assuming piecewise constant hazards, use pooled logistic regression to model the probability of death as a function of combinations of selected RFs. We average the iterative differencing of the attributable fractions calculated for all overlapping subsets of co-occurring RFs to obtain an LE-AAF for each RF that is additive and symmetrical. We illustrate by partitioning the additive proportions of death from 10 different groupings of acute and chronic diseases, on a national sample of older persons from the US (Medicare Beneficiary Survey) over a 4-year period, and compare with results reported by the National Center for Healthcare Statistics. We conclude that careful screening of RFs with analysis restricted to extant combinations greatly reduces computational burden. LE-AAF accounted for a cumulative total of 66% of the deaths in our sample, compared with the 83% accounted for by the National Center for Healthcare Statistics.
Affiliation(s)
- T E Murphy
- Department of Internal Medicine and the Program on Aging, Yale University School of Medicine, New Haven, CT, USA.
|
16
|
Abstract
A new comprehensive procedure for statistical analysis of two-dimensional polyacrylamide gel electrophoresis (2D PAGE) images is proposed, including protein region quantification, normalization and statistical analysis. Protein regions are defined by the master watershed map that is obtained from the mean gel. By working with these protein regions, the approach bypasses the current bottleneck in the analysis of 2D PAGE images: it does not require spot matching. Background correction is implemented in each protein region by local segmentation. Two-dimensional locally weighted smoothing (LOESS) is proposed to remove any systematic bias after quantification of protein regions. Proteins are separated into mutually independent sets based on detected correlations, and a multivariate analysis is used on each set to detect the group effect. A strategy for multiple hypothesis testing based on this multivariate approach combined with the usual Benjamini-Hochberg FDR procedure is formulated and applied to the differential analysis of 2D PAGE images. Each step in the analytical protocol is shown by using an actual dataset. The effectiveness of the proposed methodology is shown using simulated gels in comparison with the commercial software packages PDQuest and Dymension. We also introduce a new procedure for simulating gel images.
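The abstract above refers to the usual Benjamini-Hochberg FDR procedure. For readers unfamiliar with it, here is a minimal Python sketch of the standard step-up rule on a plain list of p-values. The function name and the example p-values are hypothetical, and the sketch covers only the generic procedure, not the multivariate testing strategy the authors combine it with.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans, True where the hypothesis is rejected.

    Standard step-up rule: sort the m p-values, find the largest rank k
    with p_(k) <= (k / m) * alpha, and reject the k smallest p-values.
    """
    m = len(p_values)
    # Indices of the p-values in ascending order of p-value.
    order = sorted(range(m), key=lambda i: p_values[i])

    k_max = 0  # largest 1-based rank that satisfies the threshold
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            k_max = rank

    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values, one per tested protein region.
example_p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
example_rejections = benjamini_hochberg(example_p, alpha=0.05)
```

Because the rule is step-up, a p-value that misses its own threshold is still rejected whenever a larger p-value meets the threshold at its rank.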
Affiliation(s)
- Feng Li
- Department of Mathematics and Statistics, University of Maryland, Baltimore County, Baltimore, Maryland, USA
- Françoise Seillier-Moiseiwitsch
- Infectious Disease Clinical Research Program, Department of Preventive Medicine and Biometrics, Uniformed Services University of the Health Sciences, Bethesda, Maryland, USA
- Valeriy R. Korostyshevskiy
- Department of Biostatistics, Bioinformatics, and Biomathematics, Georgetown University Medical Center, Washington, DC, USA
|