1
Guo W, Li X, Qin K, Zhang P, He J, Liu Y, Yang X, Wu S. Nanopore sequencing demonstrates the roles of spermatozoal DNA N6-methyladenine in mediating transgenerational lipid metabolism disorder induced by excessive folate consumption. Poult Sci 2024; 103:103953. [PMID: 38945000 PMCID: PMC11267017 DOI: 10.1016/j.psj.2024.103953] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/07/2024] [Revised: 05/31/2024] [Accepted: 06/03/2024] [Indexed: 07/02/2024]
Abstract
Increased consumption of folic acid is prevalent due to its beneficial effects, but growing evidence points to side effects of excessive dietary folate intake. The effects of excessive paternal folic acid consumption on offspring and its transgenerational inheritance mechanism have not been elucidated. We hypothesize that excessive folic acid consumption will alter sperm DNA N6-methyladenine (6mA) and 5-methylcytosine (5mC) methylation and heritably influence offspring metabolic homeostasis. Here, we fed roosters either a folic acid-control or a folic acid-excess diet throughout life. Chronic excessive paternal folic acid supplementation increased hepatic lipogenesis and lipid accumulation but reduced lipolysis in both the roosters and their offspring, which was further confirmed to be induced by one-carbon metabolism inhibition and gene expression alteration associated with the peroxisome proliferator-activated receptor pathway. Based on the spermatozoal genome-wide DNA methylome identified by Nanopore sequencing, multi-omics association analysis of spermatozoal and hepatic DNA methylome, transcriptome, and metabolome suggested that differential spermatozoal DNA 6mA and 5mC methylation could be involved in regulating lipid metabolism-related gene expression in offspring chickens. This model suggests that sperm DNA N6-methyladenine and 5-methylcytosine methylation were involved in epigenetic transmission and that paternal dietary excess folic acid leads to hepatic lipid accumulation in offspring.
Affiliation(s)
- Wei Guo
- Jiangsu Institute of Poultry Science, Yangzhou, Jiangsu Province, 225125, China; College of Animal Science and Technology, Northwest A&F University, Yangling, Shaanxi, 712100, China
- Xinyi Li
- College of Animal Science and Technology, Northwest A&F University, Yangling, Shaanxi, 712100, China; Department of Medicine, Karolinska Institutet, Solna, Stockholm, 17165, Sweden
- Kailong Qin
- College of Animal Science and Technology, Northwest A&F University, Yangling, Shaanxi, 712100, China
- Peilin Zhang
- College of Animal Science and Technology, Northwest A&F University, Yangling, Shaanxi, 712100, China
- Jinhui He
- College of Animal Science and Technology, Northwest A&F University, Yangling, Shaanxi, 712100, China
- Yanli Liu
- College of Animal Science and Technology, Northwest A&F University, Yangling, Shaanxi, 712100, China
- Xiaojun Yang
- College of Animal Science and Technology, Northwest A&F University, Yangling, Shaanxi, 712100, China
- Shengru Wu
- College of Animal Science and Technology, Northwest A&F University, Yangling, Shaanxi, 712100, China; Department of Microbiology, Tumor and Cell Biology, Karolinska Institutet, Stockholm, 17165, Sweden
2
Wang H, Li N, Zhou Y, Yan J, Jiang B, Kong L, Yan X. Fast Fusion Clustering via Double Random Projection. Entropy (Basel, Switzerland) 2024; 26:376. [PMID: 38785624 PMCID: PMC11119451 DOI: 10.3390/e26050376] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/12/2024] [Revised: 04/25/2024] [Accepted: 04/25/2024] [Indexed: 05/25/2024]
Abstract
In unsupervised learning, clustering is a common starting point for data processing. The convex or concave fusion clustering method is a novel approach that is more stable and accurate than traditional methods such as k-means and hierarchical clustering. However, the optimization algorithm used with this method can be slowed down significantly by the complexity of the fusion penalty, which increases the computational burden. This paper introduces a random projection ADMM algorithm based on the Bernoulli distribution and develops a double random projection ADMM method for high-dimensional fusion clustering. These new approaches substantially outperform the classical ADMM algorithm: they increase computational speed by reducing complexity and improve clustering accuracy by using multiple random projections under a new evaluation criterion. We also demonstrate the convergence of our new algorithm and test its performance on both simulated and real data examples.
Affiliation(s)
- Hongni Wang
- School of Statistics and Mathematics, Shandong University of Finance and Economics, Jinan 250014, China
- Na Li
- School of Statistics and Mathematics, Shandong University of Finance and Economics, Jinan 250014, China
- Yanqiu Zhou
- School of Science, Guangxi University of Science and Technology, Liuzhou 545006, China
- Jingxin Yan
- Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China
- Bei Jiang
- Department of Mathematical and Statistical Sciences, University of Alberta, Edmonton, AB T6G 2G1, Canada
- Linglong Kong
- Department of Mathematical and Statistical Sciences, University of Alberta, Edmonton, AB T6G 2G1, Canada
- Xiaodong Yan
- Zhongtai Securities Institute for Financial Studies, Shandong University, Jinan 250100, China
3
Zhang W, Wendt C, Bowler R, Hersh CP, Safo SE. Robust integrative biclustering for multi-view data. Stat Methods Med Res 2022; 31:2201-2216. [PMID: 36113157 PMCID: PMC10153449 DOI: 10.1177/09622802221122427] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/16/2022]
Abstract
In many biomedical studies, multiple views of data (e.g. genomics, proteomics) are available, and a particular interest might be the detection of sample subgroups characterized by specific groups of variables. Biclustering methods are well-suited for this problem as they assume that specific groups of variables might be relevant only to specific groups of samples. Many biclustering methods exist for detecting row-column clusters in a view, but few methods exist for data from multiple views. The few existing algorithms are heavily dependent on regularization parameters for getting row-column clusters, and they impose an unnecessary burden on users, thus limiting their use in practice. We extend an existing biclustering method based on sparse singular value decomposition for single-view data to data from multiple views. Our method, integrative sparse singular value decomposition (iSSVD), incorporates stability selection to control Type I error rates, estimates the probability of samples and variables to belong to a bicluster, finds stable biclusters, and results in interpretable row-column associations. Simulations and real data analyses show that integrative sparse singular value decomposition outperforms several other single- and multi-view biclustering methods and is able to detect meaningful biclusters. iSSVD is a user-friendly, computationally efficient algorithm that will be useful in many disease subtyping applications.
Affiliation(s)
- Weijie Zhang
- Division of Biostatistics, University of Minnesota, MN, USA
- Christine Wendt
- Division of Pulmonary, Allergy and Critical Care, University of Minnesota, MN, USA
- Russel Bowler
- Division of Pulmonary, Critical Care and Sleep Medicine, Department of Medicine, National Jewish Health, Denver, USA
- Craig P Hersh
- Channing Division of Network Medicine, Brigham and Women's Hospital, Harvard Medical School, USA
- Sandra E Safo
- Division of Biostatistics, University of Minnesota, MN, USA
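As a rough illustration of the sparse singular value decomposition that the iSSVD abstract builds on, the sketch below extracts one bicluster from a single data view by alternately soft-thresholding the left and right singular vectors. It is not the authors' iSSVD: stability selection and the multi-view integration are omitted, and the function names (`soft_threshold`, `sparse_rank1_svd`), the fixed thresholds, and the planted-bicluster data are all assumptions made for illustration.

```python
import numpy as np


def soft_threshold(z, lam):
    """Shrink each entry of z toward zero by lam; entries with |z| <= lam become 0."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)


def _normalize(vec):
    """Scale vec to unit length, leaving an all-zero vector unchanged."""
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec


def sparse_rank1_svd(X, lam_u, lam_v, n_iter=50):
    """Hypothetical helper: rank-1 sparse SVD by alternating soft-thresholding."""
    # Start from the leading left singular vector of the ordinary SVD.
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    u = U[:, 0]
    v = np.zeros(X.shape[1])
    for _ in range(n_iter):
        # Update the column loadings, then the row loadings, each with shrinkage.
        v = _normalize(soft_threshold(X.T @ u, lam_v))
        u = _normalize(soft_threshold(X @ v, lam_u))
    return u, v


# Toy data (assumed): weak Gaussian noise plus one planted 10 x 8 bicluster.
rng = np.random.default_rng(0)
X = 0.5 * rng.standard_normal((40, 30))
X[:10, :8] += 5.0

u, v = sparse_rank1_svd(X, lam_u=3.0, lam_v=3.0)
bicluster_rows = np.flatnonzero(u)  # samples with nonzero loading
bicluster_cols = np.flatnonzero(v)  # variables with nonzero loading
```

The nonzero entries of `u` and `v` jointly define the bicluster. In practice the thresholds would have to be chosen from the data, which is the dependence on regularization parameters that the abstract says stability selection is meant to reduce.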
4
Duan R, Gao L, Gao Y, Hu Y, Xu H, Huang M, Song K, Wang H, Dong Y, Jiang C, Zhang C, Jia S. Evaluation and comparison of multi-omics data integration methods for cancer subtyping. PLoS Comput Biol 2021; 17:e1009224. [PMID: 34383739 PMCID: PMC8384175 DOI: 10.1371/journal.pcbi.1009224] [Citation(s) in RCA: 34] [Impact Index Per Article: 8.5] [Received: 02/27/2021] [Revised: 08/24/2021] [Accepted: 06/28/2021] [Indexed: 11/18/2022]
Abstract
Computational integrative analysis has become a significant approach in the data-driven exploration of biological problems. Many integration methods for cancer subtyping have been proposed, but evaluating these methods has become a complicated problem due to the lack of gold standards. Moreover, questions of practical importance remain to be addressed regarding the impact of selecting appropriate data types and combinations on the performance of integrative studies. Here, we constructed three classes of benchmarking datasets of nine cancers in TCGA by considering all the eleven combinations of four multi-omics data types. Using these datasets, we conducted a comprehensive evaluation of ten representative integration methods for cancer subtyping in terms of accuracy measured by combining both clustering accuracy and clinical significance, robustness, and computational efficiency. We subsequently investigated the influence of different omics data on cancer subtyping and the effectiveness of their combinations. Refuting the widely held intuition that incorporating more types of omics data always produces better results, our analyses showed that there are situations where integrating more omics data negatively impacts the performance of integration methods. Our analyses also suggested several effective combinations for most cancers under our studies, which may be of particular interest to researchers in omics data analysis.
Cancer is one of the most heterogeneous diseases, characterized by diverse morphological, phenotypic, and genomic profiles between tumors and their subtypes. Identifying cancer subtypes can help patients receive precise treatments. With the development of high-throughput technologies, genomics, epigenomics, and transcriptomics data have been generated for large cancer patient cohorts. It is believed that the more omics data we use, the more accurate the identification of cancer subtypes. To examine this assumption, we first constructed three classes of benchmarking datasets to conduct a comprehensive evaluation and comparison of ten representative multi-omics data integration methods for cancer subtyping by considering their accuracy, robustness, and computational efficiency. Then, we investigated the influence of different omics data and their various combinations on the effectiveness of cancer subtyping. Our analyses showed that there are situations where integrating more omics data negatively impacts the performance of integration methods. We hope that our work may help researchers choose a proper method and an effective data combination when identifying cancer subtypes using data integration methods.
Affiliation(s)
- Ran Duan
- School of Computer Science and Technology, Xidian University, Xi’an, China
- Lin Gao
- School of Computer Science and Technology, Xidian University, Xi’an, China
- Yong Gao
- Department of Computer Science, The University of British Columbia Okanagan, Kelowna, British Columbia, Canada
- Yuxuan Hu
- School of Computer Science and Technology, Xidian University, Xi’an, China
- Han Xu
- School of Computer Science and Technology, Xidian University, Xi’an, China
- Mingfeng Huang
- School of Computer Science and Technology, Xidian University, Xi’an, China
- Kuo Song
- School of Computer Science and Technology, Xidian University, Xi’an, China
- Hongda Wang
- School of Computer Science and Technology, Xidian University, Xi’an, China
- Yongqiang Dong
- School of Computer Science and Technology, Xidian University, Xi’an, China
- Chaoqun Jiang
- School of Computer Science and Technology, Xidian University, Xi’an, China
- Chenxing Zhang
- School of Computer Science and Technology, Xidian University, Xi’an, China
- Songwei Jia
- School of Computer Science and Technology, Xidian University, Xi’an, China
5
Park JY, Lock EF. Integrative factorization of bidimensionally linked matrices. Biometrics 2020; 76:61-74. [PMID: 31444786 PMCID: PMC7036334 DOI: 10.1111/biom.13141] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Received: 02/06/2019] [Accepted: 08/19/2019] [Indexed: 02/02/2023]
Abstract
Advances in molecular "omics" technologies have motivated new methodologies for the integration of multiple sources of high-content biomedical data. However, most statistical methods for integrating multiple data matrices only consider data shared vertically (one cohort on multiple platforms) or horizontally (different cohorts on a single platform). This is limiting for data that take the form of bidimensionally linked matrices (e.g., multiple cohorts measured on multiple platforms), which are increasingly common in large-scale biomedical studies. In this paper, we propose bidimensional integrative factorization (BIDIFAC) for integrative dimension reduction and signal approximation of bidimensionally linked data matrices. Our method factorizes data into (a) globally shared, (b) row-shared, (c) column-shared, and (d) single-matrix structural components, facilitating the investigation of shared and unique patterns of variability. For estimation, we use a penalized objective function that extends the nuclear norm penalization for a single matrix. As an alternative to the complicated rank selection problem, we use results from random matrix theory to choose tuning parameters. We apply our method to integrate two genomics platforms (messenger RNA and microRNA expression) across two sample cohorts (tumor samples and normal tissue samples) using the breast cancer data from The Cancer Genome Atlas. We provide R code for fitting BIDIFAC, imputing missing values, and generating simulated data.
Affiliation(s)
- Jun Young Park
- Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minnesota
- Eric F Lock
- Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minnesota
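The single-matrix nuclear norm penalization that the BIDIFAC abstract says it extends has a closed-form solution: soft-threshold the singular values. The sketch below shows only that single-matrix building block, not BIDIFAC itself. The function name `nuclear_norm_denoise` is hypothetical, and the threshold `sqrt(m) + sqrt(n)` for unit-variance noise is an assumption used here as one example of a tuning choice motivated by random matrix theory.

```python
import numpy as np


def nuclear_norm_denoise(X, lam):
    """Minimize 0.5 * ||X - S||_F^2 + lam * ||S||_* over S.

    The minimizer keeps the singular vectors of X and shrinks each
    singular value by lam, truncating at zero.
    """
    U, d, Vt = np.linalg.svd(X, full_matrices=False)
    d_shrunk = np.maximum(d - lam, 0.0)
    return (U * d_shrunk) @ Vt, d_shrunk


# Toy data (assumed): a rank-2 signal plus unit-variance Gaussian noise.
rng = np.random.default_rng(1)
m, n, rank = 60, 40, 2
signal = 3.0 * rng.standard_normal((m, rank)) @ rng.standard_normal((rank, n))
X = signal + rng.standard_normal((m, n))

# Assumed threshold: roughly the largest singular value of an m x n
# matrix of independent unit-variance noise.
lam = np.sqrt(m) + np.sqrt(n)
S_hat, d_shrunk = nuclear_norm_denoise(X, lam)
estimated_rank = int(np.sum(d_shrunk > 0))
```

Because the threshold sits near the edge of the noise spectrum, most noise-only components are set to zero and the rank of the estimate falls out of the shrinkage rather than being selected separately, which is the motivation the abstract gives for avoiding an explicit rank selection step.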
6
Massive integrative gene set analysis enables functional characterization of breast cancer subtypes. J Biomed Inform 2019; 93:103157. [PMID: 30928514 DOI: 10.1016/j.jbi.2019.103157] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Received: 12/22/2018] [Revised: 03/11/2019] [Accepted: 03/22/2019] [Indexed: 01/31/2023]
Abstract
The availability of large-scale repositories and integrated cancer genome efforts has created unprecedented opportunities to study and describe cancer biology. In this sense, the aim of translational researchers is the integration of multiple omics data to achieve a better identification of homogeneous subgroups of patients in order to develop adequate diagnostic and treatment strategies from the personalized medicine perspective. So far, existing integrative methods have grouped together omics data information, leaving out individual omics data phenotypic interpretation. Here, we present the Massive and Integrative Gene Set Analysis (MIGSA) R package. This tool can analyze several high-throughput experiments in a comprehensive way through a functional analysis strategy, relating a phenotype to its biological function counterpart defined by means of gene sets. By simultaneously querying multiple omics data from the same or different groups of patients, common and specific functional patterns for each studied phenotype can be obtained. The usefulness of MIGSA was demonstrated by applying the package to functionally characterize the intrinsic breast cancer PAM50 subtypes. For each subtype, specific functional transcriptomic profiles and gene sets enriched by transcriptomic and proteomic data were identified. To achieve this, transcriptomic and proteomic data from 28 datasets were analyzed using MIGSA. As a result, enriched gene sets and important genes were consistently found as related to a specific subtype across experiments or data types and thus can be used as molecular signature biomarkers.
7
Yang Z, Michailidis G. Quantifying heterogeneity of expression data based on principal components. Bioinformatics 2019; 35:553-559. [PMID: 30060088 DOI: 10.1093/bioinformatics/bty671] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 10/24/2017] [Revised: 07/05/2018] [Accepted: 07/27/2018] [Indexed: 11/12/2022]
Abstract
MOTIVATION: The diversity of biological omics data provides richness of information, but also presents an analytic challenge. While there has been much methodological and theoretical development on the statistical handling of large volumes of biological data, far less attention has been devoted to characterizing their veracity and variability. RESULTS: We propose a method of statistically quantifying heterogeneity among multiple groups of datasets, derived from different omics modalities over various experimental and/or disease conditions. It draws upon strategies from analysis of variance and principal component analysis in order to reduce dimensionality of the variability across multiple data groups. The resulting hypothesis-based inference procedure is demonstrated with synthetic and real data from a cell line study of growth factor responsiveness based on a factorial experimental design. AVAILABILITY AND IMPLEMENTATION: Source code and datasets are freely available at https://github.com/yangzi4/gPCA. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Zi Yang
- Department of Statistics, University of Michigan, Ann Arbor, MI, USA
8
Liu Y, Wang R, He X, Dai H, Betts RJ, Marionnet C, Bernerd F, Planel E, Wang X, Nocairi H, Cai Z, Qiu J, Ding C. Validation of a predictive method for sunscreen formula evaluation using gene expression analysis in a Chinese reconstructed full-thickness skin model. Int J Cosmet Sci 2019; 41:147-155. [PMID: 30719735 DOI: 10.1111/ics.12518] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Received: 08/22/2018] [Revised: 01/17/2019] [Accepted: 01/30/2019] [Indexed: 11/30/2022]
Abstract
OBJECTIVE: This study aimed to establish a predictive in vitro method for assessing the photoprotective properties of sunscreens using a reconstructed full-thickness skin model. MATERIALS AND METHODS: A full-thickness skin model reconstructed with human fibroblasts and keratinocytes isolated from Chinese skin was exposed to daily UV radiation (DUVR). We examined the transcriptomic response, identifying genes for which expression was modulated by DUVR in a dose-dependent manner. We then validated the methodology for efficacy evaluation of different sunscreen formulas. RESULTS: The reconstructed skin model was histologically consistent with human skin, and upon DUVR exposure, the constituent fibroblasts and keratinocytes exhibited transcriptomic alterations in pathways associated with oxidative stress, inflammation and extracellular matrix remodelling. When used to evaluate sunscreen protection on the model, the observed level of protection from UV-induced gene expression was consistent with the corresponding protection factors determined clinically and allowed for statistical ranking of sunscreen efficacy. CONCLUSIONS: Within this study we show that quantification of gene modulation within the reconstructed skin model is a biologically relevant approach with sensitivity and predictability to evaluate photoprotection products.
Affiliation(s)
- Y Liu
- L'Oréal Research and Innovation, 550 Jin Yu Road, Pudong, Shanghai, P.R. China
- R Wang
- L'Oréal Research and Innovation, 550 Jin Yu Road, Pudong, Shanghai, P.R. China
- X He
- L'Oréal Research and Innovation, 550 Jin Yu Road, Pudong, Shanghai, P.R. China
- H Dai
- L'Oréal Research and Innovation, 550 Jin Yu Road, Pudong, Shanghai, P.R. China
- R J Betts
- L'Oréal Research and Innovation, 550 Jin Yu Road, Pudong, Shanghai, P.R. China
- C Marionnet
- L'Oréal Research and Innovation, 1 Avenue Eugene Schueller, 93601, Aulnay-sous-Bois, France
- F Bernerd
- L'Oréal Research and Innovation, 1 Avenue Eugene Schueller, 93601, Aulnay-sous-Bois, France
- E Planel
- L'Oréal Research and Innovation, 1 Avenue Eugene Schueller, 93601, Aulnay-sous-Bois, France
- X Wang
- L'Oréal Research and Innovation, 550 Jin Yu Road, Pudong, Shanghai, P.R. China
- H Nocairi
- L'Oréal Research and Innovation, 1 Avenue Eugene Schueller, 93601, Aulnay-sous-Bois, France
- Z Cai
- L'Oréal Research and Innovation, 550 Jin Yu Road, Pudong, Shanghai, P.R. China
- J Qiu
- L'Oréal Research and Innovation, 550 Jin Yu Road, Pudong, Shanghai, P.R. China
- C Ding
- L'Oréal Research and Innovation, 550 Jin Yu Road, Pudong, Shanghai, P.R. China
9
Jain Y, Ding S, Qiu J. Sliced inverse regression for integrative multi-omics data analysis. Stat Appl Genet Mol Biol 2019; 18. [PMID: 30685747 DOI: 10.1515/sagmb-2018-0028] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 12/21/2022]
Abstract
Advances in next-generation sequencing, transcriptomics, proteomics and other high-throughput technologies have enabled simultaneous measurement of multiple types of genomic data for cancer samples. Together, these data may reveal new biological insights compared with analyzing a single genomic data type. This study proposes a novel use of a supervised dimension reduction method, called sliced inverse regression, for multi-omics data analysis to improve prediction over a single data type analysis. The study further proposes an integrative sliced inverse regression method (integrative SIR) for simultaneous analysis of multiple omics data types of cancer samples, including miRNA, mRNA and proteomics, to achieve integrative dimension reduction and to further improve prediction performance. Numerical results show that integrative analysis of multi-omics data is beneficial as compared to single data source analysis, and more importantly, that supervised dimension reduction methods possess advantages in integrative data analysis in terms of classification and prediction as compared to unsupervised dimension reduction methods.
Affiliation(s)
- Yashita Jain
- Center for Bioinformatics and Computational Biology, University of Delaware, 15 Innovation Way, Newark, DE 19711, USA
- Shanshan Ding
- Center for Bioinformatics and Computational Biology, University of Delaware, 15 Innovation Way, Newark, DE 19711, USA; Department of Applied Economics and Statistics, University of Delaware, 531 S College Ave., Newark, DE 19711, USA
- Jing Qiu
- Center for Bioinformatics and Computational Biology, University of Delaware, 15 Innovation Way, Newark, DE 19711, USA; Department of Applied Economics and Statistics, University of Delaware, 531 S College Ave., Newark, DE 19711, USA
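For readers unfamiliar with sliced inverse regression, the sketch below implements the classical single-dataset version: whiten the predictors, slice the samples by the sorted response, and take the leading eigenvectors of the covariance of the slice means. It is not the integrative SIR proposed in the paper, and it assumes more samples than variables so that the covariance matrix can be inverted. The function name, slice count, and simulated data are illustrative assumptions.

```python
import numpy as np


def sliced_inverse_regression(X, y, n_slices=10, n_directions=1):
    """Classical SIR: estimate directions b such that y depends on X through X @ b."""
    n, p = X.shape
    mean = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)

    # Whiten the predictors: Z = (X - mean) @ cov^(-1/2).
    evals, evecs = np.linalg.eigh(cov)
    inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
    Z = (X - mean) @ inv_sqrt

    # Slice samples by sorted response and accumulate the weighted
    # covariance of the per-slice means of Z.
    order = np.argsort(y)
    M = np.zeros((p, p))
    for idx in np.array_split(order, n_slices):
        slice_mean = Z[idx].mean(axis=0)
        M += (len(idx) / n) * np.outer(slice_mean, slice_mean)

    # Leading eigenvectors of M, mapped back to the original scale.
    _, vecs = np.linalg.eigh(M)
    top = vecs[:, ::-1][:, :n_directions]
    B = inv_sqrt @ top
    return B / np.linalg.norm(B, axis=0)


# Simulated example (assumed): y depends on X only through X @ beta,
# via a nonlinear monotone link.
rng = np.random.default_rng(2)
n, p = 2000, 6
beta = np.array([1.0, -1.0, 0.5, 0.0, 0.0, 0.0])
X = rng.standard_normal((n, p))
y = (X @ beta) ** 3 + 0.5 * rng.standard_normal(n)

B = sliced_inverse_regression(X, y)
cosine = abs(B[:, 0] @ beta) / np.linalg.norm(beta)  # alignment with the true direction
```

Unlike principal component analysis, the estimated direction uses the response `y`, which is the sense in which the abstract calls the method supervised. The sign of the recovered direction is arbitrary, so alignment is measured with an absolute cosine.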
10
Melvin RL, Godwin RC, Xiao J, Thompson WG, Berenhaut KS, Salsbury FR. Uncovering Large-Scale Conformational Change in Molecular Dynamics without Prior Knowledge. J Chem Theory Comput 2016; 12:6130-6146. [PMID: 27802394 PMCID: PMC5719493 DOI: 10.1021/acs.jctc.6b00757] [Citation(s) in RCA: 39] [Impact Index Per Article: 4.3] [Indexed: 12/23/2022]
Abstract
As the length of molecular dynamics (MD) trajectories grows with increasing computational power, so does the importance of clustering methods for partitioning trajectories into conformational bins. Of the methods available, the vast majority require users to either have some a priori knowledge about the system to be clustered or to tune clustering parameters through trial and error. Here we present non-parametric uses of two modern clustering techniques suitable for first-pass investigation of an MD trajectory. Being non-parametric, these methods require neither prior knowledge nor parameter tuning. The first method, HDBSCAN, is fast relative to other popular clustering methods and is able to group unstructured or intrinsically disordered systems (such as intrinsically disordered proteins, or IDPs) into bins that represent global conformational shifts. HDBSCAN is also useful for determining the overall stability of a system, as it tends to group stable systems into one or two bins, and for identifying transition events between metastable states. The second method, iMWK-Means, with explicit rescaling followed by K-Means, while slower than HDBSCAN, performs well with stable, structured systems such as folded proteins and is able to identify higher resolution details such as changes in relative position of secondary structural elements. Used in conjunction, these clustering methods allow a user to discern quickly and without prior knowledge the stability of a simulated system and identify both local and global conformational changes.
Affiliation(s)
- Ryan L. Melvin
- Department of Physics, Wake Forest University, Winston-Salem, North Carolina 27109, United States
- Ryan C. Godwin
- Department of Physics, Wake Forest University, Winston-Salem, North Carolina 27109, United States
- Jiajie Xiao
- Department of Physics, Wake Forest University, Winston-Salem, North Carolina 27109, United States
- William G. Thompson
- Department of Physics, Wake Forest University, Winston-Salem, North Carolina 27109, United States
- Kenneth S. Berenhaut
- Department of Mathematics & Statistics, Wake Forest University, Winston-Salem, North Carolina 27109, United States
- Freddie R. Salsbury
- Department of Physics, Wake Forest University, Winston-Salem, North Carolina 27109, United States
11
Wu C, Kwon S, Shen X, Pan W. A New Algorithm and Theory for Penalized Regression-based Clustering. Journal of Machine Learning Research: JMLR 2016; 17:188. [PMID: 31662706 PMCID: PMC6818515] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/10/2023]
Abstract
Clustering is unsupervised and exploratory in nature. Yet, it can be performed through penalized regression with grouping pursuit, as demonstrated in Pan et al. (2013). In this paper, we develop a more efficient algorithm for scalable computation and a new theory of clustering consistency for the method. This algorithm, called DC-ADMM, combines difference of convex (DC) programming with the alternating direction method of multipliers (ADMM). This algorithm is shown to be more computationally efficient than the quadratic penalty based algorithm of Pan et al. (2013) because of the former's closed-form updating formulas. Numerically, we compare the DC-ADMM algorithm with the quadratic penalty algorithm to demonstrate its utility and scalability. Theoretically, we establish a finite-sample mis-clustering error bound for penalized regression based clustering with the L0-constrained regularization in a general setting. On this ground, we provide conditions for clustering consistency of the penalized clustering method. As an end product, we make the R package prclust, which implements PRclust with various loss and grouping penalty functions, available on GitHub and CRAN.
Affiliation(s)
- Xiaotong Shen
- School of Statistics, University of Minnesota, Minneapolis, MN 55455, USA
- Wei Pan
- WP is the corresponding author.
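To make the penalized regression view of clustering in the abstract above concrete, one common way to write it is given below. This is a sketch under stated assumptions rather than the paper's exact formulation: it assumes a squared-error loss and a truncated lasso penalty (TLP) as the grouping penalty, and the symbols x_i (observations), mu_i (per-observation centroids), lambda, tau, and theta_ij are notation chosen here. Observations whose estimated centroids coincide are assigned to the same cluster.

```latex
% Assumed objective: squared-error loss plus a pairwise grouping penalty.
\min_{\mu_1,\dots,\mu_n}\;
\frac{1}{2}\sum_{i=1}^{n}\lVert x_i-\mu_i\rVert_2^2
+\lambda\sum_{i<j}\mathrm{TLP}\bigl(\lVert\mu_i-\mu_j\rVert_2;\tau\bigr),
\qquad
\mathrm{TLP}(a;\tau)=\min(\lvert a\rvert,\tau)

% Difference-of-convex (DC) split of the nonconvex penalty, for a >= 0:
% both a and max(a - tau, 0) are convex in a.
\min(a,\tau)=a-\max(a-\tau,0)

% ADMM splitting: introduce one auxiliary variable per pair,
% constrained to equal the centroid difference.
\theta_{ij}=\mu_i-\mu_j,\qquad 1\le i<j\le n
```

The DC split turns each step into a convex problem, and the auxiliary variables theta_ij separate the loss from the penalty, which is what makes closed-form updates of the kind the abstract mentions possible.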