1
Zhu C, Wang JL. Testing homogeneity: the trouble with sparse functional data. J R Stat Soc Series B Stat Methodol 2023; 85:705-731. [PMID: 37521166 PMCID: PMC10376451 DOI: 10.1093/jrsssb/qkad021] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/02/2022] [Revised: 12/06/2022] [Accepted: 02/25/2023] [Indexed: 08/01/2023]
Abstract
Testing the homogeneity between two samples of functional data is an important task. While this is feasible for intensely measured functional data, we explain why it is challenging for sparsely measured functional data and show what can be done for such data. In particular, we show that testing the marginal homogeneity based on point-wise distributions is feasible under some mild constraints and propose a new two-sample statistic that works well with both intensively and sparsely measured functional data. The proposed test statistic is formulated upon energy distance, and the convergence rate of the test statistic to its population version is derived along with the consistency of the associated permutation test. The aptness of our method is demonstrated on both synthetic and real data sets.
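As a rough illustration of the two ingredients this abstract names, energy distance and a permutation test, the sketch below applies the classical energy distance to two samples of fixed-length vectors. It is not the authors' statistic for sparsely measured functional data; the function names, the plug-in (V-statistic) estimator, and the default number of permutations are all assumptions made for this example.

```python
import numpy as np


def mean_pairwise_distance(a, b):
    """Average Euclidean distance over all pairs (row of a, row of b)."""
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1)).mean()


def energy_distance(x, y):
    """Plug-in estimate of 2 E|X - Y| - E|X - X'| - E|Y - Y'|."""
    return (2.0 * mean_pairwise_distance(x, y)
            - mean_pairwise_distance(x, x)
            - mean_pairwise_distance(y, y))


def energy_permutation_test(x, y, n_perm=500, seed=0):
    """Permutation p-value for H0: both samples share one distribution."""
    rng = np.random.default_rng(seed)
    observed = energy_distance(x, y)
    pooled = np.vstack([x, y])
    n_x = len(x)
    exceed = 0
    for _ in range(n_perm):
        # Shuffle the pooled rows and split them back into two groups.
        idx = rng.permutation(len(pooled))
        stat = energy_distance(pooled[idx[:n_x]], pooled[idx[n_x:]])
        exceed += int(stat >= observed)
    # Add-one correction keeps the p-value strictly positive.
    p_value = (exceed + 1) / (n_perm + 1)
    return observed, p_value


if __name__ == "__main__":
    rng = np.random.default_rng(42)
    sample_a = rng.normal(size=(40, 5))
    sample_b = rng.normal(loc=0.5, size=(40, 5))
    print(energy_permutation_test(sample_a, sample_b))
```

The permutation step only relies on exchangeability of the pooled observations under the null hypothesis, so it needs no distributional assumptions beyond that.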
Affiliation(s)
- Changbo Zhu
  - Address for correspondence: Changbo Zhu, Department of Applied and Computational Mathematics and Statistics, University of Notre Dame, Notre Dame, IN 46556, USA.
- Jane-Ling Wang
  - Department of Statistics, University of California, Davis, Davis, United States
2
James-Stein for the leading eigenvector. Proc Natl Acad Sci U S A 2023; 120:e2207046120. [PMID: 36603029 PMCID: PMC9926287 DOI: 10.1073/pnas.2207046120] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/06/2023]
Abstract
Recent research identifies and corrects bias, such as excess dispersion, in the leading sample eigenvector of a factor-based covariance matrix estimated from a high-dimension low sample size (HL) data set. We show that eigenvector bias can have a substantial impact on variance-minimizing optimization in the HL regime, while bias in estimated eigenvalues may have little effect. We describe a data-driven eigenvector shrinkage estimator in the HL regime called "James-Stein for eigenvectors" (JSE) and its close relationship with the James-Stein (JS) estimator for a collection of averages. We show, both theoretically and with numerical experiments, that, for certain variance-minimizing problems of practical importance, efforts to correct eigenvalues have little value in comparison to the JSE correction of the leading eigenvector. When certain extra information is present, JSE is a consistent estimator of the leading eigenvector.
4
Gillard J, O’Riordan E, Zhigljavsky A. Polynomial whitening for high-dimensional data. Comput Stat 2022. [DOI: 10.1007/s00180-022-01277-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/03/2022]
Abstract
The inverse square root of a covariance matrix is often desirable for performing data whitening in the process of applying many common multivariate data analysis methods. Direct calculation of the inverse square root is not available when the covariance matrix is either singular or nearly singular, as often occurs in high dimensions. We develop new methods, which we broadly call polynomial whitening, to construct a low-degree polynomial in the empirical covariance matrix which has similar properties to the true inverse square root of the covariance matrix (should it exist). Our method does not suffer in singular or near-singular settings, and is computationally tractable in high dimensions. We demonstrate that our construction of low-degree polynomials provides a good substitute for high-dimensional inverse square root covariance matrices, in both $d < N$ and $d \ge N$ cases. We offer examples on data whitening, outlier detection and principal component analysis to demonstrate the performance of the proposed method.
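To make the motivating problem concrete, the sketch below shows the classical eigendecomposition route to the inverse square root of an empirical covariance matrix, and how it breaks down when $d \ge N$. This is only the baseline the abstract contrasts against, not the polynomial whitening construction itself, which the abstract does not specify; the helper name `inverse_sqrt`, the singularity threshold, and the simulated data are assumptions for illustration.

```python
import numpy as np


def inverse_sqrt(cov, rel_tol=1e-10):
    """Symmetric inverse square root via eigendecomposition.

    Raises LinAlgError when cov is singular or nearly singular,
    which is the situation the abstract describes for high dimensions.
    """
    eigvals, eigvecs = np.linalg.eigh(cov)
    if eigvals.min() <= rel_tol * eigvals.max():
        raise np.linalg.LinAlgError("covariance is singular or nearly singular")
    # V diag(lambda^{-1/2}) V^T
    return (eigvecs / np.sqrt(eigvals)) @ eigvecs.T


rng = np.random.default_rng(0)

# Case d < N: 200 observations in 5 dimensions with unequal column scales.
tall = rng.normal(size=(200, 5)) * np.array([1.0, 2.0, 3.0, 4.0, 5.0])
cov_tall = np.cov(tall, rowvar=False)
whitened = (tall - tall.mean(axis=0)) @ inverse_sqrt(cov_tall)
cov_whitened = np.cov(whitened, rowvar=False)  # close to the identity

# Case d >= N: 10 observations in 50 dimensions, so the empirical
# covariance has rank at most N - 1 and no inverse square root exists.
wide = rng.normal(size=(10, 50))
cov_wide = np.cov(wide, rowvar=False)
rank_wide = np.linalg.matrix_rank(cov_wide)
```

In the second case `inverse_sqrt(cov_wide)` raises an error, which is the gap a substitute such as a low-degree polynomial in the covariance matrix is meant to fill.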
5
Imoto Y, Nakamura T, Escolar EG, Yoshiwaki M, Kojima Y, Yabuta Y, Katou Y, Yamamoto T, Hiraoka Y, Saitou M. Resolution of the curse of dimensionality in single-cell RNA sequencing data analysis. Life Sci Alliance 2022; 5:e202201591. [PMID: 35944930 PMCID: PMC9363502 DOI: 10.26508/lsa.202201591] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 07/05/2022] [Revised: 07/15/2022] [Accepted: 07/18/2022] [Indexed: 11/24/2022]
Abstract
Single-cell RNA sequencing (scRNA-seq) can determine gene expression in numerous individual cells simultaneously, promoting progress in the biomedical sciences. However, scRNA-seq data are high-dimensional with substantial technical noise, including dropouts. During analysis of scRNA-seq data, such noise engenders a statistical problem known as the curse of dimensionality (COD). Based on high-dimensional statistics, we herein formulate a noise reduction method, RECODE (resolution of the curse of dimensionality), for high-dimensional data with random sampling noise. We show that RECODE consistently resolves COD in relevant scRNA-seq data with unique molecular identifiers. RECODE does not involve dimension reduction and recovers expression values for all genes, including lowly expressed genes, realizing precise delineation of cell fate transitions and identification of rare cells with all gene information. Compared with representative imputation methods, RECODE employs different principles and exhibits superior overall performance in cell-clustering, expression value recovery, and single-cell-level analysis. The RECODE algorithm is parameter-free, data-driven, deterministic, and high-speed, and its applicability can be predicted based on the variance normalization performance. We propose RECODE as a powerful strategy for preprocessing noisy high-dimensional data.
Affiliation(s)
- Yusuke Imoto
  - Institute for the Advanced Study of Human Biology, Kyoto University Institute for Advanced Study, Kyoto University, Kyoto, Japan
- Tomonori Nakamura
  - Institute for the Advanced Study of Human Biology, Kyoto University Institute for Advanced Study, Kyoto University, Kyoto, Japan
  - Department of Anatomy and Cell Biology, Graduate School of Medicine, Kyoto University, Kyoto, Japan
  - The Hakubi Center for Advanced Research, Kyoto University, Kyoto, Japan
- Emerson G Escolar
  - Graduate School of Human Development and Environment, Kobe University, Kobe, Japan
  - Center for Advanced Intelligence Project, RIKEN, Tokyo, Japan
- Yoji Kojima
  - Institute for the Advanced Study of Human Biology, Kyoto University Institute for Advanced Study, Kyoto University, Kyoto, Japan
  - Department of Anatomy and Cell Biology, Graduate School of Medicine, Kyoto University, Kyoto, Japan
  - Center for iPS Cell Research and Application, Kyoto University, Kyoto, Japan
- Yukihiro Yabuta
  - Institute for the Advanced Study of Human Biology, Kyoto University Institute for Advanced Study, Kyoto University, Kyoto, Japan
  - Department of Anatomy and Cell Biology, Graduate School of Medicine, Kyoto University, Kyoto, Japan
- Yoshitaka Katou
  - Department of Anatomy and Cell Biology, Graduate School of Medicine, Kyoto University, Kyoto, Japan
- Takuya Yamamoto
  - Institute for the Advanced Study of Human Biology, Kyoto University Institute for Advanced Study, Kyoto University, Kyoto, Japan
  - Center for Advanced Intelligence Project, RIKEN, Tokyo, Japan
  - Center for iPS Cell Research and Application, Kyoto University, Kyoto, Japan
- Yasuaki Hiraoka
  - Institute for the Advanced Study of Human Biology, Kyoto University Institute for Advanced Study, Kyoto University, Kyoto, Japan
  - Center for Advanced Intelligence Project, RIKEN, Tokyo, Japan
  - Center for Advanced Study, Kyoto University Institute for Advanced Study, Kyoto University, Kyoto, Japan
- Mitinori Saitou
  - Institute for the Advanced Study of Human Biology, Kyoto University Institute for Advanced Study, Kyoto University, Kyoto, Japan
  - Department of Anatomy and Cell Biology, Graduate School of Medicine, Kyoto University, Kyoto, Japan
  - Center for iPS Cell Research and Application, Kyoto University, Kyoto, Japan
7
Allmon AG, Marron JS, Hudgens MG. diproperm: An R Package for the DiProPerm Test. The R Journal 2021; 13:266-272. [PMID: 35721233 PMCID: PMC9202909 DOI: 10.32614/rj-2021-072] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/15/2023]
Abstract
High-dimensional low sample size (HDLSS) data sets frequently emerge in many biomedical applications. The direction-projection-permutation (DiProPerm) test is a two-sample hypothesis test for comparing two high-dimensional distributions. The DiProPerm test is exact, i.e., the type I error is guaranteed to be controlled at the nominal level for any sample size, and thus is applicable in the HDLSS setting. This paper discusses the key components of the DiProPerm test, introduces the diproperm R package, and demonstrates the package on a real-world data set.
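The three steps in the test's name (direction, projection, permutation) can be sketched in a few lines. The version below is a simplified Python illustration, not the diproperm R package: it assumes the mean-difference vector as the separating direction and the difference of projected group means as the univariate statistic, and the function names `dipro_statistic` and `diproperm_test` are hypothetical.

```python
import numpy as np


def dipro_statistic(data, labels):
    """Direction + projection: project onto the mean-difference direction
    and return the difference of the projected group means."""
    group_a = data[labels == 1]
    group_b = data[labels == 0]
    direction = group_a.mean(axis=0) - group_b.mean(axis=0)
    norm = np.linalg.norm(direction)
    if norm == 0.0:
        return 0.0
    scores = data @ (direction / norm)
    return scores[labels == 1].mean() - scores[labels == 0].mean()


def diproperm_test(data, labels, n_perm=500, seed=0):
    """Permutation: recompute direction and statistic under shuffled labels."""
    rng = np.random.default_rng(seed)
    observed = dipro_statistic(data, labels)
    perm_stats = np.array([
        dipro_statistic(data, rng.permutation(labels))
        for _ in range(n_perm)
    ])
    # Add-one correction keeps the p-value valid at any sample size.
    p_value = (1 + np.sum(perm_stats >= observed)) / (n_perm + 1)
    return observed, p_value


if __name__ == "__main__":
    rng = np.random.default_rng(7)
    # HDLSS setting: 15 observations per group, 300 features.
    features = rng.normal(size=(30, 300))
    group = np.repeat([1, 0], 15)
    features[group == 1] += 0.5
    print(diproperm_test(features, group))
```

Recomputing the direction inside every permutation is what keeps the comparison fair: in the HDLSS setting almost any labelling can be separated along some direction, so the observed separation has to be judged against separations found the same way under shuffled labels.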
Affiliation(s)
- Andrew G Allmon
  - University of North Carolina at Chapel Hill, Department of Biostatistics
- J S Marron
  - University of North Carolina at Chapel Hill, Department of Biostatistics
- Michael G Hudgens
  - University of North Carolina at Chapel Hill, Department of Biostatistics
10
Affiliation(s)
- Changbo Zhu
  - Department of Statistics, University of Illinois at Urbana-Champaign, Champaign, IL, 61820, USA
- Xiaofeng Shao
  - Department of Statistics, University of Illinois at Urbana-Champaign, Champaign, IL, 61820, USA
12
Chang W, Ahn J, Jung S. Double data piling leads to perfect classification. Electron J Stat 2021. [DOI: 10.1214/21-ejs1945] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/19/2022]
Affiliation(s)
- Woonyoung Chang
  - Department of Statistics and Data Science, Carnegie Mellon University, Pittsburgh, PA 15213, United States
- Jeongyoun Ahn
  - Department of Industrial and Systems Engineering, KAIST, Daejeon 34141, South Korea
- Sungkyu Jung
  - Department of Statistics, Seoul National University, Seoul 08826, South Korea