1
Aizenbud Y, Jaffe A, Wang M, Hu A, Amsel N, Nadler B, Chang JT, Kluger Y. Spectral top-down recovery of latent tree models. Inf Inference 2023; 12:iaad032. PMID: 37593361; PMCID: PMC10431953; DOI: 10.1093/imaiai/iaad032.
Abstract
Modeling the distribution of high-dimensional data by a latent tree graphical model is a prevalent approach in multiple scientific domains. A common task is to infer the underlying tree structure, given only observations of its terminal nodes. Many algorithms for tree recovery are computationally intensive, which limits their applicability to trees of moderate size. For large trees, a common approach, termed divide-and-conquer, is to recover the tree structure in two steps. First, separately recover the structure of multiple, possibly random subsets of the terminal nodes. Second, merge the resulting subtrees to form a full tree. Here, we develop spectral top-down recovery (STDR), a deterministic divide-and-conquer approach to infer large latent tree models. Unlike previous methods, STDR partitions the terminal nodes in a non-random way, based on the Fiedler vector of a suitable Laplacian matrix related to the observed nodes. We prove that under certain conditions, this partitioning is consistent with the tree structure. This, in turn, leads to a significantly simpler merging procedure of the small subtrees. We prove that STDR is statistically consistent and bound the number of samples required to accurately recover the tree with high probability. Using simulated data from several common tree models in phylogenetics, we demonstrate that STDR has a significant advantage in terms of runtime, with improved or similar accuracy.
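The Fiedler-vector partitioning step at the heart of STDR can be illustrated on a generic similarity graph. The sketch below is an assumption-laden simplification: the paper constructs a specific Laplacian from the observed nodes, whereas here we split an arbitrary affinity matrix by the sign of the second eigenvector, using only NumPy.

```python
import numpy as np

def fiedler_partition(W):
    """Split nodes into two groups by the sign of the Fiedler vector.

    W: symmetric nonnegative similarity (affinity) matrix.
    This is a generic spectral bisection sketch, not STDR's exact Laplacian.
    """
    d = W.sum(axis=1)
    L = np.diag(d) - W                    # unnormalized graph Laplacian
    eigvals, eigvecs = np.linalg.eigh(L)  # eigenvalues in ascending order
    fiedler = eigvecs[:, 1]               # eigenvector of 2nd-smallest eigenvalue
    return fiedler >= 0

# Two triangles joined by one weak edge: the split separates the triangles.
W = np.array([
    [0,    1, 1, 0.05, 0, 0],
    [1,    0, 1, 0,    0, 0],
    [1,    1, 0, 0,    0, 0],
    [0.05, 0, 0, 0,    1, 1],
    [0,    0, 0, 1,    0, 1],
    [0,    0, 0, 1,    1, 0],
], dtype=float)
mask = fiedler_partition(W)
print(mask[:3], mask[3:])  # the two triangles land on opposite sides
```

The sign of an eigenvector is arbitrary, so only the two-group split (not which group is `True`) is meaningful.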
Affiliation(s)
- Yariv Aizenbud
- Program in Applied Mathematics, Yale University, New Haven, CT 06511, USA
- Ariel Jaffe
- Program in Applied Mathematics, Yale University, New Haven, CT 06511, USA
- Meng Wang
- Department of Pathology, Yale University, New Haven, CT 06511, USA
- Amber Hu
- Program in Applied Mathematics, Yale University, New Haven, CT 06511, USA
- Noah Amsel
- Program in Applied Mathematics, Yale University, New Haven, CT 06511, USA
- Boaz Nadler
- Department of Computer Science, Weizmann Institute of Science, Rehovot 76100, Israel
- Joseph T Chang
- Department of Statistics, Yale University, New Haven, CT 06520, USA
- Yuval Kluger
- Program in Applied Mathematics, Yale University, New Haven, CT 06511, USA
- Department of Pathology, Yale University, New Haven, CT 06511, USA
- Interdepartmental Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT 06511, USA
2
Luo L, Li L. Online two-way estimation and inference via linear mixed-effects models. Stat Med 2022; 41:5113-5133. PMID: 35983945; DOI: 10.1002/sim.9557.
Abstract
In this article, we tackle the estimation and inference problem of analyzing distributed streaming data collected continuously over multiple data sites. We propose an online two-way approach via linear mixed-effects models. We explicitly model the site-specific effects as random-effect terms, addressing both between-site heterogeneity and within-site correlation. We develop an online updating procedure that does not need to re-access previous data and can efficiently update the parameter estimate when either new data sites or new streams of observations from existing data sites become available. We derive a non-asymptotic error bound for the proposed online estimator, and show that it is asymptotically equivalent to the offline counterpart based on all the raw data. We compare our proposal with key alternative solutions both analytically and numerically, and demonstrate its advantages. We further illustrate the method with two data applications.
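The online-updating idea described above can be sketched with a deliberately simplified stand-in: ordinary least squares instead of the paper's mixed-effects estimator. Accumulating the sufficient statistics X'X and X'y lets each new batch be folded in without re-accessing earlier raw data, and the result coincides with the offline fit; the class and variable names below are illustrative, not from the paper.

```python
import numpy as np

class OnlineOLS:
    """Streaming least squares: keeps only X'X and X'y, never the raw data.

    A simplified illustration of online updating; the paper's estimator
    additionally handles random site effects, which this sketch omits.
    """
    def __init__(self, p):
        self.XtX = np.zeros((p, p))
        self.Xty = np.zeros(p)

    def update(self, X, y):
        # Fold a new batch into the sufficient statistics.
        self.XtX += X.T @ X
        self.Xty += X.T @ y

    def estimate(self):
        return np.linalg.solve(self.XtX, self.Xty)

rng = np.random.default_rng(0)
beta = np.array([1.0, -2.0, 0.5])
model = OnlineOLS(p=3)
X_all, y_all = [], []
for _ in range(5):  # five arriving batches of observations
    X = rng.normal(size=(100, 3))
    y = X @ beta + 0.1 * rng.normal(size=100)
    model.update(X, y)
    X_all.append(X)
    y_all.append(y)

online = model.estimate()
offline, *_ = np.linalg.lstsq(np.vstack(X_all), np.concatenate(y_all), rcond=None)
print(np.allclose(online, offline))  # online estimate matches the offline fit
```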
Affiliation(s)
- Lan Luo
- Department of Statistics and Actuarial Science, University of Iowa, Iowa City, Iowa, USA
- Lexin Li
- Department of Biostatistics and Epidemiology, University of California, Berkeley, Berkeley, California, USA
3
Wu W, Yang Y, Kang J, He K. Improving large-scale estimation and inference for profiling health care providers. Stat Med 2022; 41:2840-2853. PMID: 35318706; PMCID: PMC9314652; DOI: 10.1002/sim.9387.
Abstract
Provider profiling has been recognized as a useful tool in monitoring health care quality, facilitating inter-provider care coordination, and improving medical cost-effectiveness. Existing methods often use generalized linear models with fixed provider effects, especially when profiling dialysis facilities. As the number of providers under evaluation escalates, the computational burden becomes formidable even for specially designed workstations. To address this challenge, we introduce a serial blockwise inversion Newton algorithm exploiting the block structure of the information matrix. A shared-memory divide-and-conquer algorithm is proposed to further boost computational efficiency. In addition to the computational challenge, the current literature lacks an appropriate inferential approach to detecting providers with outlying performance especially when small providers with extreme outcomes are present. In this context, traditional score and Wald tests relying on large-sample distributions of the test statistics lead to inaccurate approximations of the small-sample properties. In light of the inferential issue, we develop an exact test of provider effects using exact finite-sample distributions, with the Poisson-binomial distribution as a special case when the outcome is binary. Simulation analyses demonstrate improved estimation and inference over existing methods. The proposed methods are applied to profiling dialysis facilities based on emergency department encounters using a dialysis patient database from the Centers for Medicare & Medicaid Services.
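The exact-test idea above rests on the fact that, with a binary outcome, a provider's event count under the null is Poisson-binomial with patient-specific probabilities. The sketch below computes that distribution by dynamic programming and an upper-tail p-value; the function names and the illustrative probabilities are mine, not the paper's.

```python
import numpy as np

def poisson_binomial_pmf(probs):
    """Exact PMF of a sum of independent Bernoulli(p_i), by convolution.

    Each patient contributes a two-point distribution [1-p, p]; convolving
    them builds up P(X = k) for k = 0..n exactly.
    """
    pmf = np.array([1.0])
    for p in probs:
        pmf = np.convolve(pmf, [1 - p, p])
    return pmf

def upper_tail_pvalue(observed, probs):
    """P(X >= observed) under the Poisson-binomial null."""
    pmf = poisson_binomial_pmf(probs)
    return float(pmf[observed:].sum())

# A small provider with 8 patients and patient-level null event probabilities
# (values are made up for illustration).
probs = [0.1, 0.2, 0.15, 0.3, 0.25, 0.1, 0.2, 0.05]
print(upper_tail_pvalue(6, probs))  # small tail probability: 6 events is extreme
```

Because the distribution is computed exactly, the test remains valid for small providers where normal approximations behind score and Wald tests break down.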
Affiliation(s)
- Wenbo Wu
- Department of Biostatistics, University of Michigan, Ann Arbor, Michigan
- Kidney Epidemiology and Cost Center, University of Michigan, Ann Arbor, Michigan
- Yuan Yang
- Parexel International, Newton, Massachusetts
- Jian Kang
- Department of Biostatistics, University of Michigan, Ann Arbor, Michigan
- Kidney Epidemiology and Cost Center, University of Michigan, Ann Arbor, Michigan
- Kevin He
- Department of Biostatistics, University of Michigan, Ann Arbor, Michigan
- Kidney Epidemiology and Cost Center, University of Michigan, Ann Arbor, Michigan
4
Mukherjee SS, Sarkar P, Bickel PJ. Two provably consistent divide-and-conquer clustering algorithms for large networks. Proc Natl Acad Sci U S A 2021; 118:e2100482118. PMID: 34716259; DOI: 10.1073/pnas.2100482118.
Abstract
In this article, we advance divide-and-conquer strategies for solving the community detection problem in networks. We propose two algorithms that perform clustering on several small subgraphs and then patch the results into a single clustering. The main advantage of these algorithms is that they significantly reduce the computational cost of traditional algorithms, including spectral clustering, semidefinite programs, modularity-based methods and likelihood-based methods, without losing accuracy, and at times even improving it. These algorithms are also, by nature, parallelizable. Since most traditional algorithms are accurate, and the corresponding optimization problems are much simpler on small problems, our divide-and-conquer methods provide an omnibus recipe for scaling traditional algorithms up to large networks. We prove the consistency of these algorithms under various subgraph selection procedures and perform extensive simulations and real-data analysis to understand the advantages of the divide-and-conquer approach in various settings.
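The "patching" step mentioned above can be sketched in isolation: given cluster labels from two overlapping subgraphs, relabel the second piece so that it agrees with the first on the overlap, then merge. This is only an illustration of the general conquer step under my own naming; the paper's actual patching and subgraph-selection procedures differ.

```python
def align_and_merge(nodes1, labels1, nodes2, labels2):
    """Patch two overlapping clusterings into one clustering.

    Labels from the second piece are renamed so that, on the overlap, they
    agree with the first piece as often as possible (majority matching).
    How each subgraph is clustered (spectral, modularity, ...) is orthogonal.
    """
    l1 = dict(zip(nodes1, labels1))
    l2 = dict(zip(nodes2, labels2))
    overlap = set(nodes1) & set(nodes2)
    # For each label used in piece 2, find its majority label in piece 1.
    mapping = {}
    for lab in set(labels2):
        votes = [l1[v] for v in overlap if l2[v] == lab]
        if votes:
            mapping[lab] = max(set(votes), key=votes.count)
    merged = dict(l1)
    for v in nodes2:
        merged.setdefault(v, mapping.get(l2[v], l2[v]))
    return merged

# Two pieces covering {0..5} and {3..9}; piece 2 uses different label names.
merged = align_and_merge(
    [0, 1, 2, 3, 4, 5], [0, 0, 0, 1, 1, 1],
    [3, 4, 5, 6, 7, 8, 9], ['a', 'a', 'a', 'a', 'b', 'b', 'b'],
)
# Node 6 is renamed from 'a' to 1; nodes 7-9 keep 'b' (no overlap evidence).
print(merged)
```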
5
Lin Y, Zhang H, Feng J, Shi B, Zhang M, Han Y, Wen W, Zhang T, Qi Y, Wu J. Unclonable Micro-Texture with Clonable Micro-Shape towards Rapid, Convenient, and Low-Cost Fluorescent Anti-Counterfeiting Labels. Small 2021; 17:e2100244. PMID: 34160145; DOI: 10.1002/smll.202100244.
Abstract
An ideal anti-counterfeiting label must be unclonable and accurate, but must also account for cost and efficiency. Traditional physical unclonable function (PUF) recognition technology, however, must match a captured image against every image in a database one by one, so the matching time grows with the number of samples. Here, a new kind of PUF anti-counterfeiting label is introduced with high modifiability, low reagent cost (2.1 × 10^-4 USD), simple and fast authentication (overall time 12.17 s) and high encoding capacity (2.1 × 10^623), together with its identification software. All-inorganic perovskite nanocrystalline films with a clonable micro-profile and an unclonable micro-texture are prepared by laser engraving for lyophilic patterning, liquid strip sliding for high-throughput droplet generation, and evaporative self-assembly for thin-film deposition. A variety of crystal film profile shapes can serve as a "specificator" for image recognition, and the verification time of recognition based on this divide-and-conquer strategy can be reduced by more than 20-fold.
Affiliation(s)
- Yuhong Lin
- Materials Genome Institute, Shanghai University, Shanghai, 200444, China
- Hongkun Zhang
- School of Computer Engineering and Science, Shanghai University, Shanghai, 200444, China
- Jingyun Feng
- Materials Genome Institute, Shanghai University, Shanghai, 200444, China
- Bori Shi
- Materials Genome Institute, Shanghai University, Shanghai, 200444, China
- Mengying Zhang
- Department of Physics, Shanghai University, Shanghai, 200444, China
- Yuexing Han
- School of Computer Engineering and Science, Shanghai University, Shanghai, 200444, China
- Shanghai Institute for Advanced Communication and Data Science, Shanghai University, Shanghai, 200444, China
- Weijia Wen
- Department of Physics, The Hong Kong University of Science and Technology, Hong Kong, China
- Tongyi Zhang
- Materials Genome Institute, Shanghai University, Shanghai, 200444, China
- Yabing Qi
- Energy Materials and Surface Sciences Unit (EMSSU), Okinawa Institute of Science and Technology Graduate University (OIST), 1919-1 Tancha, Onna-son, Okinawa, 904-0495, Japan
- Jinbo Wu
- Materials Genome Institute, Shanghai University, Shanghai, 200444, China
6
Sakhakarmi S, Park JW. Multi-Level-Phase Deep Learning Using Divide-and-Conquer for Scaffolding Safety. Int J Environ Res Public Health 2020; 17:E2391. PMID: 32244580; DOI: 10.3390/ijerph17072391.
Abstract
A traditional structural analysis of scaffolding structures requires loading conditions that are only available during design, not during operation. Thus, this study proposes a method that can be used during operation to make an automated safety prediction for scaffolds. It implements a divide-and-conquer technique with deep learning. As a test scaffolding, a four-bay, three-story scaffold model was used. Analysis of the model led to 1411 unique safety cases. To apply deep learning, a test simulation generated 1,540,000 datasets for pre-training and an additional 141,100 datasets for testing. The cases were then sub-divided into 18 categories based on failure modes at both global and local levels, along with combinations of member failures. Accordingly, the divide-and-conquer technique was applied to the 18 categories, each of which was pre-trained by a neural network. For the test datasets, the overall accuracy was 99%; for 82.78% of the 1411 safety cases the model reached 100% accuracy, which contributed to the high overall figure. In addition, high precision, recall and F1 scores for the majority of the safety cases indicate good performance of the model and a significant improvement over past research conducted on simpler cases. Specifically, the method demonstrated improved accuracy while handling a larger number of classifications. Thus, the results suggest that the methodology can be reliably applied to the safety assessment of scaffolding systems more complex than those tested in past studies, and it can easily be replicated for other classification problems.
7
Abstract
Supertree methods merge a set of overlapping phylogenetic trees into a single supertree containing all taxa of the input trees. The challenge in supertree reconstruction is how to deal with conflicting information in the input trees. Many algorithms optimizing different objective functions have been suggested to resolve these conflicts. In particular, there exist methods that encode the source trees in a matrix, where the supertree is constructed by applying a local search heuristic to optimize the respective objective function. We present a novel heuristic supertree algorithm called Bad Clade Deletion (BCD) supertrees. It uses minimum cuts to delete a locally minimal number of columns from such a matrix representation so that the matrix becomes compatible. This is the complement problem to Matrix Representation with Compatibility (Maximum Split Fit). Our algorithm has guaranteed polynomial worst-case running time and performs swiftly in practice. Unlike local search heuristics, it is guaranteed to return the directed perfect phylogeny for the input matrix, corresponding to the parent tree of the input trees, if one exists. Comparing supertrees to model trees for simulated data, BCD shows better accuracy (F1 score) than the state-of-the-art algorithms SuperFine (up to 3%) and Matrix Representation with Parsimony (up to 7%); at the same time, BCD is up to 7 times faster than SuperFine and up to 600 times faster than Matrix Representation with Parsimony. Finally, using the BCD supertree as a starting tree for a combined Maximum Likelihood analysis with RAxML, we reach significantly improved accuracy (1% higher F1 score) and running time (1.7-fold speedup).
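The notion of matrix compatibility that BCD optimizes can be sketched with the classic pairwise criterion for binary characters: a taxa-by-characters 0/1 matrix admits a directed perfect phylogeny iff no pair of columns exhibits all three row patterns (0,1), (1,0) and (1,1). This checks compatibility only; BCD's minimum-cut column deletion is not reproduced here, and the example matrices are my own.

```python
import numpy as np
from itertools import combinations

def is_compatible(M):
    """Directed perfect-phylogeny test for a binary taxa-by-characters matrix.

    Two characters are compatible iff their taxon sets are disjoint or
    nested, i.e. the column pair never shows all three of the patterns
    (0,1), (1,0) and (1,1). The matrix is compatible iff every pair is.
    """
    M = np.asarray(M)
    for i, j in combinations(range(M.shape[1]), 2):
        patterns = {(int(x), int(y)) for x, y in zip(M[:, i], M[:, j])}
        if {(0, 1), (1, 0), (1, 1)} <= patterns:
            return False
    return True

compatible = [[1, 1, 0],
              [1, 0, 0],
              [0, 0, 1]]   # nested/disjoint characters: a tree exists
conflicting = [[1, 1, 0],
               [1, 0, 0],
               [0, 1, 0]]  # columns 0 and 1 show (1,1), (1,0) and (0,1)
print(is_compatible(compatible), is_compatible(conflicting))  # True False
```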
Affiliation(s)
- Markus Fleischauer
- Chair for Bioinformatics, Institute for Computer Science, Friedrich-Schiller-University Jena, Jena, Germany
- Sebastian Böcker
- Chair for Bioinformatics, Institute for Computer Science, Friedrich-Schiller-University Jena, Jena, Germany
8
Tagore S, Chowdhury N, De RK. Analyzing methods for path mining with applications in metabolomics. Gene 2014; 534:125-138. PMID: 24230973; DOI: 10.1016/j.gene.2013.10.056.
Abstract
Metabolomics is one of the key approaches of systems biology; it studies biochemical networks comprising metabolites, enzymes, reactions and their interactions. As biological networks are very complex in nature, proper techniques and models need to be chosen for their better understanding and interpretation. One useful strategy in this regard is to use path mining and graph-theoretical approaches, which help in building hypothetical models and performing quantitative analysis. Furthermore, they also contribute to analyzing topological parameters in metabolome networks. Path mining techniques can be based on grammars, keys, patterns and indexing. Moreover, they can also be used for modeling metabolome networks, finding structural similarities between metabolites, in-silico metabolic engineering, shortest path estimation and various other graph-based analyses. In this manuscript, we highlight some core and applied areas of path mining for modeling and analysis of metabolic networks.