1. Palmin V, Mukhin A, Ivanova V, Perepukhov A, Nozik A. Automated component analysis in DOSY NMR using information criteria. J Magn Reson 2023; 355:107541. [PMID: 37688831 DOI: 10.1016/j.jmr.2023.107541] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/08/2023] [Revised: 08/04/2023] [Accepted: 08/19/2023] [Indexed: 09/11/2023]
Abstract
This study introduces a model selection technique based on Bayesian information criteria for estimating the number of components in a mixture during Diffusion-Ordered Spectroscopy (DOSY) Nuclear Magnetic Resonance (NMR) data analysis. As the accuracy of this technique is dependent on the efficiency of parameter estimators, we further investigate the performance of the Weighted Least Squares (WLS) and Maximum a Posteriori (MAP) estimators. The WLS method, enhanced with meticulously tuned L2-regularization, effectively detects components when the difference in self-diffusion coefficients is more than two-fold, especially when the component with the smaller coefficient has a larger weight ratio. The MAP method, strengthened by a substantial database of prior information, exhibits outstanding precision, decreasing this threshold to 1.5 times. Both estimators provide weight ratio estimates with standard deviations of approximately 1 percentage point, although the MAP method tends to overestimate the component with a larger self-diffusion coefficient. Deviations from the expected values can exceed 10 percentage points, often due to inaccuracies in component detection. The error estimates are determined using data resampling techniques derived from a large-scale 1000-point experiment and an additional five measurements from a single-component mixture. This approach allowed us to thoroughly examine data distribution characteristics, thereby laying a robust groundwork for future refinement efforts.
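The model-selection idea in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the WLS and MAP estimators are replaced here by a coarse grid search over diffusion coefficients combined with linear least squares for the amplitudes, and each candidate component count is scored with BIC = chi2 + k ln n for Gaussian noise of known standard deviation. The function names (`fit_on_grid`, `select_components`), the grid, and the synthetic data are all assumptions made for illustration.

```python
import itertools
import numpy as np

def fit_on_grid(b, signal, sigma, n_components, d_grid):
    """Best weighted least-squares fit of sum_k a_k * exp(-D_k * b),
    with each D_k restricted to d_grid (a simplification of this sketch)."""
    best = (np.inf, None, None)
    target = signal / sigma
    for d_combo in itertools.combinations(d_grid, n_components):
        # Amplitudes enter linearly, so ordinary least squares yields them directly.
        design = np.exp(-np.outer(b, d_combo)) / sigma
        amplitudes, *_ = np.linalg.lstsq(design, target, rcond=None)
        chi2 = float(np.sum((design @ amplitudes - target) ** 2))
        if chi2 < best[0]:
            best = (chi2, d_combo, amplitudes)
    return best

def select_components(b, signal, sigma, d_grid, max_components=3):
    """Return the component count with the lowest BIC, plus all BIC values."""
    n = len(signal)
    bic = {}
    for k in range(1, max_components + 1):
        chi2, _, _ = fit_on_grid(b, signal, sigma, k, d_grid)
        n_params = 2 * k  # one amplitude and one diffusion coefficient per component
        bic[k] = chi2 + n_params * np.log(n)
    return min(bic, key=bic.get), bic

# Synthetic two-component decay in arbitrary units (hypothetical data).
rng = np.random.default_rng(1)
b = np.linspace(0.0, 3.0, 40)
sigma = 0.005  # assumed known, constant noise level
signal = 0.6 * np.exp(-1.0 * b) + 0.4 * np.exp(-4.0 * b)
signal = signal + rng.normal(scale=sigma, size=b.size)

d_grid = np.geomspace(0.25, 16.0, 25)  # hypothetical search grid
best_k, bic = select_components(b, signal, sigma, d_grid)
print(best_k, bic)
```

Because the amplitudes are linear parameters, only the diffusion coefficients need a search, which keeps the sketch free of nonlinear optimizers. A real analysis would use a proper estimator and per-point uncertainties rather than a fixed grid and a single noise level.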
Affiliation(s)
- Vladimir Palmin
  - Moscow Institute of Physics and Technology (National Research University) - MIPT, 1 "A" Kerchenskaya st., Moscow, 117303, Russia
- Andrey Mukhin
  - Moscow Institute of Physics and Technology (National Research University) - MIPT, 1 "A" Kerchenskaya st., Moscow, 117303, Russia
- Valeriya Ivanova
  - Moscow Institute of Physics and Technology (National Research University) - MIPT, 1 "A" Kerchenskaya st., Moscow, 117303, Russia
- Alexander Perepukhov
  - Moscow Institute of Physics and Technology (National Research University) - MIPT, 1 "A" Kerchenskaya st., Moscow, 117303, Russia
- Alexander Nozik
  - Moscow Institute of Physics and Technology (National Research University) - MIPT, 1 "A" Kerchenskaya st., Moscow, 117303, Russia
2. Guo X, Chen Y, Tang CY. Information criteria for latent factor models: a study on factor pervasiveness and adaptivity. J Econom 2023; 233:237-250. [PMID: 36938506 PMCID: PMC10022528 DOI: 10.1016/j.jeconom.2022.03.005] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/18/2023]
Abstract
We study the information criteria extensively under general conditions for high-dimensional latent factor models. Upon carefully analyzing the estimation errors of the principal component analysis method, we establish theoretical results on the estimation accuracy of the latent factor scores, incorporating the impact from possibly weak factor pervasiveness; our analysis does not require the same factor strength of all the leading factors. To estimate the number of the latent factors, we propose a new penalty specification with a two-fold consideration: i) being adaptive to the strength of the factor pervasiveness, and ii) favoring more parsimonious models. Our theory establishes the validity of the proposed approach under general conditions. Additionally, we construct examples to demonstrate that when the factor strength is too weak, scenarios exist such that no information criterion can consistently identify the latent factors. We illustrate the performance of the proposed adaptive information criteria with extensive numerical examples, including simulations and a real data analysis.
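As a point of reference for how an information criterion picks the number of latent factors from principal components, here is a minimal sketch. The adaptive penalty proposed in this paper is not reproduced; the sketch uses the classical Bai and Ng (2002) IC_p2 penalty as a stand-in, which assumes strong (pervasive) factors. The function name `select_num_factors` and the simulated data are hypothetical.

```python
import numpy as np

def select_num_factors(X, k_max):
    """Choose the number of factors by minimizing
    IC(k) = ln V(k) + k * ((N + T) / (N * T)) * ln(min(N, T)),
    where V(k) is the mean squared residual after removing k principal components."""
    T, N = X.shape
    X = X - X.mean(axis=0)  # center each series
    singular_values = np.linalg.svd(X, compute_uv=False)
    total = np.sum(singular_values ** 2)
    penalty = (N + T) / (N * T) * np.log(min(N, T))
    ic = []
    for k in range(k_max + 1):
        # The residual sum of squares equals the energy in the discarded singular values.
        v_k = (total - np.sum(singular_values[:k] ** 2)) / (N * T)
        ic.append(float(np.log(v_k) + k * penalty))
    return int(np.argmin(ic)), ic

# Simulated panel with three strong factors (hypothetical data).
rng = np.random.default_rng(0)
T, N, r = 200, 100, 3
factors = rng.normal(size=(T, r))
loadings = rng.normal(size=(N, r))
X = factors @ loadings.T + rng.normal(size=(T, N))

k_hat, ic = select_num_factors(X, k_max=8)
print(k_hat)
```

With weak factors, the gap between signal and noise singular values shrinks and a fixed penalty of this kind can under-select, which is the situation the abstract's adaptive penalty is designed to address.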
Affiliation(s)
- Xiao Guo
  - International Institute of Finance, School of Management, University of Science and Technology of China, Hefei, Anhui 230026, People’s Republic of China
- Yu Chen
  - International Institute of Finance, School of Management, University of Science and Technology of China, Hefei, Anhui 230026, People’s Republic of China
- Cheng Yong Tang
  - Department of Statistical Science, Temple University, 1810 Liacouras Walk, Philadelphia, Pennsylvania 19122-6083, U.S.A
3. O’Neill M, Burke K. Variable selection using a smooth information criterion for distributional regression models. Stat Comput 2023; 33:71. [PMID: 37155560 PMCID: PMC10121547 DOI: 10.1007/s11222-023-10204-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/07/2022] [Accepted: 01/03/2023] [Indexed: 05/10/2023]
Abstract
Modern variable selection procedures make use of penalization methods to execute simultaneous model selection and estimation. A popular method is the least absolute shrinkage and selection operator, the use of which requires selecting the value of a tuning parameter. This parameter is typically tuned by minimizing the cross-validation error or Bayesian information criterion, but this can be computationally intensive as it involves fitting an array of different models and selecting the best one. In contrast with this standard approach, we have developed a procedure based on the so-called "smooth IC" (SIC) in which the tuning parameter is automatically selected in one step. We also extend this model selection procedure to the distributional regression framework, which is more flexible than classical regression modelling. Distributional regression, also known as multiparameter regression, introduces flexibility by taking account of the effect of covariates through multiple distributional parameters simultaneously, e.g., mean and variance. These models are useful in the context of normal linear regression when the process under study exhibits heteroscedastic behaviour. Reformulating the distributional regression estimation problem in terms of penalized likelihood enables us to take advantage of the close relationship between model selection criteria and penalization. Utilizing the SIC is computationally advantageous, as it obviates the issue of having to choose multiple tuning parameters. Supplementary Information The online version contains supplementary material available at 10.1007/s11222-023-10204-8.
Affiliation(s)
- Meadhbh O’Neill
  - Department of Mathematics and Statistics, University of Limerick, Limerick, Republic of Ireland
- Kevin Burke
  - Department of Mathematics and Statistics, University of Limerick, Limerick, Republic of Ireland
4. Staerk C, Mayr A. Randomized boosting with multivariable base-learners for high-dimensional variable selection and prediction. BMC Bioinformatics 2021; 22:441. [PMID: 34530737 PMCID: PMC8447543 DOI: 10.1186/s12859-021-04340-z] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 02/24/2021] [Accepted: 08/24/2021] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Statistical boosting is a computational approach to select and estimate interpretable prediction models for high-dimensional biomedical data, leading to implicit regularization and variable selection when combined with early stopping. Traditionally, the set of base-learners is fixed for all iterations and consists of simple regression learners including only one predictor variable at a time. Furthermore, the number of iterations is typically tuned by optimizing the predictive performance, leading to models which often include unnecessarily large numbers of noise variables. RESULTS We propose three consecutive extensions of classical component-wise gradient boosting. In the first extension, called Subspace Boosting (SubBoost), base-learners can consist of several variables, allowing for multivariable updates in a single iteration. To compensate for the larger flexibility, the ultimate selection of base-learners is based on information criteria leading to an automatic stopping of the algorithm. As the second extension, Random Subspace Boosting (RSubBoost) additionally includes a random preselection of base-learners in each iteration, enabling the scalability to high-dimensional data. In a third extension, called Adaptive Subspace Boosting (AdaSubBoost), an adaptive random preselection of base-learners is considered, focusing on base-learners which have proven to be predictive in previous iterations. Simulation results show that the multivariable updates in the three subspace algorithms are particularly beneficial in cases of high correlations among signal covariates. In several biomedical applications the proposed algorithms tend to yield sparser models than classical statistical boosting, while showing a very competitive predictive performance also compared to penalized regression approaches like the (relaxed) lasso and the elastic net. 
CONCLUSIONS The proposed randomized boosting approaches with multivariable base-learners are promising extensions of statistical boosting, particularly suited for highly-correlated and sparse high-dimensional settings. The incorporated selection of base-learners via information criteria induces automatic stopping of the algorithms, promoting sparser and more interpretable prediction models.
Affiliation(s)
- Christian Staerk
  - Department of Medical Biometry, Informatics and Epidemiology, University Hospital Bonn, Venusberg-Campus 1, 53127, Bonn, Germany
- Andreas Mayr
  - Department of Medical Biometry, Informatics and Epidemiology, University Hospital Bonn, Venusberg-Campus 1, 53127, Bonn, Germany
5. Chin V, Ioannidis JPA, Tanner MA, Cripps S. Effect estimates of COVID-19 non-pharmaceutical interventions are non-robust and highly model-dependent. J Clin Epidemiol 2021; 136:96-132. [PMID: 33781862 PMCID: PMC7997643 DOI: 10.1016/j.jclinepi.2021.03.014] [Citation(s) in RCA: 28] [Impact Index Per Article: 9.3] [Received: 12/09/2020] [Revised: 03/03/2021] [Accepted: 03/10/2021] [Indexed: 12/21/2022]
Abstract
Objective To compare the inference regarding the effectiveness of the various non-pharmaceutical interventions (NPIs) for COVID-19 obtained from different SIR models. Study design and setting We explored two models developed by Imperial College that considered only NPIs without accounting for mobility (model 1) or only mobility (model 2), and a model accounting for the combination of mobility and NPIs (model 3). Imperial College applied models 1 and 2 to 11 European countries and to the USA, respectively. We applied these models to 14 European countries (original 11 plus another 3), over two different time horizons. Results While model 1 found that lockdown was the most effective measure in the original 11 countries, model 2 showed that lockdown had little or no benefit as it was typically introduced at a point when the time-varying reproduction number was already very low. Model 3 found that the simple banning of public events was beneficial, while lockdown had no consistent impact. Based on Bayesian metrics, model 2 was better supported by the data than either model 1 or model 3 for both time horizons. Conclusion Inferences on effects of NPIs are non-robust and highly sensitive to model specification. In the SIR modeling framework, the impacts of lockdown are uncertain and highly model-dependent.
Affiliation(s)
- Vincent Chin
  - Australian Research Council Training Centre in Data Analytics for Resources and Environments, Sydney, New South Wales, Australia; School of Mathematics and Statistics, The University of Sydney, Sydney, New South Wales, Australia
- John P A Ioannidis
  - Stanford Prevention Research Center, Department of Medicine, Stanford University, Stanford, CA, USA; Department of Epidemiology and Population Health, Stanford University, Stanford, CA, USA; Department of Biomedical Data Sciences, Stanford University, Stanford, CA, USA; Department of Statistics, Stanford University, Stanford, CA, USA; Meta-Research Innovation Center at Stanford (METRICS), Stanford University, Stanford, CA, USA
- Martin A Tanner
  - Department of Statistics, Northwestern University, Evanston, IL, USA
- Sally Cripps
  - Australian Research Council Training Centre in Data Analytics for Resources and Environments, Sydney, New South Wales, Australia; School of Mathematics and Statistics, The University of Sydney, Sydney, New South Wales, Australia
6. Takahashi K, Shimadzu H. Detecting multiple spatial disease clusters: information criterion and scan statistic approach. Int J Health Geogr 2020; 19:33. [PMID: 32878638 PMCID: PMC7469351 DOI: 10.1186/s12942-020-00228-y] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 05/26/2020] [Accepted: 08/25/2020] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Detecting the geographical tendency for the presence of a disease or incident, particularly at an early stage, is a key challenge for preventing severe consequences. Given recent rapid advancements in information technologies, a comprehensive framework is required that enables simultaneous detection of multiple spatial clusters, whether disease cases are randomly scattered or clustered around specific epicenters on a larger scale. We develop a new methodology that detects multiple spatial disease clusters and evaluate its performance against other existing methods. METHODS A novel framework for spatial multiple-cluster detection is developed. The framework stands directly on the integrated bases of scan statistics and generalized linear models, adopting a new information criterion that selects the appropriate number of disease clusters. We evaluated whether the proposed approach tends to select the correct number of clusters using a real dataset, hospital admissions for chronic obstructive pulmonary disease (COPD) in England, and simulated data. RESULTS Both a case study and simulation studies confirmed that the proposed method performed better than conventional cluster detection procedures, achieving higher sensitivity. CONCLUSIONS We proposed a new statistical framework that simultaneously detects and evaluates multiple disease clusters in a large study space, with high detection power compared to conventional approaches.
Affiliation(s)
- Kunihiko Takahashi
  - Department of Biostatistics, M&D Data Science Center, Tokyo Medical and Dental University, 1-5-45, Yushima, Bunkyo-ku, Tokyo, 113-8510, Japan
- Hideyasu Shimadzu
  - Department of Mathematical Sciences, Loughborough University, Loughborough, Leicestershire, UK; Teikyo University Graduate School of Public Health, Tokyo, Japan
7. Carvajal-Rodríguez A. Multi-model inference of non-random mating from an information theoretic approach. Theor Popul Biol 2019; 131:38-53. [PMID: 31756362 DOI: 10.1016/j.tpb.2019.11.002] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 07/09/2019] [Revised: 10/17/2019] [Accepted: 11/05/2019] [Indexed: 10/25/2022]
Abstract
Non-random mating has a significant impact on the evolution of organisms. Here, I developed a modelling framework for discrete traits (with any number of phenotypes) to explore different models connecting the causes of non-random mating (mate competition and/or mate choice) and their consequences (sexual selection and/or assortative mating). I derived the formulae for the maximum likelihood estimates of each model and used information criteria to perform multi-model inference. Simulation results showed a good performance of both model selection and parameter estimation. The methodology was applied to ecotype data of the marine gastropod Littorina saxatilis from Galicia (Spain), to show that the mating pattern is better described by models with two parameters that involve both mate choice and competition, generating positive assortative mating plus female sexual selection. As far as I know, this is the first standardized methodology for model selection and multi-model inference of mating parameters for discrete traits. The advantages of this framework include the ability to set up models in which the parameters connect causes, such as mate competition and mate choice, with their outcome in the form of data patterns of sexual selection and assortative mating. For some models, the parameters may have a double effect, i.e. they produce both sexual selection and assortative mating, while for others there are separate parameters for each kind of pattern. From an empirical point of view, it is much easier to study patterns than processes and, for this reason, the causal mechanisms of sexual selection are not as well known as the patterns they produce. The goal of the present work is to propose a new tool that helps to distinguish among different alternative processes behind the observed mating pattern. The full methodology was implemented in software called InfoMating (available at http://acraaj.webs6.uvigo.es/InfoMating/Infomating.htm).
Affiliation(s)
- A Carvajal-Rodríguez
  - Departamento de Bioquímica, Genética e Inmunología, Universidad de Vigo, 36310 Vigo, Spain
8. Shi C, Lu W, Song R. Determining the Number of Latent Factors in Statistical Multi-Relational Learning. J Mach Learn Res 2019; 20:23. [PMID: 31983896 PMCID: PMC6980192] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/10/2023]
Abstract
Statistical relational learning is primarily concerned with learning and inferring relationships between entities in large-scale knowledge graphs. Nickel et al. (2011) proposed a RESCAL tensor factorization model for statistical relational learning, which achieves better or at least comparable results on common benchmark data sets when compared to other state-of-the-art methods. Given a positive integer s, RESCAL computes an s-dimensional latent vector for each entity. The latent factors can be further used for solving relational learning tasks, such as collective classification, collective entity resolution and link-based clustering. The focus of this paper is to determine the number of latent factors in the RESCAL model. Due to the structure of the RESCAL model, its log-likelihood function is not concave. As a result, the corresponding maximum likelihood estimators (MLEs) may not be consistent. Nonetheless, we design a specific pseudometric, prove the consistency of the MLEs under this pseudometric and establish its rate of convergence. Based on these results, we propose a general class of information criteria and prove their model selection consistencies when the number of relations is either bounded or diverges at a proper rate of the number of entities. Simulations and real data examples show that our proposed information criteria have good finite sample properties.
Affiliation(s)
- Chengchun Shi
  - Department of Statistics, North Carolina State University, Raleigh, NC 27695, USA
- Wenbin Lu
  - Department of Statistics, North Carolina State University, Raleigh, NC 27695, USA
- Rui Song
  - Department of Statistics, North Carolina State University, Raleigh, NC 27695, USA
9. De Beuckeleer LI, Herrebout WA. Exploring the limits of cryospectroscopy: Least-squares based approaches for analyzing the self-association of HCl. Spectrochim Acta A Mol Biomol Spectrosc 2016; 154:89-97. [PMID: 26519915 DOI: 10.1016/j.saa.2015.10.012] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 06/03/2015] [Revised: 10/07/2015] [Accepted: 10/19/2015] [Indexed: 06/05/2023]
Abstract
To rationalize the concentration-dependent behavior observed for a large spectral data set of HCl recorded in liquid argon, least-squares based numerical methods are developed and validated. In these methods, for each wavenumber a polynomial is used to mimic the relation between monomer concentrations and measured absorbances. Least-squares fitting of higher-degree polynomials tends to overfit and thus leads to compensation effects, where a contribution due to one species is compensated for by a negative contribution of another. The compensation effects are corrected for by carefully analyzing, using the AIC and BIC information criteria, the differences observed between consecutive fittings when the degree of the polynomial model is systematically increased, and by introducing constraints that prohibit negative absorbances for the monomer or for one of the oligomers. The method developed should allow other, more complicated self-associating systems to be analyzed with a much higher accuracy than before.
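The degree-selection step described in this abstract can be sketched as follows. This is a generic illustration under stated assumptions rather than the authors' procedure: it fits unconstrained polynomials of increasing degree at a single wavenumber, scores each fit with the Gaussian-error AIC and BIC (up to an additive constant), and omits the non-negativity constraints. The names `information_criteria`, `select_degree`, and `monomer_conc`, as well as the synthetic monomer-plus-dimer data, are hypothetical.

```python
import numpy as np
from numpy.polynomial import polynomial as P

def information_criteria(y, y_hat, n_params):
    """AIC and BIC for least squares with Gaussian errors, up to an additive constant."""
    n = len(y)
    rss = float(np.sum((y - y_hat) ** 2))
    fit_term = n * np.log(rss / n)
    aic = fit_term + 2 * n_params
    bic = fit_term + n_params * np.log(n)
    return aic, bic

def select_degree(x, y, max_degree=6):
    """Fit polynomials of degree 1..max_degree and return the AIC- and BIC-preferred degrees."""
    scores = {}
    for degree in range(1, max_degree + 1):
        coeffs = P.polyfit(x, y, degree)  # coefficients ordered from low to high degree
        y_hat = P.polyval(x, coeffs)
        scores[degree] = information_criteria(y, y_hat, degree + 1)
    best_aic = min(scores, key=lambda d: scores[d][0])
    best_bic = min(scores, key=lambda d: scores[d][1])
    return best_aic, best_bic, scores

# Synthetic absorbance with monomer (linear) and dimer (quadratic) terms only.
rng = np.random.default_rng(0)
monomer_conc = np.linspace(0.0, 1.0, 60)
absorbance = 0.5 * monomer_conc + 2.0 * monomer_conc ** 2
absorbance = absorbance + rng.normal(scale=0.01, size=monomer_conc.size)

best_aic, best_bic, scores = select_degree(monomer_conc, absorbance)
print(best_aic, best_bic)
```

Since BIC penalizes each extra coefficient by ln n rather than 2, it never prefers a higher degree than AIC once n exceeds about 7, which makes comparing the two a simple check for overfitting.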
Affiliation(s)
- Liene I De Beuckeleer
  - Department of Chemistry, University of Antwerp, Groenenborgerlaan 171, 2020 Antwerp, Belgium
- Wouter A Herrebout
  - Department of Chemistry, University of Antwerp, Groenenborgerlaan 171, 2020 Antwerp, Belgium