1. A redescending M-estimator approach for outlier-resilient modeling. Sci Rep 2024; 14:7131. [PMID: 38532107] [DOI: 10.1038/s41598-024-57906-1]
Abstract
The OLS model is built on the assumption that the error terms are normally distributed. However, this assumption is easily violated, especially when the data contain outliers: a single outlier can disrupt the normality of the error terms and make the OLS model less effective. In such situations, M-estimators (MEs) come into play to obtain reliable estimates. We introduce a redescending M-estimator (RME) for robust regression to handle datasets with outliers. The proposed RME produces more robust estimates by effectively managing the influence of outliers, even at lower values of the tuning constant. We compared its performance with existing RMEs using real-life data examples and an extensive simulation study. The results show that the suggested RME is more efficient than the competing estimators in various situations.
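To make the redescending idea concrete, here is a minimal sketch (not the paper's estimator; the psi-function, tuning constant, and helper names are illustrative assumptions) using Tukey's biweight, a classical redescending M-estimator whose IRLS weights drop to exactly zero for large scaled residuals:

```python
import statistics

def tukey_weight(r, c=4.685):
    """Tukey biweight IRLS weight: redescends to exactly 0 for |r| > c."""
    u = r / c
    return (1.0 - u * u) ** 2 if abs(u) < 1.0 else 0.0

def weighted_line_fit(x, y, w):
    """Weighted least-squares fit of y = a + b*x."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    b = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y)) / \
        sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    return my - b * mx, b

def redescending_fit(x, y, c=4.685, iters=30):
    """IRLS: alternate robust residual scaling (MAD) and reweighted fitting."""
    a, b = weighted_line_fit(x, y, [1.0] * len(x))   # start from plain OLS
    for _ in range(iters):
        res = [yi - (a + b * xi) for xi, yi in zip(x, y)]
        s = statistics.median(abs(r) for r in res) / 0.6745 or 1.0  # robust scale
        w = [tukey_weight(r / s, c) for r in res]
        a, b = weighted_line_fit(x, y, w)
    return a, b

# line y = 1 + 2x with one gross outlier at x = 5
xs = list(range(10))
ys = [1.0 + 2.0 * xi for xi in xs]
ys[5] = 100.0
a, b = redescending_fit(xs, ys)
```

Unlike Huber-type estimators, whose influence merely plateaus, a redescending psi removes an extreme point completely once its scaled residual exceeds the tuning constant c.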
2. Quantized minimum error entropy with fiducial points for robust regression. Neural Netw 2023; 168:405-418. [PMID: 37804744] [DOI: 10.1016/j.neunet.2023.09.034]
Abstract
Minimum error entropy with fiducial points (MEEF) has received a lot of attention due to its outstanding ability to curb the negative influence of non-Gaussian noise in machine learning and signal processing. However, estimating the information potential of MEEF involves a double summation over all available error samples, which can impose a large computational burden in many practical scenarios. In this paper, an efficient quantization method is therefore adopted to represent the primary set of error samples with a smaller subset, yielding a quantized MEEF (QMEEF). Some basic properties of QMEEF are presented and proved from a theoretical perspective. In addition, we apply this new criterion to train a class of linear-in-parameters models, including the commonly used linear regression model, the random vector functional link network, and the broad learning system as special cases. Experimental results on various datasets demonstrate the desirable performance of the proposed methods on regression tasks with contaminated data.
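The computational point is that the information potential is a double sum over all N error samples, i.e. O(N^2) kernel evaluations, while quantization replaces the inner sum with M << N weighted code words. A minimal sketch (the online eps-ball quantizer and all names are illustrative assumptions, not the paper's exact scheme):

```python
import math

def gauss_kernel(x, sigma=1.0):
    return math.exp(-x * x / (2.0 * sigma * sigma))

def information_potential(errors, sigma=1.0):
    """Full MEE information potential: a double sum, O(N^2) kernel evaluations."""
    n = len(errors)
    return sum(gauss_kernel(ei - ej, sigma)
               for ei in errors for ej in errors) / (n * n)

def quantize(errors, eps):
    """Online eps-ball quantizer: merge each sample into the nearest code word
    if it lies within eps, otherwise open a new code word. Returns the code
    words and how many samples each one absorbed."""
    codebook, counts = [], []
    for e in errors:
        if codebook:
            d, k = min((abs(e - c), k) for k, c in enumerate(codebook))
            if d <= eps:
                counts[k] += 1
                continue
        codebook.append(e)
        counts.append(1)
    return codebook, counts

def quantized_information_potential(errors, eps, sigma=1.0):
    """Approximate the double sum with O(N*M) evaluations, M = codebook size."""
    code, cnt = quantize(errors, eps)
    n = len(errors)
    return sum(m * gauss_kernel(ei - c, sigma)
               for ei in errors for c, m in zip(code, cnt)) / (n * n)

errors = [i / 100.0 for i in range(100)]
v_full = information_potential(errors)
v_quant = quantized_information_potential(errors, eps=0.1)
```

With a smooth Gaussian kernel, collapsing nearby error samples onto one code word changes each kernel argument by at most eps, so the quantized estimate stays close to the full double sum at a fraction of the cost.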
3. Robust statistical methods for high-dimensional data, with applications in tribology. Anal Chim Acta 2023; 1279:341762. [PMID: 37827663] [DOI: 10.1016/j.aca.2023.341762]
Abstract
Data sets derived from practical experiments often pose challenges for (robust) statistical methods. In high-dimensional data sets, more variables than observations are recorded, and often some observations do not follow the structure of the data majority. To handle such data with outlying observations, a variety of robust regression and classification methods have been developed for low-dimensional data. The high-dimensional case, however, is more challenging, and the range of robust methods is much more limited. The choice of method depends on the specific data structure, and numerical problems are more likely to occur. We give an overview of selected robust methods and implementations and demonstrate their application on two high-dimensional data sets from tribology. We show that robust statistical methods, combined with appropriate pre-processing and sampling strategies, yield increased prediction performance and insight into data differing from the majority.
4. RegCloser: a robust regression approach to closing genome gaps. BMC Bioinformatics 2023; 24:249. [PMID: 37312038] [DOI: 10.1186/s12859-023-05367-0]
Abstract
BACKGROUND Closing gaps in draft genomes leads to more complete and contiguous genome assemblies. Ubiquitous genomic repeats challenge existing gap-closing methods, which are based either on the k-mer representation of the de Bruijn graph or on the overlap-layout-consensus paradigm. Moreover, chimeric reads cause erroneous k-mers in the former and false overlaps between reads in the latter. RESULTS We propose a novel local assembly approach to gap closing, called RegCloser. It represents read coordinates and their overlaps respectively as parameters and observations in a linear regression model. The optimal overlap is searched only in the restricted range consistent with insert sizes. Under this linear regression framework, local DNA assembly becomes a robust parameter estimation problem. We solve the problem with a customized robust regression procedure that resists the influence of false overlaps by optimizing a convex global Huber loss function. The global optimum is obtained by iteratively solving a sparse system of linear equations. On both simulated and real datasets, RegCloser outperformed other popular methods in accurately resolving the copy number of tandem repeats, and achieved superior completeness and contiguity. Applying RegCloser to a plateau zokor draft genome that had previously been improved by long reads further increased contig N50 threefold. We also tested the robust regression approach on layout generation for long reads. CONCLUSIONS RegCloser is a competitive gap-closing tool. The software is available at https://github.com/csh3/RegCloser . The robust regression approach could also be incorporated into the layout module of long-read assemblers.
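The estimation idea can be sketched in miniature: read coordinates are parameters, pairwise offsets are observations, and a Huber loss downweights false overlaps. The following toy version (all names, the tuning constant, and the dense solver are illustrative assumptions; RegCloser iteratively solves a sparse system) recovers 1-D positions despite one false overlap:

```python
def solve(A, b):
    """Gauss-Jordan elimination with partial pivoting for a small dense system."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(n):
            if r != c and M[r][c] != 0.0:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * p for a, p in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def huber_weight(r, delta=1.0):
    """IRLS weight for the Huber loss: 1 in the quadratic zone, delta/|r| beyond."""
    return 1.0 if abs(r) <= delta else delta / abs(r)

def fit_coordinates(edges, n_nodes, delta=1.0, iters=30):
    """Place nodes on a line from noisy pairwise offsets d ~ p[j] - p[i],
    anchoring p[0] = 0 and minimizing a convex Huber loss by IRLS."""
    p = [0.0] * n_nodes
    m = n_nodes - 1                      # free coordinates p[1..]
    for _ in range(iters):
        A = [[0.0] * m for _ in range(m)]
        b = [0.0] * m
        for i, j, d in edges:            # assemble weighted normal equations
            w = huber_weight(p[j] - p[i] - d, delta)
            if i > 0:
                A[i - 1][i - 1] += w
                b[i - 1] -= w * d
            if j > 0:
                A[j - 1][j - 1] += w
                b[j - 1] += w * d
            if i > 0 and j > 0:
                A[i - 1][j - 1] -= w
                A[j - 1][i - 1] -= w
        p = [0.0] + solve(A, b)
    return p

# reads truly at 0, 10, 20, 30; the last observation is a false overlap
edges = [(0, 1, 10.1), (1, 2, 9.9), (2, 3, 10.05),
         (0, 2, 20.0), (1, 3, 20.1), (0, 3, 5.0)]
p = fit_coordinates(edges, 4)
```

Because the Huber loss grows only linearly for large residuals, the grossly wrong offset exerts a bounded pull and the true layout survives; with a squared loss it would drag the coordinates far off.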
5. Mitigating the impact of outliers in traffic crash analysis: A robust Bayesian regression approach with application to tunnel crash data. Accid Anal Prev 2023; 185:107019. [PMID: 36907031] [DOI: 10.1016/j.aap.2023.107019]
Abstract
Traffic crash datasets are often marred by anomalous data points, commonly referred to as outliers. These outliers can profoundly affect the results obtained from traditional methods such as the logit and probit models commonly used in traffic safety analysis, yielding biased and unreliable estimates. To mitigate this issue, this study introduces a robust Bayesian regression approach, the robit model, which replaces the thin-tailed link functions of these models with one based on a heavy-tailed Student's t distribution, effectively reducing the influence of outliers on the analysis. Furthermore, a sandwich algorithm based on data augmentation is proposed to improve the efficiency of posterior estimation. The proposed model is rigorously tested on a dataset of tunnel crashes, and the results demonstrate its efficiency, robustness, and superior performance compared to traditional methods. The study also reveals that several factors, such as nighttime conditions and speeding, have a significant impact on the injury severity of tunnel crashes. This research provides a comprehensive understanding of outlier treatment methods in traffic safety studies and offers valuable recommendations for developing appropriate countermeasures to prevent severe injuries in tunnel crashes.
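The robustness mechanism is visible in the score (negative log-density derivative) of the latent error distribution: for probit's normal errors the score grows without bound, while for a Student's t it redescends, so extreme latent residuals lose influence. A minimal sketch (the degrees-of-freedom value is illustrative):

```python
def normal_score(x):
    """Score of a standard normal latent error: -(d/dx) log phi(x) = x.
    The influence of an extreme residual grows without bound (probit behavior)."""
    return x

def t_score(x, nu=4.0):
    """Score of a Student-t latent error: (nu + 1) * x / (nu + x**2).
    It peaks near |x| = sqrt(nu) and then redescends toward 0 (robit behavior)."""
    return (nu + 1.0) * x / (nu + x * x)
```

An observation sitting ten latent standard deviations from the fitted line therefore moves a robit fit less than one sitting two away, which is exactly how the heavy tail tames outliers.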
6. Robust M-estimation-based maximum correntropy Kalman filter. ISA Trans 2023; 136:198-209. [PMID: 36372604] [DOI: 10.1016/j.isatra.2022.10.025]
Abstract
In this paper, a framework that combines an M-estimation and information-theoretic-learning (ITL)-based Kalman filter under impulsive noises is presented. The ITL-based methods make the most of the features of the data itself and can improve robustness by choosing an appropriate kernel bandwidth. However, small kernel bandwidths may lead to divergence. Nonetheless, robust-regression methods can improve the robustness from the statistical perspective and are independent of kernel bandwidth. This motivates us to fuse M-estimation-based weighting methods and the ITL-based Kalman filter. The proposed framework inhibits the divergence trend of ITL-based Kalman filters at low kernel bandwidth and improves the performance at large kernel bandwidth. Additionally, we use the unscented Kalman filtering method to extend the proposed algorithm to the nonlinear case. Monte Carlo simulations demonstrate the robustness and effectiveness of the proposed algorithm.
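The correntropy side of the framework can be illustrated in its simplest form, a maximum-correntropy location estimate computed by the usual fixed-point iteration (a standalone sketch, not the Kalman filter itself; the kernel bandwidth sigma plays the role discussed above):

```python
import math

def correntropy_mean(xs, sigma=1.0, iters=50):
    """Fixed-point iteration for the maximum-correntropy location estimate:
    mu <- sum(w_i * x_i) / sum(w_i),  w_i = exp(-(x_i - mu)^2 / (2 sigma^2))."""
    mu = sorted(xs)[len(xs) // 2]        # start from (roughly) the median
    for _ in range(iters):
        w = [math.exp(-(x - mu) ** 2 / (2.0 * sigma ** 2)) for x in xs]
        mu = sum(wi * xi for wi, xi in zip(w, xs)) / sum(w)
    return mu

data = [0.1, -0.2, 0.05, 0.0, -0.1, 50.0]   # one impulsive outlier
mu_mcc = correntropy_mean(data)
mu_mean = sum(data) / len(data)
```

With a small bandwidth the Gaussian kernel assigns near-zero weight to the impulsive sample, which is the behavior the M-estimation weighting in the paper is meant to stabilize when the bandwidth is pushed low.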
7. Robust regression based genome-wide multi-trait QTL analysis. Mol Genet Genomics 2021; 296:1103-1119. [PMID: 34170407] [DOI: 10.1007/s00438-021-01801-1]
Abstract
In genome-wide quantitative trait locus (QTL) mapping studies, multiple quantitative traits are often measured along with the marker genotypes. Multi-trait QTL (MtQTL) analysis, which includes multiple quantitative traits together in a single model, is an efficient technique to increase the power of QTL identification. The two most widely used classical approaches for MtQTL mapping are Gaussian Mixture Model-based MtQTL (GMM-MtQTL) and Linear Regression Model-based MtQTL (LRM-MtQTL) analyses. There are two types of LRM-MtQTL approach known as least squares-based LRM-MtQTL (LS-LRM-MtQTL) and maximum likelihood-based LRM-MtQTL (ML-LRM-MtQTL). These three classical approaches are equivalent alternatives for QTL detection, but ML-LRM-MtQTL is computationally faster than GMM-MtQTL and LS-LRM-MtQTL. However, one major limitation common to all the above classical approaches is that they are very sensitive to outliers, which leads to misleading results. Therefore, in this study, we developed an LRM-based robust MtQTL approach, called LRM-RobMtQTL, for the backcross population based on the robust estimation of regression parameters by maximizing the β-likelihood function induced from the β-divergence with multivariate normal distribution. When β = 0, the proposed LRM-RobMtQTL method reduces to the classical ML-LRM-MtQTL approach. Simulation studies showed that both ML-LRM-MtQTL and LRM-RobMtQTL methods identified the same QTL positions in the absence of outliers. However, in the presence of outliers, only the proposed method was able to identify all the true QTL positions. Real data analysis results revealed that in the presence of outliers only our LRM-RobMtQTL approach can identify all the QTL positions as those identified in the absence of outliers by both methods. We conclude that our proposed LRM-RobMtQTL analysis approach outperforms the classical MtQTL analysis methods.
8. Individual differences in voice adaptability are specifically linked to voice perception skill. Cognition 2021; 210:104582. [PMID: 33450447] [DOI: 10.1016/j.cognition.2021.104582]
Abstract
There are remarkable individual differences in the ability to recognise individuals by the sound of their voice. Theoretically, this ability is thought to depend on the coding accuracy of voices in a low-dimensional "voice-space". Here we were interested in how adaptive coding of voice identity relates to this variability in skill. In two adaptation experiments we explored first whether the aftereffect size to two familiar vocal identities can predict voice perception ability and second, whether this effect stems from general auditory skill (e.g. discrimination ability for tuning and tempo). Experiment 1 demonstrated that contrastive aftereffect sizes for voice identity predicted voice perception ability. In Experiment 2, we replicated this finding and further established that this effect is unrelated to general auditory abilities or general adaptability of listeners. Our results highlight the important functional role of adaptive coding in voice expertise and suggest that human voice perception is a highly specialised and distinct auditory ability.
9. Preprocessing alternatives for compositional data related to water, sanitation and hygiene. Sci Total Environ 2020; 743:140519. [PMID: 32663686] [PMCID: PMC7316445] [DOI: 10.1016/j.scitotenv.2020.140519]
Abstract
The Sustainable Development Goals (SDGs) 6.1 and 6.2 measure the progress of urban and rural populations in their access to different levels of water, sanitation and hygiene (WASH) services, based on multiple sources of information. Service levels add up to 100%; they are therefore compositional data (CoDa). Despite evidence of zero values, missing data and outliers in the sources of information, the treatment of these irregularities with different statistical techniques has not yet been analyzed for CoDa in the WASH sector. As a result, estimates may be biased, and decisions based on them will not necessarily be appropriate. In this article, we therefore: i) evaluate methodological imputation alternatives that address the problem of having either zero values or missing values, or both simultaneously; and ii) propose complementing the point-by-point identification of the WHO/UNICEF Joint Monitoring Program (JMP) with other robust alternatives to deal with outliers, depending on the number of data points. These suggestions are examined using statistics for CoDa with the isometric log-ratio (ilr) transformation. A selection of illustrative cases is presented to compare the performance of the different alternatives.
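For reference, the ilr transformation used here maps a D-part composition to D-1 unconstrained real coordinates; a minimal sketch with standard pivot (balance) coordinates, which is one of several valid ilr bases:

```python
import math

def ilr(parts):
    """Isometric log-ratio transform of a D-part composition (all parts > 0),
    using pivot (balance) coordinates: each coordinate contrasts part i with
    the geometric mean of the remaining parts i+1..D-1."""
    d = len(parts)
    logs = [math.log(p) for p in parts]
    coords = []
    for i in range(d - 1):
        rest = d - 1 - i                          # number of remaining parts
        gmean_log = sum(logs[i + 1:]) / rest
        coords.append(math.sqrt(rest / (rest + 1.0)) * (logs[i] - gmean_log))
    return coords
```

Because ilr coordinates are built from log-ratios, they are invariant to the overall scale of the composition (closure), which is why zeros must be imputed before the transform can be applied: log(0) is undefined.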
10. Sustainability efficiency and carbon inequality of the Chinese transportation system: A Robust Bayesian Stochastic Frontier Analysis. J Environ Manage 2020; 260:110163. [PMID: 32090849] [DOI: 10.1016/j.jenvman.2020.110163]
Abstract
This study focuses on the sustainability efficiency of the Chinese transportation system by investigating the relationship between CO2 emission levels and the respective freight and passenger turnovers for each transportation mode from January 1999 to December 2017. A novel Robust Bayesian Stochastic Frontier Analysis (RBSFA) is developed by taking carbon inequality into account. In this model, the aggregated variance/covariance matrix for the three classical distributional assumptions on the inefficiency term (Gamma, Exponential, and Half-Normal) is minimized, yielding lower Deviance Information Criteria than each classical assumption separately. Results are controlled for the impact of major macro-economic variables related to fiscal policy, monetary policy, inflationary pressure, and economic activity. The results indicate that the Chinese transportation system shows high sustainability efficiency, with relatively small random fluctuations explained by macro-economic policies. The waterway, railway, and roadway modes improved the sustainability efficiency of freight traffic, while only the railway mode improved the sustainability efficiency of passenger traffic. The air mode, however, decreased the sustainability efficiency of both freight and passenger traffic. This research helps inform governmental policies based not only on the internal dynamics of carbon inequality among different transportation modes, but also on macro-economic impacts on the Chinese transportation sector.
11. Incorporating sampling weights into robust estimation of Cox proportional hazards regression model, with illustration in the Multi-Ethnic Study of Atherosclerosis. BMC Med Res Methodol 2020; 20:62. [PMID: 32169052] [PMCID: PMC7071747] [DOI: 10.1186/s12874-020-00945-9]
Abstract
BACKGROUND Cox proportional hazards regression models are used to evaluate associations between exposures of interest and time-to-event outcomes in observational data. When exposures are measured on only a sample of participants, as they are in a case-cohort design, the sampling weights must be incorporated into the regression model to obtain unbiased estimating equations. METHODS Robust Cox methods have been developed to better estimate associations when there are influential outliers in the exposure of interest, but these robust methods do not incorporate sampling weights. In this paper, we extend these robust methods, which already incorporate influence weights, so that they also accommodate sampling weights. RESULTS Simulations illustrate that in the presence of influential outliers, the association estimate from the weighted robust method is closer to the true value than the estimate from traditional weighted Cox regression. As expected, in the absence of outliers, the use of robust methods yields a small loss of efficiency. Using data from a case-cohort study that is nested within the Multi-Ethnic Study of Atherosclerosis (MESA) longitudinal cohort study, we illustrate differences between traditional and robust weighted Cox association estimates for the relationships between immune cell traits and risk of stroke. CONCLUSIONS Robust weighted Cox regression methods are a new tool to analyze time-to-event data with sampling, e.g. case-cohort data, when exposures of interest contain outliers.
12. The influence of trend estimation method on forecasting curriculum-based measurement of reading performance. J Sch Psychol 2019; 74:44-57. [PMID: 31213231] [DOI: 10.1016/j.jsp.2019.04.001]
Abstract
Estimating a trend line through words-read-correct-per-minute scores collected across successive weeks is a preferred method to evaluate student response to instruction with curriculum-based measurement of reading (CBM-R). This is in part because the slope of that line of best fit is used to predict the trajectory of student performance if the current intervention is maintained. In turn, trend lines should predict future scores with a high degree of accuracy when an intervention is maintained. We evaluated the forecasting accuracy of a trend estimation method currently used in practice (i.e., ordinary least squares) and five alternate methods recently evaluated in CBM-R simulation studies, using actual student data. Results suggest that the alternate trend estimation methods predicted future performance with accuracy similar to ordinary least squares trend lines across most conditions, with the exception of slopes estimated via Bayesian analysis. Bayesian trend lines estimated using informed prior distributions yielded noticeably less biased and more precise predictions when applied to short data series, relative to all other estimation methods across most conditions. These outcomes highlight the need to further explore the viability of Bayesian analysis for evaluating individual time series data.
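The abstract does not name the alternate trend estimators, so purely as an illustration of a robust alternative to OLS on weekly CBM-R scores, here is the Theil-Sen slope (the median of all pairwise slopes), which shrugs off a single bad testing day that visibly biases OLS:

```python
import statistics

def theil_sen_slope(x, y):
    """Median of all pairwise slopes: a classical robust trend estimator."""
    slopes = [(y[j] - y[i]) / (x[j] - x[i])
              for i in range(len(x)) for j in range(i + 1, len(x))
              if x[j] != x[i]]
    return statistics.median(slopes)

def ols_slope(x, y):
    """Ordinary least-squares slope for comparison."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / \
           sum((a - mx) ** 2 for a in x)

weeks = list(range(1, 9))
wcpm = [60 + 2 * w for w in weeks]   # steady gain of 2 words correct per minute/week
wcpm[3] = 20                          # one invalid administration (outlier)
ts = theil_sen_slope(weeks, wcpm)
ols = ols_slope(weeks, wcpm)
```

Because a single corrupted week affects only the pairwise slopes that involve it, the median of all pairs still lands on the true weekly growth rate, while the OLS slope is pulled away from it.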
13.
Abstract
A robust regression methodology is proposed via M-estimation. The approach adapts to the tail behavior and skewness of the distribution of the random error terms, providing for a reliable analysis under a broad class of distributions. This is accomplished by allowing the objective function, used to determine the regression parameter estimates, to be selected in a data driven manner. The asymptotic properties of the proposed estimator are established and a numerical algorithm is provided to implement the methodology. The finite sample performance of the proposed approach is exhibited through simulation and the approach was used to analyze two motivating datasets.
14. Quantifying mitochondrial DNA copy number using robust regression to interpret real time PCR results. BMC Res Notes 2017; 10:593. [PMID: 29132417] [PMCID: PMC5683470] [DOI: 10.1186/s13104-017-2913-1]
Abstract
BACKGROUND Real time PCR (rtPCR) is a quantitative assay to determine the relative DNA copy number in a sample versus a reference. The 2^(-ΔΔCt) method is the standard for analyzing the output data generated by an rtPCR experiment. We developed an alternative based on fitting a robust regression to the rtPCR signal. This new data analysis tool reduces potential biases and does not require all of the compared DNA fragments to have the same PCR efficiency. RESULTS Comparing the two methods on 96 identical PCR preparations showed similar distributions of the estimated copy numbers. Estimating the efficiency with the 2^(-ΔΔCt) method, however, required a dilution series, which is not necessary for the robust regression method. We used rtPCR to quantify mitochondrial DNA (mtDNA) copy numbers in three different tissue types: breast, colon and prostate. For each type, normal tissue and a tumor from the same three patients were analysed, giving a total of six samples. The mitochondrial copy number is estimated to lie between 200 and 300 copies per cell. Similar results are obtained with the robust regression and the 2^(-ΔΔCt) methods. Confidence ratios were slightly narrower for the robust regression. The new data analysis method has been implemented as an R package.
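For context, the standard method referred to above computes relative quantity as 2^(-ΔΔCt), which assumes every amplicon exactly doubles each cycle; an efficiency-corrected variant drops that assumption but needs per-amplicon efficiencies, classically estimated from a dilution series (function and argument names are illustrative):

```python
def ddct_ratio(ct_target_s, ct_ref_s, ct_target_c, ct_ref_c):
    """Classical 2^(-ddCt) relative quantification: assumes both the target and
    the reference amplicon exactly double each cycle (efficiency = 2)."""
    ddct = (ct_target_s - ct_ref_s) - (ct_target_c - ct_ref_c)
    return 2.0 ** (-ddct)

def efficiency_corrected_ratio(e_target, e_ref,
                               ct_target_s, ct_ref_s, ct_target_c, ct_ref_c):
    """Pfaffl-style variant with per-amplicon efficiencies (1 < e <= 2),
    classically estimated from a dilution series."""
    return (e_target ** (ct_target_c - ct_target_s)) / \
           (e_ref ** (ct_ref_c - ct_ref_s))
```

The robust-regression alternative described in the abstract sidesteps both the equal-efficiency assumption and the dilution series by fitting the amplification signal directly.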
15. Phenological patterns of flowering across biogeographical regions of Europe. Int J Biometeorol 2017; 61:1347-1358. [PMID: 28220255] [DOI: 10.1007/s00484-017-1312-6]
Abstract
Long-term changes in plant phenological phases, determined by complex interactions of environmental factors, are a focus of recent climate impact research, yet studies comparing biogeographical regions of Europe in terms of plant responses to climate are lacking. We examined the flowering phenology of plant species to identify spatio-temporal patterns in their responses to environmental variables over the period 1970-2010. Data were collected from 12 countries along a 3000-km-long, north-south transect from northern to eastern Central Europe, covering the biogeographical regions of Europe from Finland to Macedonia. Robust statistical methods were used to determine the most influential factors driving changes in the onset of flowering. Significant species-specific advancements in flowering onset were found in the Continental (3 to 8.3 days), Alpine (2 to 3.8 days) and, with the highest magnitude, Boreal biogeographical regions (2.2 to 9.6 days per decade), while less pronounced responses were detected in the Pannonian and Mediterranean regions. While most other studies use only mean temperature in their models, we show that the distributions of minimum and maximum temperatures are also worth considering as explanatory variables. Not just local factors (e.g. temperature) but also large-scale climate factors (e.g. the North Atlantic Oscillation), as well as altitude and latitude, play a significant role in the timing of flowering across the biogeographical regions of Europe. Our analysis gave evidence that species show a delay in the timing of flowering with increasing latitude (between 40.9 and 67.9 degrees) and an advance with the changing climate. The woody species (black locust and small-leaved lime) showed stronger advancements in their timing of flowering than the herbaceous species (dandelion, lily of the valley). In the later decades (1991-2010), more pronounced phenological change was detected than in the earlier years (1970-1990), indicating the increased influence of human-induced higher spring temperatures in the late twentieth century.
16. Robust activation detection methods for real-time and offline fMRI analysis. Comput Methods Programs Biomed 2017; 144:1-11. [PMID: 28494993] [DOI: 10.1016/j.cmpb.2017.03.015]
Abstract
We propose two novel contributions to fMRI activation analysis. The first is to apply confidence intervals to locate activations in real time, and the second is a new metric based on robust regression of fMRI signals. These contributions are implemented in our four proposed methods: the Instantaneous Activation Method (IAM) and the Instantaneous Activation Method with Past Blocks (IAMP) for real-time analysis, the Task Robust Regression Distance Method (TRRD) for the new robust-regression metric, and the Instantaneous Robust Regression Distance Method (IRRD) for both contributions. For comparison, a statistical offline method called the Task Activation Method (TAM) and a correlation analysis method are also implemented. The methods are initially evaluated with synthetic data generated using two different approaches: first, using varying hemodynamic response function signals to simulate a wide range of stimulus responses, along with Gaussian white noise; and second, using no-activity state data from a real fMRI experiment, which removes the need to generate noise. The methods are also tested on real fMRI experiments and compared with the results obtained by the widely used SPM tool. The results show that the instantaneous methods reveal activations that are lost statistically in an offline analysis. They also show further improvement from robust fitting, which minimizes the effect of outliers. TRRD has an area under the ROC curve of 0.7127 for very noisy synthetic images, reaching up to 0.9608 as the noise decreases, while the instantaneous methods score in the range of 0.6124 to 0.8019 at the same noise levels.
17.
Abstract
Finite mixture of regression (FMR) models can be reformulated as incomplete data problems and they can be estimated via the expectation-maximization (EM) algorithm. The main drawback is the strong parametric assumption such as FMR models with normal distributed residuals. The estimation might be biased if the model is misspecified. To relax the parametric assumption about the component error densities, a new method is proposed to estimate the mixture regression parameters by only assuming that the components have log-concave error densities but the specific parametric family is unknown. Two EM-type algorithms for the mixtures of regression models with log-concave error densities are proposed. Numerical studies are made to compare the performance of our algorithms with the normal mixture EM algorithms. When the component error densities are not normal, the new methods have much smaller MSEs when compared with the standard normal mixture EM algorithms. When the underlying component error densities are normal, the new methods have comparable performance to the normal EM algorithm.
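The classical baseline that the log-concave method relaxes is the normal-error mixture-of-regressions EM; a compact sketch for two components (the data, initialization, and names are illustrative assumptions, and EM is sensitive to starting values):

```python
import math

def em_mixture_regression(x, y, init, iters=100):
    """EM for a 2-component mixture of simple regressions with normal errors.
    init: two (intercept, slope) starting pairs."""
    comps = [{"a": a, "b": b, "s": 1.0, "pi": 0.5} for a, b in init]
    n = len(x)
    for _ in range(iters):
        # E-step: posterior responsibility of each component for each point
        resp = []
        for xi, yi in zip(x, y):
            dens = [c["pi"] / (c["s"] * math.sqrt(2 * math.pi))
                    * math.exp(-(yi - c["a"] - c["b"] * xi) ** 2 / (2 * c["s"] ** 2))
                    for c in comps]
            tot = sum(dens) or 1e-300
            resp.append([d / tot for d in dens])
        # M-step: responsibility-weighted least squares per component
        for k, c in enumerate(comps):
            w = [r[k] for r in resp]
            sw = sum(w)
            mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
            my = sum(wi * yi for wi, yi in zip(w, y)) / sw
            c["b"] = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y)) \
                   / sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
            c["a"] = my - c["b"] * mx
            c["s"] = max(1e-6, math.sqrt(sum(wi * (yi - c["a"] - c["b"] * xi) ** 2
                                             for wi, xi, yi in zip(w, x, y)) / sw))
            c["pi"] = sw / n
    return comps

# two lines, y = 1 + 2x and y = 10 - x, with small deterministic "noise"
grid = [i * 0.5 for i in range(20)]
x = grid + grid
y = [1 + 2 * xi + 0.1 * (-1) ** i for i, xi in enumerate(grid)] + \
    [10 - xi - 0.1 * (-1) ** i for i, xi in enumerate(grid)]
comps = em_mixture_regression(x, y, init=[(0.0, 1.5), (8.0, -0.5)])
```

The log-concave approach replaces the normal density in the E-step with a nonparametric log-concave estimate, which is what protects the fit when the true error distribution is skewed or heavy-tailed.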
18. A comprehensive review of group level model performance in the presence of heteroscedasticity: Can a single model control Type I errors in the presence of outliers? Neuroimage 2016; 147:658-668. [PMID: 28030782] [DOI: 10.1016/j.neuroimage.2016.12.058]
Abstract
Even after thorough preprocessing and a careful time series analysis of functional magnetic resonance imaging (fMRI) data, artifact and other issues can lead to violations of the assumption that the variance is constant across subjects in the group level model. This is especially concerning when modeling a continuous covariate at the group level, as the slope is easily biased by outliers. Various models have been proposed to deal with outliers, including models that use the first level variance or the group level residual magnitude to differentially weight subjects. The most commonly used robust regression, which implements a robust estimator of the regression slope, has been studied previously in the context of fMRI and was found to perform well in some scenarios, although a loss of Type I error control can occur in some outlier settings. A second type of robust regression, using a heteroscedastic autocorrelation consistent (HAC) estimator that produces robust slope and variance estimates, has been shown to perform well with better Type I error control, but only with large sample sizes (500-1000 subjects). Its Type I error control with smaller sample sizes has not been studied, nor has it been compared to other modeling approaches that handle outliers, such as FSL's Flame 1 and FSL's outlier de-weighting. Focusing on group level inference with a continuous covariate over a range of sample sizes and degrees of heteroscedasticity, which can be driven either by the within- or between-subject variability, we compare both styles of robust regression to ordinary least squares (OLS), FSL's Flame 1, Flame 1 with its outlier de-weighting algorithm, and Kendall's Tau. Additionally, subject omission using the Cook's Distance measure with OLS, and nonparametric inference with the OLS statistic, are studied.
Pros and cons of these models as well as general strategies for detecting outliers in data and taking precaution to avoid inflated Type I error rates are discussed.
Collapse
|
19
|
Evaluation of the Relationship between Social Desirability and Minor Psychiatric Disorders among Nurses in Southern Iran: A Robust Regression Approach. INTERNATIONAL JOURNAL OF COMMUNITY BASED NURSING AND MIDWIFERY 2015; 3:301-8. [PMID: 26448957 PMCID: PMC4591569] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Download PDF] [Subscribe] [Scholar Register] [Received: 03/10/2015] [Revised: 05/09/2015] [Accepted: 06/01/2015] [Indexed: 11/06/2022]
Abstract
BACKGROUND Social desirability may affect different aspects of people's quality of life. One of the impressive dimensions of quality of life is mental health. The prevalence of Minor Psychiatric Disorders (MPD) among health care workers is higher than other health workers. This article aims at evaluating the relationship between social desirability and MPD among nurses in southern Iran. METHOD A cross-sectional study was carried out on 765 nurses who had been employed in hospitals in the southern provinces of Iran. The 12-item General Health Questionnaire (GHQ-12) and Marlowe-Crowne Social Desirability Scale (MC-SDS) were used for evaluating the MPD and social desirability in nurses, respectively. The Robust Regression was used to determine any quantified relationship between social desirability and the level of MPD with adjusted age, gender, work experience, marital status, and level of education. RESULT The mean scores of GHQ-12 and MC-SDS were 13.02±5.64 (out of 36) and 20.17±4.76 (out of 33), respectively. The result of Robust Regression indicated that gender and social desirability were statistically significant in affecting MPD. CONCLUSION The prevalence of MPD in female nurses was higher than males. Nurses with higher social desirability scores had the tendency to report lower levels of MPD.
Collapse
|
20
|
Robust regression methods for real-time polymerase chain reaction. Anal Biochem 2015; 480:34-6. [PMID: 25862086 DOI: 10.1016/j.ab.2015.04.001] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/06/2015] [Revised: 03/27/2015] [Accepted: 04/02/2015] [Indexed: 11/24/2022]
Abstract
Current real-time polymerase chain reaction (PCR) data analysis methods implement linear least squares regression for primer efficiency estimation based on standard curve dilution series. This approach is sensitive to outliers, which distort the outcome and are often ignored or simply removed by the end user. Here, robust regression methods are shown to provide a reliable alternative: they are less affected by outliers and often yield more precise primer efficiency estimates than linear least squares.
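A minimal sketch of the idea: fit the standard curve (Cq versus log10 of template amount) with a robust regressor instead of least squares, then derive the efficiency from the slope via efficiency = 10^(-1/slope) - 1. The hand-rolled Tukey-bisquare IRLS fit and the hypothetical triplicate dilution series below are illustrative assumptions, not the specific robust methods or data used in the paper:

```python
import numpy as np

def bisquare_line(x, y, c=4.685, iters=50):
    """Fit y ~ a + b*x by iteratively reweighted least squares with
    Tukey bisquare weights, so gross outliers receive zero weight."""
    b, a = np.polyfit(x, y, 1)  # OLS start (polyfit returns slope first)
    for _ in range(iters):
        r = y - (a + b * x)
        s = max(np.median(np.abs(r - np.median(r))) / 0.6745, 1e-6)  # MAD scale
        u = np.abs(r) / (c * s)
        w = np.where(u < 1, (1 - u**2) ** 2, 0.0)
        A = np.column_stack([np.ones_like(x), x]) * np.sqrt(w)[:, None]
        a, b = np.linalg.lstsq(A, y * np.sqrt(w), rcond=None)[0]
    return a, b

# Hypothetical standard curve: Cq for triplicate 10-fold dilutions,
# with one aberrant reaction at the most dilute point.
log10_conc = np.repeat([5.0, 4.0, 3.0, 2.0, 1.0], 3)
cq = np.array([15.0, 15.2, 15.1, 18.4, 18.3, 18.5, 21.7, 21.6, 21.8,
               25.0, 24.9, 25.1, 28.3, 28.4, 32.5])  # last value is an outlier
_, slope = bisquare_line(log10_conc, cq)
efficiency = 10 ** (-1.0 / slope) - 1  # 100% efficiency <=> slope -3.32
print(f"slope = {slope:.2f}, efficiency = {efficiency:.1%}")
```

Because the bisquare weight redescends to zero, the aberrant replicate is effectively excluded without any manual outlier removal by the end user.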
Collapse
|
21
|
Abstract
Authors have observed that the distribution of medical expenditures has features that do not lend it to parametric modeling and can present significant challenges for least-squares-type estimators, even on a logarithmic scale. In this note, we discuss caveats and extensions of coefficient estimation in the bivariate accelerated lifetime model of medical cost and survival time on covariates. We consider the setting where medical cost is observed only when the event occurs and potential right-censoring of the event time induces a dependent censoring mechanism on cost. We adopt Huang's (2002) estimation framework using the weighted log-rank estimating equations and investigate his proposal for robust mark-scale coefficient estimation. Due to modeling restrictions on the joint distribution of survival time and cost, we conclude that his robust mark-scale coefficient estimator would benefit from a time-scale adjustment. We use basic principles from robust estimation to define a new weighted marked process that subsequently leads to a new time-corrected robust regression calibration estimator. Our simulation studies illustrate how the proposed estimator has desirable operating characteristics, including reduced sensitivity to extreme values in the cost distribution, smaller finite sample bias and variance than earlier proposals. We illustrate the method in an analysis of lifetime medical cost data from a lung cancer study conducted by the Southwest Oncology Group.
Collapse
|
22
|
Robust regression for large-scale neuroimaging studies. Neuroimage 2015; 111:431-41. [PMID: 25731989 DOI: 10.1016/j.neuroimage.2015.02.048] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/06/2014] [Revised: 02/09/2015] [Accepted: 02/19/2015] [Indexed: 02/06/2023] Open
Abstract
Multi-subject datasets used in neuroimaging group studies have a complex structure, as they exhibit non-stationary statistical properties across regions and display various artifacts. While studies with small sample sizes can rarely be shown to deviate from standard hypotheses (such as the normality of the residuals) due to the poor sensitivity of normality tests with low degrees of freedom, large-scale studies (e.g. >100 subjects) exhibit more obvious deviations from these hypotheses and call for more refined models for statistical inference. Here, we demonstrate the benefits of robust regression as a tool for analyzing large neuroimaging cohorts. First, we use an analytic test based on robust parameter estimates; based on simulations, this procedure is shown to provide an accurate statistical control without resorting to permutations. Second, we show that robust regression yields more detections than standard algorithms using as an example an imaging genetics study with 392 subjects. Third, we show that robust regression can avoid false positives in a large-scale analysis of brain-behavior relationships with over 1500 subjects. Finally we embed robust regression in the Randomized Parcellation Based Inference (RPBI) method and demonstrate that this combination further improves the sensitivity of tests carried out across the whole brain. Altogether, our results show that robust procedures provide important advantages in large-scale neuroimaging group studies.
Collapse
|
23
|
Calibration transfer between electronic nose systems for rapid in situ measurement of pulp and paper industry emissions. Anal Chim Acta 2014; 841:58-67. [PMID: 25109862 DOI: 10.1016/j.aca.2014.05.054] [Citation(s) in RCA: 38] [Impact Index Per Article: 3.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/09/2013] [Revised: 05/29/2014] [Accepted: 05/31/2014] [Indexed: 11/23/2022]
Abstract
Electronic nose systems, when deployed in a network mesh, can provide a low-budget, on-site solution for measuring obnoxious industrial gases. For all the electronic nose systems to deliver accurate and identical predictions, a reliable calibration transfer model needs to be implemented to overcome the inherent variability between sensor arrays. In this work, robust regression (RR) is used for calibration transfer between two electronic nose systems using a Box-Behnken (BB) design. One of the two electronic nose systems was trained on industrial gas samples with four artificial neural network models for the measurement of obnoxious odours emitted from pulp and paper industries. The emissions consist mainly of hydrogen sulphide (H2S), methyl mercaptan (MM), dimethyl sulphide (DMS) and dimethyl disulphide (DMDS) in different proportions. A Box-Behnken design consisting of 27 experiment sets based on synthetic gas combinations of H2S, MM, DMS and DMDS was conducted for calibration transfer between two identical electronic nose systems. Identical sensors on both systems were mapped, and the prediction models developed using the ANNs were then transferred to the second system using the BB-RR methodology. The results showed successful transfer of the prediction models developed for one system to the other, with mean absolute errors between the actual and predicted analyte concentrations (in mg L⁻¹) after calibration transfer (on the second system) of 0.076, 0.1801, 0.0329 and 0.427 for DMS, DMDS, MM and H2S, respectively.
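The per-sensor mapping step can be sketched as follows. This toy uses a Theil-Sen fit as a stand-in robust regressor on simulated responses for four mapped sensors; the paper's exact RR formulation, sensor data, and ANN models are not reproduced here, and the 0.9/0.05 gain-offset mismatch and single faulty reading are invented for illustration:

```python
import numpy as np
from itertools import combinations

def theil_sen(x, y):
    """Robust line fit: median of pairwise slopes, then median intercept."""
    slopes = [(y[j] - y[i]) / (x[j] - x[i])
              for i, j in combinations(range(len(x)), 2) if x[j] != x[i]]
    b = np.median(slopes)
    a = np.median(y - b * x)
    return a, b

# Simulated calibration-transfer data: 27 shared calibration runs (the
# size of a four-factor Box-Behnken design), four mapped sensor pairs.
rng = np.random.default_rng(1)
master = rng.uniform(0.2, 1.0, (27, 4))                  # "trained" system
slave = 0.9 * master + 0.05 + rng.normal(0, 0.005, (27, 4))
slave[3, 2] += 0.5                                       # one faulty reading

# Learn master ~ a + b * slave per sensor, then correct the slave data so
# models trained on the master system can score readings from the slave.
corrected = np.empty_like(slave)
for j in range(4):
    a, b = theil_sen(slave[:, j], master[:, j])
    corrected[:, j] = a + b * slave[:, j]
```

After correction, the slave readings line up with the master system on all calibration runs except the corrupted one, which the robust fit simply ignores when learning the map.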
Collapse
|
24
|
A comparison of nonparametric and parametric methods to adjust for baseline measures. Contemp Clin Trials 2014; 37:225-33. [PMID: 24462567 DOI: 10.1016/j.cct.2014.01.002] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/13/2013] [Revised: 01/08/2014] [Accepted: 01/13/2014] [Indexed: 11/28/2022]
Abstract
When analyzing a randomized controlled trial, we may employ various statistical methods to adjust for baseline measures. Depending on the method chosen, inferential results can vary. We investigate the Type I error and statistical power of tests comparing treatment outcomes based on parametric and nonparametric methods. We also explore increasing levels of correlation between the baseline and changes from baseline, with and without underlying normality. These methods are illustrated and compared via simulations.
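As a small illustration of why the choice of adjustment matters (this is not the paper's simulation design; rho, n, and the number of replicates are arbitrary assumptions): when the baseline-outcome correlation is strictly between 0 and 1, the baseline-adjusted (ANCOVA-style) estimator of a null treatment effect has smaller sampling variance than either the change-score or post-only estimators:

```python
import numpy as np

# Simulate a two-arm trial with correlated baseline b and outcome y,
# zero true treatment effect, and compare the sampling variance of
# three common estimators of the treatment effect.
rng = np.random.default_rng(42)
rho, n, reps = 0.7, 50, 2000
post, change, ancova = [], [], []
for _ in range(reps):
    b = rng.normal(0, 1, 2 * n)
    y = rho * b + np.sqrt(1 - rho**2) * rng.normal(0, 1, 2 * n)
    t = np.repeat([1.0, 0.0], n)                    # treatment indicator
    post.append(y[:n].mean() - y[n:].mean())        # post-only difference
    d = y - b
    change.append(d[:n].mean() - d[n:].mean())      # change-score difference
    X = np.column_stack([np.ones(2 * n), t, b])
    ancova.append(np.linalg.lstsq(X, y, rcond=None)[0][1])  # baseline-adjusted

print(np.var(post), np.var(change), np.var(ancova))
```

The empirical variances track the textbook proportionality of 1, 2(1 - rho), and 1 - rho² respectively, which is one reason inferential results can differ across adjustment methods.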
Collapse
|
25
|
Abstract
Large- and finite-sample efficiency and resistance to outliers are the key goals of robust statistics. Although these are often not simultaneously attainable, we develop and study a linear regression estimator that comes close. Efficiency derives from the estimator's close connection to generalized empirical likelihood, and its favorable robustness properties are obtained by constraining the associated sum of (weighted) squared residuals. We prove the maximum attainable finite-sample replacement breakdown point and full asymptotic efficiency for normal errors. Simulation evidence shows that, compared to existing robust regression estimators, the new estimator has relatively high efficiency for small sample sizes and comparable outlier resistance. The estimator is further illustrated and compared to existing methods via application to a real data set with purported outliers.
Collapse
|
26
|
Factors associated with aerobic fitness in adolescents with asthma. Respir Med 2013; 107:1164-71. [PMID: 23632101 DOI: 10.1016/j.rmed.2013.04.009] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 11/16/2012] [Revised: 04/01/2013] [Accepted: 04/08/2013] [Indexed: 11/20/2022]
Abstract
BACKGROUND In adolescents with asthma, information on factors associated with cardiorespiratory fitness levels is limited. The present study aimed to determine whether objectively measured physical activity, as well as potentially relevant factors such as lung function, asthma exacerbations, use of inhaled corticosteroids or skin fold thickness, are associated with direct measurements of peak oxygen uptake (V˙O2peak) in adolescents with asthma. METHODS From a nested case-control study at 13 years of age in the Environment and Childhood Asthma birth cohort study in Oslo, Norway, 86 13-year-old adolescents with and 76 without asthma performed maximal treadmill running with V˙O2peak measured. The sum of four skin fold thicknesses was recorded, after which participants wore an activity monitor for four consecutive days. Lung function was measured by maximum forced expiratory flow-volume curves and body plethysmography. Asthma exacerbations and use of medication were registered by structured parental interview. Data were analysed using multiple regression analysis. RESULTS Vigorous physical activity (coefficient with 95% confidence interval: 1.73 (0.32, 3.14)) and skin fold thickness (-0.35 (-0.41, -0.28)) were significantly associated with V˙O2peak in adolescents with asthma. Neither use of inhaled corticosteroids, lung function nor number of asthma exacerbations was associated with V˙O2peak when taking physical activity and skin fold thickness into account. In the adolescents without asthma, only skin fold thickness was negatively associated with V˙O2peak (-3.5 (-4.1, -2.8)). CONCLUSIONS V˙O2peak appears to be determined by vigorous physical activity level in Norwegian adolescents with asthma and not by asthma-related factors such as use of inhaled corticosteroids, lung function or number of asthma exacerbations.
Collapse
|
27
|
Abstract
Robust variable selection procedures through penalized regression have been gaining increased attention in the literature. They can be used to perform variable selection and are expected to yield robust estimates. However, to the best of our knowledge, the robustness of those penalized regression procedures has not been well characterized. In this paper, we propose a class of penalized robust regression estimators based on exponential squared loss. The motivation for this new procedure is that it enables us to characterize its robustness that has not been done for the existing procedures, while its performance is near optimal and superior to some recently developed methods. Specifically, under defined regularity conditions, our estimators are [Formula: see text] and possess the oracle property. Importantly, we show that our estimators can achieve the highest asymptotic breakdown point of 1/2 and that their influence functions are bounded with respect to the outliers in either the response or the covariate domain. We performed simulation studies to compare our proposed method with some recent methods, using the oracle method as the benchmark. We consider common sources of influential points. Our simulation studies reveal that our proposed method performs similarly to the oracle method in terms of the model error and the positive selection rate even in the presence of influential points. In contrast, other existing procedures have a much lower non-causal selection rate. Furthermore, we re-analyze the Boston Housing Price Dataset and the Plasma Beta-Carotene Level Dataset that are commonly used examples for regression diagnostics of influential points. Our analysis unravels the discrepancies of using our robust method versus the other penalized regression method, underscoring the importance of developing and applying robust penalized regression methods.
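The unpenalized core of this loss is easy to sketch: minimizing the sum of 1 - exp(-r_i²/γ) over observations by iteratively reweighted least squares gives each observation the redescending weight exp(-r_i²/γ). The toy fit below (with an assumed γ and simulated data; the paper's variable-selecting estimator adds a sparsity penalty on top of this loss) shows the resulting resistance to gross outliers:

```python
import numpy as np

def exp_squared_fit(x, y, gamma=1.0, iters=100):
    """Linear fit minimizing sum(1 - exp(-r_i^2 / gamma)) via IRLS;
    the weight exp(-r^2/gamma) redescends to zero for gross outliers."""
    X = np.column_stack([np.ones_like(x), x])
    beta = np.linalg.lstsq(X, y, rcond=None)[0]  # OLS start
    for _ in range(iters):
        r = y - X @ beta
        w = np.exp(-r**2 / gamma)                # redescending weights
        A = X * w[:, None]                       # weighted design
        beta = np.linalg.solve(X.T @ A, A.T @ y) # weighted normal equations
    return beta

rng = np.random.default_rng(7)
x = np.linspace(0, 5, 60)
y = 1.0 + 2.0 * x + rng.normal(0, 0.1, 60)
y[::20] += 8.0                                   # three gross outliers
intercept, slope = exp_squared_fit(x, y)
print(f"intercept = {intercept:.2f}, slope = {slope:.2f}")  # near 1 and 2
```

The weights of the three contaminated points collapse to essentially zero within a couple of iterations, which is the bounded-influence behavior the abstract describes for outliers in the response domain.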
Collapse
|