1
The State of the Art of Data Mining Algorithms for Predicting the COVID-19 Pandemic. AXIOMS 2022. [DOI: 10.3390/axioms11050242] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 01/25/2023]
Abstract
Current computer systems are accumulating huge amounts of information in several application domains. The outbreak of COVID-19 has rekindled interest in the use of data mining techniques for analyzing the factors related to the emergence of an epidemic. Data mining techniques are being used in the analysis and interpretation of information, which helps in discovering patterns, planning isolation policies, and even predicting how fast contagion spreads in a viral disease such as COVID-19. This research provides a comprehensive study of the data mining algorithms that are used in conjunction with epidemiological prediction models. Based on a comparative study of 35 research papers, the paper concludes that there is an opportunity to improve or develop data mining tools that offer an accurate prognosis in the management of viral diseases.
2
Using Nursing Information and Data Mining to Explore the Factors That Predict Pressure Injuries for Patients at the End of Life. Comput Inform Nurs 2019; 37:133-141. [PMID: 30418245 DOI: 10.1097/cin.0000000000000489] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Indexed: 11/26/2022]
Abstract
This study investigated the association between patient characteristics and the occurrence of pressure injuries for patients at the end of life. A retrospective study was conducted using data collected from 2062 patients at the end of life between January 2007 and October 2015. In addition to demographic data and pressure injury risk assessment scale scores, injury history, disease type, and length of hospitalization were revealed as the major independent variables for predicting the occurrence of pressure injuries. Both χ² tests and t tests were employed for binary variable analysis, and logistic regression was used to conduct multivariate analysis. Classification models were formulated through decision tree analysis, backpropagation neural network, and support vector machine algorithms. The rules obtained using the decision tree algorithm were analyzed and interpreted. The accuracy rate, sensitivity, and specificity of the decision tree, backpropagation neural network, and support vector machine algorithms were 77.15%, 79.54%, and 74.76%; 78.12%, 81.37%, and 74.85%; and 79.32%, 81.03%, and 78.75%, respectively. The predictive factors, ranked in order of importance, were history of pressure injuries, absence of cancer, excretion, activity/mobility, and skin condition/circulation. These were the primary shared risk factors among the four models used in this study.
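For reference, the three figures reported per classifier in the abstract above (accuracy, sensitivity, specificity) can all be derived from a confusion matrix. The following is a minimal sketch in Python; `classification_metrics` is a hypothetical helper name, and the counts are made-up illustrative values, not data from the study.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return (accuracy, sensitivity, specificity) from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total      # share of all cases classified correctly
    sensitivity = tp / (tp + fn)      # true positive rate
    specificity = tn / (tn + fp)      # true negative rate
    return accuracy, sensitivity, specificity


# Hypothetical counts for illustration only (not taken from the study).
acc, sens, spec = classification_metrics(tp=80, fp=25, tn=75, fn=20)
print(f"accuracy={acc:.2%} sensitivity={sens:.2%} specificity={spec:.2%}")
```

Reporting all three values matters when classes are imbalanced, because accuracy alone can look high for a model that rarely detects the positive class.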
3
Mansiaux Y, Carrat F. Detection of independent associations in a large epidemiologic dataset: a comparison of random forests, boosted regression trees, conventional and penalized logistic regression for identifying independent factors associated with H1N1pdm influenza infections. BMC Med Res Methodol 2014; 14:99. [PMID: 25154404 PMCID: PMC4146451 DOI: 10.1186/1471-2288-14-99] [Citation(s) in RCA: 27] [Impact Index Per Article: 2.7] [Received: 04/24/2014] [Accepted: 08/14/2014] [Indexed: 12/19/2022] Open
Abstract
Background: Big data is steadily growing in epidemiology. We explored the performance of methods dedicated to big data analysis for detecting independent associations between exposures and a health outcome. Methods: We searched for associations between 303 covariates and influenza infection in 498 subjects (14% infected) sampled from a dedicated cohort. Independent associations were detected using two data mining methods, Random Forests (RF) and Boosted Regression Trees (BRT); the conventional logistic regression framework (Univariate Followed by Multivariate Logistic Regression, UFMLR); and the Least Absolute Shrinkage and Selection Operator (LASSO) with penalty in multivariate logistic regression to achieve a sparse selection of covariates. We developed permutation tests to assess the statistical significance of associations. We simulated 500 similarly sized datasets to estimate the True (TPR) and False (FPR) Positive Rates associated with these methods. Results: Between 3 and 24 covariates (1%-8%) were identified as associated with influenza infection depending on the method. The pre-seasonal haemagglutination inhibition antibody titer was the only covariate selected by all methods, while 266 (87%) covariates were not selected by any method. At the 5% nominal significance level, the TPR were 85% with RF, 80% with BRT, 26% to 49% with UFMLR, and 71% to 78% with LASSO. Conversely, the FPR were 4% with RF and BRT, 9% to 2% with UFMLR, and 9% to 4% with LASSO. Conclusions: Data mining methods and LASSO should be considered as valuable methods to detect independent associations in large epidemiologic datasets.
Affiliation(s)
- Yohann Mansiaux
- INSERM, UMR_S 1136, Institut Pierre Louis d'Epidémiologie et de Santé Publique, F-75013 Paris, France.
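The abstract above mentions permutation tests for assessing the significance of associations but does not describe their construction. As a generic illustration only, and not the authors' procedure, the sketch below estimates a two-sided permutation p-value for the difference in a covariate's mean between two outcome groups. The function name `permutation_p_value`, the choice of test statistic, and the sample values are all assumptions made for this example.

```python
import random


def permutation_p_value(values, labels, n_permutations=2000, seed=0):
    """Two-sided permutation p-value for the difference in group means.

    `labels` holds 0/1 group membership; both groups must be non-empty.
    """
    rng = random.Random(seed)

    def mean_diff(groups):
        g1 = [v for v, g in zip(values, groups) if g == 1]
        g0 = [v for v, g in zip(values, groups) if g == 0]
        return sum(g1) / len(g1) - sum(g0) / len(g0)

    observed = abs(mean_diff(labels))
    shuffled = list(labels)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # break any real link between covariate and outcome
        if abs(mean_diff(shuffled)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_permutations + 1)


# Made-up values for illustration: group 1 is clearly shifted upward.
values = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
print(permutation_p_value(values, labels))
```

Shuffling the labels rather than the values preserves the group sizes, so every permuted dataset has the same case/control balance as the observed one.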
4
Oquendo MA, Baca-Garcia E, Artés-Rodríguez A, Perez-Cruz F, Galfalvy HC, Blasco-Fontecilla H, Madigan D, Duan N. Machine learning and data mining: strategies for hypothesis generation. Mol Psychiatry 2012; 17:956-9. [PMID: 22230882 DOI: 10.1038/mp.2011.173] [Citation(s) in RCA: 62] [Impact Index Per Article: 5.2] [Indexed: 11/09/2022]
Abstract
Strategies for generating knowledge in medicine have included observation of associations in clinical or research settings and, more recently, development of pathophysiological models based on molecular biology. Although critically important, they limit hypothesis generation to an incremental pace. Machine learning and data mining are alternative approaches to identifying new vistas to pursue, as is already evident in the literature. In concert with these analytic strategies, novel approaches to data collection can enhance the hypothesis pipeline as well. In data farming, data are obtained in an 'organic' way, in the sense that they are entered by patients themselves and are available for harvesting. In contrast, in evidence farming (EF), it is the provider who enters medical data about individual patients. EF differs from regular electronic medical record systems because frontline providers can use it to learn from their own past experience. In addition to the possibility of generating large databases with farming approaches, it is likely that we can further harness the power of large data sets collected using either farming or more standard techniques through implementation of data-mining and machine-learning strategies. Exploiting large databases to develop new hypotheses regarding neurobiological and genetic underpinnings of psychiatric illness is useful in itself, but also affords the opportunity to identify novel mechanisms to be targeted in drug discovery and development.
Affiliation(s)
- M A Oquendo
- Department of Psychiatry, New York State Psychiatric Institute and Columbia University, New York, NY 10032, USA.
5
Seo ST, Lee IH, Son CS, Park HJ, Park HS, Yoon HJ, Kim YN. Support Vector Regression-based Model to Analyze Prognosis of Infants with Congenital Muscular Torticollis. Healthc Inform Res 2011; 16:224-30. [PMID: 21818442 PMCID: PMC3091983 DOI: 10.4258/hir.2010.16.4.224] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Received: 05/10/2010] [Accepted: 10/20/2010] [Indexed: 11/23/2022] Open
Abstract
Objectives: Congenital muscular torticollis, a common disorder that refers to the shortening of the sternocleidomastoid in infants, is sensitive to correction through physical therapy when treated early. If physical therapy is unsuccessful, surgery is required. In this study, we developed a support vector regression model for congenital muscular torticollis to investigate the prognosis of the physical therapy treatment in infants. Methods: Fifty-nine infants with congenital muscular torticollis received physical therapy until the degree of neck tilt was less than 5°. After treatment, the mass diameter was reevaluated. Based on the data, a support vector regression model was applied to predict the prognoses. Results: 10-, 20-, and 50-fold cross-tabulation analyses for the proposed model were conducted based on support vector regression and a conventional multi-regression method based on least squares. The proposed method based on support vector regression was robust and enabled the effective analysis of even a small amount of data containing outliers. Conclusions: The developed support vector regression model is an effective prognostic tool for infants with congenital muscular torticollis who receive physical therapy.
Affiliation(s)
- Suk-Tae Seo
- Biomedical Information Technology Center, Keimyung University, Daegu, Korea
6
Trifirò G, Pariente A, Coloma PM, Kors JA, Polimeni G, Miremont-Salamé G, Catania MA, Salvo F, David A, Moore N, Caputi AP, Sturkenboom M, Molokhia M, Hippisley-Cox J, Acedo CD, van der Lei J, Fourrier-Reglat A. Data mining on electronic health record databases for signal detection in pharmacovigilance: which events to monitor? Pharmacoepidemiol Drug Saf 2010; 18:1176-84. [PMID: 19757412 DOI: 10.1002/pds.1836] [Citation(s) in RCA: 136] [Impact Index Per Article: 9.7] [Indexed: 12/25/2022]
Abstract
PURPOSE Data mining on electronic health records (EHRs) has emerged as a promising complementary method for post-marketing drug safety surveillance. The EU-ADR project, funded by the European Commission, is developing techniques that allow mining of EHRs for adverse drug events across different countries in Europe. Since mining on all possible events was considered to unduly increase the number of spurious signals, we wanted to create a ranked list of high-priority events. METHODS Scientific literature, medical textbooks, and websites of regulatory agencies were reviewed to create a preliminary list of events that are deemed important in pharmacovigilance. Two teams of pharmacovigilance experts independently rated each event on five criteria: 'trigger for drug withdrawal', 'trigger for black box warning', 'leading to emergency department visit or hospital admission', 'probability of event to be drug-related', and 'likelihood of death'. In case of disagreement, a consensus score was obtained. Ordinal scales between 0 and 3 were used for rating the criteria, and an overall score was computed to rank the events. RESULTS An initial list comprising 23 adverse events was identified. After rating all the events and calculation of overall scores, a ranked list was established. The top-ranking events were: cutaneous bullous eruptions, acute renal failure, anaphylactic shock, acute myocardial infarction, and rhabdomyolysis. CONCLUSIONS A ranked list of 23 adverse drug events judged as important in pharmacovigilance was created to permit focused data mining. The list will need to be updated periodically as knowledge on drug safety evolves and new issues in drug safety arise.
Affiliation(s)
- Gianluca Trifirò
- Department of Medical Informatics, Erasmus University Medical Center, Rotterdam, The Netherlands.
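The rating scheme in the abstract above (five criteria, each scored on an ordinal 0 to 3 scale, combined into an overall score used for ranking) can be sketched as follows. The abstract does not say how the overall score was computed, so summing the five scores is an assumption here; `rank_events` is a hypothetical helper, and the event names and scores are placeholders rather than the study's ratings.

```python
# The five criteria named in the abstract.
CRITERIA = (
    "trigger for drug withdrawal",
    "trigger for black box warning",
    "leading to emergency department visit or hospital admission",
    "probability of event to be drug-related",
    "likelihood of death",
)


def rank_events(ratings):
    """Rank events by the sum of their five 0-3 criterion scores.

    Summation is an assumed aggregation rule, not one stated in the abstract.
    """
    for event, scores in ratings.items():
        if len(scores) != len(CRITERIA) or any(s not in (0, 1, 2, 3) for s in scores):
            raise ValueError(f"invalid ratings for {event!r}")
    # Highest overall score first; ties keep their input order.
    return sorted(ratings, key=lambda event: sum(ratings[event]), reverse=True)


# Placeholder events and scores, for illustration only.
example = {
    "event A": [3, 2, 3, 2, 3],
    "event B": [1, 1, 2, 1, 0],
    "event C": [2, 3, 2, 2, 1],
}
print(rank_events(example))
```

A weighted sum would be an equally plausible aggregation if some criteria, such as likelihood of death, were meant to dominate the ranking.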
7
Baca-Garcia E, Vaquero-Lorenzo C, Perez-Rodriguez MM, Gratacòs M, Bayés M, Santiago-Mozos R, Leiva-Murillo JM, de Prado-Cumplido M, Artes-Rodriguez A, Ceverino A, Diaz-Sastre C, Fernandez-Navarro P, Costas J, Fernandez-Piqueras J, Diaz-Hernandez M, de Leon J, Baca-Baldomero E, Saiz-Ruiz J, Mann JJ, Parsey RV, Carracedo A, Estivill X, Oquendo MA. Nucleotide variation in central nervous system genes among male suicide attempters. Am J Med Genet B Neuropsychiatr Genet 2010; 153B:208-13. [PMID: 19455598 DOI: 10.1002/ajmg.b.30975] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Indexed: 01/01/2023]
Abstract
Despite marked morbidity and mortality associated with suicidal behavior, accurate identification of individuals at risk remains elusive. The goal of this study is to identify a model based on single nucleotide polymorphisms (SNPs) that discriminates between suicide attempters and non-attempters using data mining strategies. We examined functional SNPs (n = 840) of 312 brain function and development genes using data mining techniques. Two hundred seventy-seven male psychiatric patients aged 18 years or older were recruited at a University hospital psychiatric emergency room or psychiatric short stay unit. The main outcome measure was history of suicide attempts. Three SNPs of three genes (rs10944288, HTR1E; hCV8953491, GABRP; and rs707216, ACTN2) correctly classified 67% of male suicide attempters and non-attempters (0.50 sensitivity, 0.82 specificity, positive likelihood ratio = 2.80, negative likelihood ratio = 1.64). The OR for the combined three SNPs was 4.60 (95% CI: 1.31-16.10). The model's accuracy suggests that in the future similar methodologies may generate simple genetic tests with diagnostic utility in identification of suicide attempters. This strategy may uncover new pathophysiological pathways regarding the neurobiology of suicidal acts.
Affiliation(s)
- Enrique Baca-Garcia
- Department of Psychiatry at Fundacion Jimenez Diaz Hospital, Autonoma University, Madrid, Spain.
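The likelihood ratios quoted in the abstract above can be related to the reported sensitivity and specificity using the conventional definitions LR+ = sensitivity / (1 − specificity) and LR− = (1 − sensitivity) / specificity. The sketch below applies them to the rounded values from the abstract; `likelihood_ratios` is a hypothetical helper name. With these rounded inputs the conventional formulas give about 2.78 and 0.61, while the reported negative likelihood ratio of 1.64 matches the reciprocal, specificity / (1 − sensitivity). This may reflect a different convention, which the abstract does not state.

```python
def likelihood_ratios(sensitivity, specificity):
    """Return (LR+, LR-) using the conventional definitions."""
    lr_pos = sensitivity / (1 - specificity)
    lr_neg = (1 - sensitivity) / specificity
    return lr_pos, lr_neg


# Rounded sensitivity and specificity as reported in the abstract.
lr_pos, lr_neg = likelihood_ratios(0.50, 0.82)
print(round(lr_pos, 2))      # about 2.78 (abstract reports 2.80)
print(round(lr_neg, 2))      # about 0.61
print(round(1 / lr_neg, 2))  # about 1.64, the value the abstract reports
```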
8
Abstract
The focus of this article is to provide an overview of the current technologies for the pharmaceutical and biotech industry. Disease processes express themselves in the functional and structural disturbance of cellular systems. Cells and their metabolites constitute the building blocks of tissues and entire organisms. Studying the spatial and temporal phenotype of disease processes in tissues at the cellular level reveals a multitude of information about the progress and status of a disease. Detailed exploration of tissues by slide-based cytometry is an important source of information about disease processes. Technological and analytical advances allow us to shed a new light on tissues and to come to a better understanding of the complexity of disease processes. Dealing with complex multidimensional datasets from tissue samples requires an advanced approach to image processing and data management. The increase in computing power and the continuing research into imaging algorithms allow us to improve the exploration of the data content of tissues.
9
Rodin A, Mosley TH, Clark AG, Sing CF, Boerwinkle E. Mining genetic epidemiology data with Bayesian networks application to APOE gene variation and plasma lipid levels. J Comput Biol 2005; 12:1-11. [PMID: 15725730 PMCID: PMC1201451 DOI: 10.1089/cmb.2005.12.1] [Citation(s) in RCA: 31] [Impact Index Per Article: 1.6] [Indexed: 11/13/2022] Open
Abstract
There is a critical need for data-mining methods that can identify SNPs that predict among-individual variation in a phenotype of interest and reverse-engineer the biological network of relationships between SNPs, phenotypes, and other factors. This problem is both challenging and important in light of the large number of SNPs in many genes of interest and across the human genome. A potentially fruitful form of exploratory data analysis is the Bayesian or Belief network. A Bayesian or Belief network provides an analytic approach for identifying robust predictors of among-individual variation in disease endpoints or risk factor levels. We have applied Belief networks to SNP variation in the human APOE gene and plasma apolipoprotein E levels from two samples: 702 African-Americans from Jackson, MS, and 854 non-Hispanic whites from Rochester, MN. Twenty variable sites in the APOE gene were genotyped in both samples. In Jackson, MS, SNPs 4036 and 4075 were identified as influencing plasma apoE levels. In Rochester, MN, SNPs 3937 and 4075 were identified as influencing plasma apoE levels. All three SNPs had been previously implicated in affecting measures of lipid and lipoprotein metabolism. Like all data-mining methods, Belief networks are meant to complement traditional hypothesis-driven methods of data analysis. These results document the utility of a Belief network approach for mining large-scale genotype-phenotype association data.
Affiliation(s)
- Andrei Rodin
- Human Genetics Center, University of Texas Health Science Center, Houston, TX
- Thomas H. Mosley
- Department of Medicine, University of Mississippi Medical Center, Jackson, MS
- Andrew G. Clark
- Department of Molecular Biology and Genetics, Cornell University, Ithaca, NY
- Charles F. Sing
- Department of Human Genetics, University of Michigan, Ann Arbor, MI
- Eric Boerwinkle
- Human Genetics Center, University of Texas Health Science Center, Houston, TX
- Institute of Molecular Medicine, University of Texas Health Science Center at Houston, Houston, TX
- Address correspondence to: Eric Boerwinkle, Human Genetics Center, 1200 Herman Pressler Drive, Suite E447, Houston, TX 77030, E-mail: