1. Al‐Mamun HA, Danilevicz MF, Marsh JI, Gondro C, Edwards D. Exploring genomic feature selection: A comparative analysis of GWAS and machine learning algorithms in a large-scale soybean dataset. The Plant Genome 2025; 18:e20503. [PMID: 39253773; PMCID: PMC11726426; DOI: 10.1002/tpg2.20503] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/12/2024] [Revised: 07/15/2024] [Accepted: 07/15/2024] [Indexed: 09/11/2024]
Abstract
The surge in high-throughput technologies has empowered the acquisition of vast genomic datasets, prompting the search for genetic markers and biomarkers relevant to complex traits. However, grappling with the inherent complexities of high dimensionality and sparsity within these datasets poses formidable hurdles. The immense number of features and their potential redundancy demand efficient strategies for extracting pertinent information and identifying significant markers. Feature selection is important in large genomic data as it helps in enhancing interpretability and computational efficiency. This study focuses on addressing these challenges through a comprehensive investigation into genomic feature selection methodologies, employing a rich soybean (Glycine max L. Merr.) dataset comprising 966 lines with over 5.5 million single nucleotide polymorphisms. Emphasizing the "small n large p" dilemma prevalent in contemporary genomic studies, we compared the efficacy of traditional genome-wide association studies (GWAS) with two prominent machine learning tools, random forest and extreme gradient boosting, in pinpointing predictive features. Utilizing the expansive soybean dataset, we assessed the performance of these methodologies in selecting features that optimize predictive modeling for various phenotypes. By constructing predictive models based on the selected features, we ascertain the comparative prediction accuracies, thereby illuminating the strengths and limitations of these feature selection methodologies in the realm of genomic data analysis.
Affiliation(s)
- Hawlader A. Al‐Mamun
  - Centre for Applied Bioinformatics and School of Biological Sciences, University of Western Australia, Perth, Western Australia, Australia
- Monica F. Danilevicz
  - Centre for Applied Bioinformatics and School of Biological Sciences, University of Western Australia, Perth, Western Australia, Australia
- Jacob I. Marsh
  - Department of Biology, University of North Carolina, Chapel Hill, North Carolina, USA
- Cedric Gondro
  - Department of Animal Science, Michigan State University, East Lansing, Michigan, USA
- David Edwards
  - Centre for Applied Bioinformatics and School of Biological Sciences, University of Western Australia, Perth, Western Australia, Australia
2. Marchitelli S, Mazza C, Ricci E, Faia V, Biondi S, Colasanti M, Cardinale A, Roma P, Tambelli R. Identification of Psychological Treatment Dropout Predictors Using Machine Learning Models on Italian Patients Living with Overweight and Obesity Ineligible for Bariatric Surgery. Nutrients 2024; 16:2605. [PMID: 39203742; PMCID: PMC11357013; DOI: 10.3390/nu16162605] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/04/2024] [Revised: 07/30/2024] [Accepted: 08/02/2024] [Indexed: 09/03/2024]
Abstract
According to the main international guidelines, patients with obesity and psychiatric/psychological disorders who cannot be referred for surgery are recommended to follow a nutritional approach and a psychological treatment. A total of 94 patients (T0) completed a battery of self-report measures: Symptom Checklist-90-Revised (SCL-90-R), Barratt Impulsiveness Scale-11 (BIS-11), Binge-Eating Scale (BES), Obesity-Related Well-Being Questionnaire-97 (ORWELL-97), and Minnesota Multiphasic Personality Inventory-2 (MMPI-2). Twelve sessions of brief psychodynamic psychotherapy were then delivered, after which the participants completed the follow-up evaluation (T1). Two groups of patients were identified: Group 1 (n = 65), who completed the assessment at both T0 and T1; and Group 2-dropout (n = 29), who completed the assessment only at T0 and not at T1. Machine learning models were implemented to investigate which variables were most associated with treatment failure. The classification tree model identified patients who dropped out of treatment with an accuracy of about 80% by considering two variables: the MMPI-2 Correction (K) scale and the SCL-90-R Phobic Anxiety (PHOB) scale. Given the limited number of studies on this topic, the present results highlight the importance of considering the patient's level of adaptation and the social context in which they are integrated in treatment planning. Cautionary notes, implications, and future directions are discussed.
Affiliation(s)
- Serena Marchitelli
  - UOC of Endocrinology, Metabolic Diseases, Andrology - CASCO (Center of High Specialization for the Treatment of Obesity), Policlinico Umberto I, Sapienza University of Rome, 00161 Rome, Italy
- Cristina Mazza
  - Department of Dynamic and Clinical Psychology, & Health Studies, Sapienza University of Rome, Via degli Apuli 1, 00185 Rome, Italy
- Eleonora Ricci
  - Department of Neuroscience, Imaging and Clinical Sciences, University “G.d’Annunzio”, 66100 Chieti-Pescara, Italy
- Valentina Faia
  - The Free Spirit Collective Polyclinic, Dubai 252330, United Arab Emirates
- Silvia Biondi
  - Department of Human Neuroscience, Sapienza University of Rome, 00185 Rome, Italy
- Marco Colasanti
  - Department of Psychological, Health and Territorial Sciences, University “G.d’Annunzio”, 66100 Chieti-Pescara, Italy
- Paolo Roma
  - Department of Human Neuroscience, Sapienza University of Rome, 00185 Rome, Italy
- Renata Tambelli
  - Department of Dynamic and Clinical Psychology, & Health Studies, Sapienza University of Rome, Via degli Apuli 1, 00185 Rome, Italy
3. Montesinos-López OA, Crespo-Herrera L, Pierre CS, Cano-Paez B, Huerta-Prado GI, Mosqueda-González BA, Ramos-Pulido S, Gerard G, Alnowibet K, Fritsche-Neto R, Montesinos-López A, Crossa J. Feature engineering of environmental covariates improves plant genomic-enabled prediction. Frontiers in Plant Science 2024; 15:1349569. [PMID: 38812738; PMCID: PMC11135473; DOI: 10.3389/fpls.2024.1349569] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/04/2023] [Accepted: 04/11/2024] [Indexed: 05/31/2024]
Abstract
Introduction: Because genomic selection (GS) is a predictive methodology, it needs to guarantee high prediction accuracies for practical implementations. However, since many factors affect the prediction performance of this methodology, its practical implementation still needs to be improved in many breeding programs. For this reason, many strategies have been explored to improve the prediction performance of this methodology. Methods: When environmental covariates are incorporated as inputs in the genomic prediction models, this information only sometimes helps increase prediction performance. For this reason, this investigation explores the use of feature engineering on the environmental covariates to enhance the prediction performance of genomic prediction models. Results and discussion: We found that, across data sets, feature engineering reduced prediction error by 761.625% across predictors relative to including the environmental covariates without feature engineering. These results are very promising regarding the potential of feature engineering to enhance prediction accuracy. However, since a significant gain in prediction accuracy was observed in only some data sets, further research is required to guarantee a robust feature engineering strategy to incorporate the environmental covariates.
Affiliation(s)
- Carolina Saint Pierre
  - International Maize and Wheat Improvement Center (CIMMYT), Texcoco, Edo. de Mexico, Mexico
- Bernabe Cano-Paez
  - Facultad de Ciencias, Universidad Nacional Autónoma de México (UNAM), México City, Mexico
- Sofia Ramos-Pulido
  - Centro Universitario de Ciencias Exactas e Ingenierías (CUCEI), Universidad de Guadalajara, Guadalajara, Jalisco, Mexico
- Guillermo Gerard
  - International Maize and Wheat Improvement Center (CIMMYT), Texcoco, Edo. de Mexico, Mexico
- Khalid Alnowibet
  - Department of Statistics and Operations Research, King Saud University, Riyadh, Saudi Arabia
- Abelardo Montesinos-López
  - Centro Universitario de Ciencias Exactas e Ingenierías (CUCEI), Universidad de Guadalajara, Guadalajara, Jalisco, Mexico
- José Crossa
  - International Maize and Wheat Improvement Center (CIMMYT), Texcoco, Edo. de Mexico, Mexico
  - Louisiana State University, Baton Rouge, LA, United States
  - Distinguished Scientist Fellowship Program, King Saud University, Riyadh, Saudi Arabia
  - Instituto de Socioeconomia, Estadistica e Informatica, Colegio de Postgraduados, Montecillos, Edo. de México, Texcoco, Mexico
4. Shi H, Zhang Y, Yu Z, Yang Y. Reservoir temperature prediction based on characterization of water chemistry data-case study of western Anatolia, Turkey. Sci Rep 2024; 14:10339. [PMID: 38710719; DOI: 10.1038/s41598-024-59409-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/14/2023] [Accepted: 04/10/2024] [Indexed: 05/08/2024]
Abstract
Reservoir temperature estimation is crucial for geothermal studies, but traditional methods are complex and uncertain. To address this, we collected 83 sets of water chemistry and reservoir temperature data and applied four machine learning algorithms. These models considered various input factors and underwent data preprocessing steps like null value imputation, normalization, and Pearson coefficient calculation. Cross-validation addressed data volume issues, and performance metrics were used for model evaluation. The results revealed that all of our machine learning models outperformed traditional fluid geothermometers. The XGBoost model, based on the F-3 combination, demonstrated the best prediction accuracy with an R² of 0.9732, while the Bayesian ridge regression model using the F-4 combination had the lowest performance with an R² of 0.8302. This study highlights the potential of machine learning for accurate reservoir temperature prediction, offering geothermal professionals a reliable tool for model selection and advancing our understanding of geothermal resources.
Affiliation(s)
- Haoxin Shi
  - College of Construction Engineering, Jilin University, Changchun, 130026, China
- Yanjun Zhang
  - College of Construction Engineering, Jilin University, Changchun, 130026, China
- Ziwang Yu
  - College of Construction Engineering, Jilin University, Changchun, 130026, China
- Yunxing Yang
  - College of Construction Engineering, Jilin University, Changchun, 130026, China
5. Alireza Z, Maleeha M, Kaikkonen M, Fortino V. Enhancing prediction accuracy of coronary artery disease through machine learning-driven genomic variant selection. J Transl Med 2024; 22:356. [PMID: 38627847; PMCID: PMC11020205; DOI: 10.1186/s12967-024-05090-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/12/2023] [Accepted: 03/14/2024] [Indexed: 04/19/2024]
Abstract
Machine learning (ML) methods are increasingly becoming crucial in genome-wide association studies for identifying key genetic variants or SNPs that statistical methods might overlook. Statistical methods predominantly identify SNPs with notable effect sizes by conducting association tests on individual genetic variants, one at a time, to determine their relationship with the target phenotype. These genetic variants are then used to create polygenic risk scores (PRSs), estimating an individual's genetic risk for complex diseases like cancer or cardiovascular disorders. Unlike traditional methods, ML algorithms can identify groups of low-risk genetic variants that improve prediction accuracy when combined in a mathematical model. However, the application of ML strategies requires addressing the feature selection challenge to prevent overfitting. Moreover, ensuring the ML model depends on a concise set of genomic variants enhances its clinical applicability, where testing is feasible for only a limited number of SNPs. In this study, we introduce a robust pipeline that applies ML algorithms in combination with feature selection (ML-FS algorithms), aimed at identifying the most significant genomic variants associated with the coronary artery disease (CAD) phenotype. The proposed computational approach was tested on individuals from the UK Biobank, differentiating between CAD and non-CAD individuals within this extensive cohort, and benchmarked against standard PRS-based methodologies like LDpred2 and Lassosum. Our strategy incorporates cross-validation to ensure a more robust evaluation of genomic variant-based prediction models. This method is commonly applied in machine learning strategies but has often been neglected in previous studies assessing the predictive performance of polygenic risk scores. 
Our results demonstrate that the ML-FS algorithm can identify panels with as few as 50 genetic markers that can achieve approximately 80% accuracy when used in combination with known risk factors. The modest increase in accuracy over PRS performances is noteworthy, especially considering that PRS models incorporate a substantially larger number of genetic variants. This extensive variant selection can pose practical challenges in clinical settings. Additionally, the proposed approach revealed novel CAD-genetic variant associations.
Affiliation(s)
- Z Alireza
  - Institute of Biomedicine, University of Eastern Finland, 70210, Kuopio, Finland
- M Maleeha
  - Institute of Biomedicine, University of Eastern Finland, 70210, Kuopio, Finland
- M Kaikkonen
  - A.I.Virtanen Institute, University of Eastern Finland, 70210, Kuopio, Finland
- V Fortino
  - Institute of Biomedicine, University of Eastern Finland, 70210, Kuopio, Finland
6. Asnicar F, Thomas AM, Passerini A, Waldron L, Segata N. Machine learning for microbiologists. Nat Rev Microbiol 2024; 22:191-205. [PMID: 37968359; PMCID: PMC11980903; DOI: 10.1038/s41579-023-00984-1] [Citation(s) in RCA: 44] [Impact Index Per Article: 44.0] [Accepted: 10/03/2023] [Indexed: 11/17/2023]
Abstract
Machine learning is increasingly important in microbiology where it is used for tasks such as predicting antibiotic resistance and associating human microbiome features with complex host diseases. The applications in microbiology are quickly expanding and the machine learning tools frequently used in basic and clinical research range from classification and regression to clustering and dimensionality reduction. In this Review, we examine the main machine learning concepts, tasks and applications that are relevant for experimental and clinical microbiologists. We provide the minimal toolbox for a microbiologist to be able to understand, interpret and use machine learning in their experimental and translational activities.
Affiliation(s)
- Francesco Asnicar
  - Department of Cellular, Computational and Integrative Biology, University of Trento, Trento, Italy
- Andrew Maltez Thomas
  - Department of Cellular, Computational and Integrative Biology, University of Trento, Trento, Italy
- Andrea Passerini
  - Department of Information Engineering and Computer Science, University of Trento, Trento, Italy
- Levi Waldron
  - Department of Cellular, Computational and Integrative Biology, University of Trento, Trento, Italy
  - Department of Epidemiology and Biostatistics, City University of New York, New York, NY, USA
- Nicola Segata
  - Department of Cellular, Computational and Integrative Biology, University of Trento, Trento, Italy
  - Department of Experimental Oncology, European Institute of Oncology IRCCS, Milan, Italy
7. Alemu A, Åstrand J, Montesinos-López OA, Isidro Y Sánchez J, Fernández-Gónzalez J, Tadesse W, Vetukuri RR, Carlsson AS, Ceplitis A, Crossa J, Ortiz R, Chawade A. Genomic selection in plant breeding: Key factors shaping two decades of progress. Molecular Plant 2024; 17:552-578. [PMID: 38475993; DOI: 10.1016/j.molp.2024.03.007] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/03/2023] [Revised: 01/22/2024] [Accepted: 03/08/2024] [Indexed: 03/14/2024]
Abstract
Genomic selection, the application of genomic prediction (GP) models to select candidate individuals, has significantly advanced in the past two decades, effectively accelerating genetic gains in plant breeding. This article provides a holistic overview of key factors that have influenced GP in plant breeding during this period. We delved into the pivotal roles of training population size and genetic diversity, and their relationship with the breeding population, in determining GP accuracy. Special emphasis was placed on optimizing training population size. We explored its benefits and the associated diminishing returns beyond an optimum size. This was done while considering the balance between resource allocation and maximizing prediction accuracy through current optimization algorithms. The density and distribution of single-nucleotide polymorphisms, level of linkage disequilibrium, genetic complexity, trait heritability, statistical machine-learning methods, and non-additive effects are the other vital factors. Using wheat, maize, and potato as examples, we summarize the effect of these factors on the accuracy of GP for various traits. The search for high accuracy in GP (theoretically reaching one when using Pearson's correlation as a metric) is an active research area that is as yet far from optimal for various traits. We hypothesize that with ultra-high sizes of genotypic and phenotypic datasets, effective training population optimization methods and support from other omics approaches (transcriptomics, metabolomics and proteomics) coupled with deep-learning algorithms could overcome the boundaries of current limitations to achieve the highest possible prediction accuracy, making genomic selection an effective tool in plant breeding.
Affiliation(s)
- Admas Alemu
  - Department of Plant Breeding, Swedish University of Agricultural Sciences, Alnarp, Sweden
- Johanna Åstrand
  - Department of Plant Breeding, Swedish University of Agricultural Sciences, Alnarp, Sweden; Lantmännen Lantbruk, Svalöv, Sweden
- Julio Isidro Y Sánchez
  - Centro de Biotecnología y Genómica de Plantas (CBGP, UPM-INIA), Universidad Politécnica de Madrid (UPM) - Instituto Nacional de Investigación y Tecnología Agraria y Alimentaria (INIA), Campus de Montegancedo-UPM, 28223 Madrid, Spain
- Javier Fernández-Gónzalez
  - Centro de Biotecnología y Genómica de Plantas (CBGP, UPM-INIA), Universidad Politécnica de Madrid (UPM) - Instituto Nacional de Investigación y Tecnología Agraria y Alimentaria (INIA), Campus de Montegancedo-UPM, 28223 Madrid, Spain
- Wuletaw Tadesse
  - International Center for Agricultural Research in the Dry Areas (ICARDA), Rabat, Morocco
- Ramesh R Vetukuri
  - Department of Plant Breeding, Swedish University of Agricultural Sciences, Alnarp, Sweden
- Anders S Carlsson
  - Department of Plant Breeding, Swedish University of Agricultural Sciences, Alnarp, Sweden
- José Crossa
  - International Maize and Wheat Improvement Center (CIMMYT), Km 45, Carretera México-Veracruz, Texcoco, México 52640, Mexico
- Rodomiro Ortiz
  - Department of Plant Breeding, Swedish University of Agricultural Sciences, Alnarp, Sweden
- Aakash Chawade
  - Department of Plant Breeding, Swedish University of Agricultural Sciences, Alnarp, Sweden
8. Ćeran M, Đorđević V, Miladinović J, Vasiljević M, Đukić V, Ranđelović P, Jaćimović S. Selective Genotyping and Phenotyping for Optimization of Genomic Prediction Models for Populations with Different Diversity. Plants (Basel, Switzerland) 2024; 13:975. [PMID: 38611503; PMCID: PMC11013471; DOI: 10.3390/plants13070975] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/26/2024] [Revised: 03/22/2024] [Accepted: 03/24/2024] [Indexed: 04/14/2024]
Abstract
To overcome the different challenges to food security caused by a growing population and climate change, soybean (Glycine max (L.) Merr.) breeders are creating novel cultivars that have the potential to improve productivity while maintaining environmental sustainability. Genomic selection (GS) is an advanced approach that may accelerate the rate of genetic gain in breeding using genome-wide molecular markers. The accuracy of genomic selection can be affected by trait architecture and heritability, marker density, linkage disequilibrium, statistical models, and training set. The selection of a minimal and optimal marker set with high prediction accuracy can lower genotyping costs, computational time, and multicollinearity. Selective phenotyping could reduce the number of genotypes tested in the field while preserving the genetic diversity of the initial population. This study aimed to evaluate different methods of selective genotyping and phenotyping on the accuracy of genomic prediction for soybean yield. The evaluation was performed on three populations: recombinant inbred lines, multifamily diverse lines, and germplasm collection. Strategies adopted for marker selection were as follows: SNP (single nucleotide polymorphism) pruning, estimation of marker effects, randomly selected markers, and genome-wide association study. Reduction of the number of genotypes was performed by selecting a core set from the initial population based on marker data, yet maintaining the original population's genetic diversity. Prediction ability using all markers and genotypes was different among examined populations. The subsets obtained by the model-based strategy can be considered the most suitable for marker selection for all populations. 
Selective phenotyping based on markers had, in all cases, higher prediction ability than the minimum prediction ability obtained over multiple cycles of random selection, with the highest prediction values obtained using the AN approach and 75% population size. The obtained results indicate that selective genotyping and phenotyping hold great potential and can be integrated as tools for improving or retaining selection accuracy by reducing genotyping or phenotyping costs for genomic selection.
Affiliation(s)
- Marina Ćeran
  - Laboratory for Biotechnology, Institute of Field and Vegetable Crops, National Institute of the Republic of Serbia, Maksima Gorkog 30, 21000 Novi Sad, Serbia
- Vuk Đorđević
  - Legumes Department, Institute of Field and Vegetable Crops, National Institute of the Republic of Serbia, Maksima Gorkog 30, 21000 Novi Sad, Serbia
- Jegor Miladinović
  - Legumes Department, Institute of Field and Vegetable Crops, National Institute of the Republic of Serbia, Maksima Gorkog 30, 21000 Novi Sad, Serbia
- Marjana Vasiljević
  - Legumes Department, Institute of Field and Vegetable Crops, National Institute of the Republic of Serbia, Maksima Gorkog 30, 21000 Novi Sad, Serbia
- Vojin Đukić
  - Legumes Department, Institute of Field and Vegetable Crops, National Institute of the Republic of Serbia, Maksima Gorkog 30, 21000 Novi Sad, Serbia
- Predrag Ranđelović
  - Legumes Department, Institute of Field and Vegetable Crops, National Institute of the Republic of Serbia, Maksima Gorkog 30, 21000 Novi Sad, Serbia
- Simona Jaćimović
  - Legumes Department, Institute of Field and Vegetable Crops, National Institute of the Republic of Serbia, Maksima Gorkog 30, 21000 Novi Sad, Serbia
9. Tutsoy O, Koç GG. Deep self-supervised machine learning algorithms with a novel feature elimination and selection approaches for blood test-based multi-dimensional health risks classification. BMC Bioinformatics 2024; 25:103. [PMID: 38459463; PMCID: PMC10921629; DOI: 10.1186/s12859-024-05729-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/02/2023] [Accepted: 03/04/2024] [Indexed: 03/10/2024]
Abstract
BACKGROUND: Blood tests are extensively performed for screening, diagnosis and surveillance purposes. Although it is possible to automatically evaluate raw blood test data with advanced deep self-supervised machine learning approaches, this has not yet been thoroughly investigated or implemented. RESULTS: This paper proposes deep machine learning algorithms with multi-dimensional adaptive feature elimination, self-feature weighting and novel feature selection approaches. To classify the health risks based on the data processed by the deep layers, four machine learning algorithms with various properties, ranging from utterly model free to gradient driven, are modified. CONCLUSIONS: The results show that the proposed deep machine learning algorithms can remove unnecessary features, assign self-importance weights, select the most informative features and classify the health risks automatically from worst-case low to worst-case high values.
Affiliation(s)
- Onder Tutsoy
  - Adana Alparslan Turkes Science and Technology University, Adana, Turkey
- Gizem Gul Koç
  - Adana Alparslan Turkes Science and Technology University, Adana, Turkey
10. Sanchez-Trigo H, Molina-Martínez E, Grimaldi-Puyana M, Sañudo B. Effects of lifestyle behaviours and depressed mood on sleep quality in young adults. A machine learning approach. Psychol Health 2024; 39:128-143. [PMID: 35475409; DOI: 10.1080/08870446.2022.2067331] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/23/2021] [Accepted: 04/05/2022] [Indexed: 10/18/2022]
Abstract
BACKGROUND: Modern lifestyles may lead to high stress levels, frequently associated with mood disorders (e.g. depressed mood) and sleep disturbance. The objective of this study was to develop a machine learning model aimed at identifying risk factors for developing poor sleep quality in young adults. MATERIAL AND METHODS: The sample consisted of 383 college-aged students (mean age ± SD: 21 ± 1 years; 61% males). Sleep quality, mood state, physical activity, number of sitting hours, and smartphone use were measured. RESULTS: A decision tree algorithm distinguished participants' sleep quality with 74% accuracy using a combination of four features: depressed mood, physical activity, sitting time, and vigour. Together with depressed mood, both physical activity (>6432 metabolic equivalent tasks (METs) per week) and sedentary behaviour (sitting time greater than 7 h/day) were the primary features that could differentiate those with poor sleep quality from those with good sleep quality. CONCLUSIONS: We provided a decision tree model with a sensitivity of 90.7% and a specificity of 54.3%, with an AUC of 0.725. These findings could promote improvements in prevention strategies and contribute to the development of meaningful and evidence-based intervention programs.
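The accuracy, sensitivity and specificity quoted in this abstract are all derived from a confusion matrix. The following Python sketch shows how the three relate, under stated assumptions: the abstract does not report the underlying counts, so the numbers below are hypothetical and chosen only for illustration, and `classification_metrics` is a name introduced here, not something from the study.

```python
def classification_metrics(tp: int, fn: int, tn: int, fp: int) -> dict:
    """Return sensitivity, specificity and accuracy from confusion-matrix counts."""
    positives = tp + fn  # cases that truly belong to the positive class
    negatives = tn + fp  # cases that truly belong to the negative class
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative case")
    return {
        "sensitivity": tp / positives,  # share of true positives detected
        "specificity": tn / negatives,  # share of true negatives detected
        "accuracy": (tp + tn) / (positives + negatives),
    }


# Hypothetical counts (NOT taken from the study): 100 positives, 100 negatives.
metrics = classification_metrics(tp=90, fn=10, tn=50, fp=50)
print(metrics)
```

With balanced classes, accuracy is the average of sensitivity and specificity; with unbalanced classes it is pulled toward the metric of the larger class, which is one way a model can combine high sensitivity with modest specificity and still report moderate accuracy.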
Affiliation(s)
- Borja Sañudo
  - Physical Education and Sports Department, University of Seville, Sevilla, Spain
11. Ly QV, Tong NA, Lee BM, Nguyen MH, Trung HT, Le Nguyen P, Hoang THT, Hwang Y, Hur J. Improving algal bloom detection using spectroscopic analysis and machine learning: A case study in a large artificial reservoir, South Korea. The Science of the Total Environment 2023; 901:166467. [PMID: 37611716; DOI: 10.1016/j.scitotenv.2023.166467] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 05/04/2023] [Revised: 08/17/2023] [Accepted: 08/19/2023] [Indexed: 08/25/2023]
Abstract
The prediction of algal blooms using traditional water quality indicators is expensive, labor-intensive, and time-consuming, making it challenging to meet the critical requirement of timely monitoring for prompt management. Using optical measures for forecasting algal blooms is a feasible and useful method to overcome these problems. This study explores the potential application of optical measures to enhance algal bloom prediction in terms of prediction accuracy and workload reduction, aided by machine learning (ML) models. Compared to absorption-derived parameters, commonly used fluorescence indices such as the fluorescence index (FI), humification index (HIX), biological index (BIX), and protein-like component improved the prediction accuracy. However, the prediction accuracy decreased when all optical indices were included in the computation, owing to increased noise and uncertainty in the models. With the exception of chemical oxygen demand (COD), this study successfully replaced biochemical oxygen demand (BOD), dissolved organic carbon (DOC), and nutrients with selected fluorescence indices, demonstrating comparable performance on both training and testing data, with consistently good coefficient of determination (R²) values of approximately 0.85 and 0.74, respectively. Among all models considered, ensemble learning models consistently outperformed conventional regression models and artificial neural networks (ANNs). However, there was a trade-off between accuracy and computation efficiency among the ensemble learning models (i.e., Stacking and XGBoost) for algal bloom prediction. Our study offers a glimpse of the potential application of spectroscopic measures to improve accuracy and efficiency in algal bloom prediction, but further work should be carried out in other water bodies to validate our proposed hypothesis.
Collapse
Affiliation(s)
- Quang Viet Ly
- Department of Environmental Engineering, Seoul National University of Science and Technology, Seoul 01811, South Korea
| | - Ngoc Anh Tong
- School of Information and Communication Technology, Hanoi University of Science and Technology, Hanoi, Vietnam
| | - Bo-Mi Lee
- Water Quality Assessment Research Division, National Institute of Environmental Research, Incheon 22689, South Korea
| | - Minh Hieu Nguyen
- School of Information and Communication Technology, Hanoi University of Science and Technology, Hanoi, Vietnam; School of Information and Communication Technology, Griffith University, Gold Coast, Australia
| | - Huynh Thanh Trung
- Ecole Polytechnique Federale de Lausanne, 1015 Lausanne, Switzerland
| | - Phi Le Nguyen
- School of Information and Communication Technology, Hanoi University of Science and Technology, Hanoi, Vietnam
| | - Thu-Huong T Hoang
- School of Chemistry and Life Science, Hanoi University of Science and Technology, Hanoi 10000, Vietnam
| | - Yuhoon Hwang
- Department of Environmental Engineering, Seoul National University of Science and Technology, Seoul 01811, South Korea
| | - Jin Hur
- Department of Environment and Energy, Sejong University, Seoul 05006, South Korea.
| |
|
12
|
Heinrich F, Lange TM, Kircher M, Ramzan F, Schmitt AO, Gültas M. Exploring the potential of incremental feature selection to improve genomic prediction accuracy. Genet Sel Evol 2023; 55:78. [PMID: 37946104 PMCID: PMC10634161 DOI: 10.1186/s12711-023-00853-8] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/03/2023] [Accepted: 11/02/2023] [Indexed: 11/12/2023] Open
Abstract
BACKGROUND The ever-increasing availability of high-density genomic markers in the form of single nucleotide polymorphisms (SNPs) enables genomic prediction, i.e. the inference of phenotypes based solely on genomic data, in the field of animal and plant breeding, where it has become an important tool. However, given the limited number of individuals, the abundance of variables (SNPs) can reduce the accuracy of prediction models due to overfitting or irrelevant SNPs. Feature selection can help to reduce the number of irrelevant SNPs and increase the model performance. In this study, we investigated an incremental feature selection approach based on ranking the SNPs according to the results of a genome-wide association study that we combined with random forest as a prediction model, and we applied it on several animal and plant datasets. RESULTS Applying our approach to different datasets yielded a wide range of outcomes, i.e. from a substantial increase in prediction accuracy in a few cases to minor improvements when only a fraction of the available SNPs were used. Compared with models using all available SNPs, our approach was able to achieve comparable performances with a considerably reduced number of SNPs in several cases. Our approach showcased state-of-the-art efficiency and performance while having a faster computation time. CONCLUSIONS The results of our study suggest that our incremental feature selection approach has the potential to improve prediction accuracy substantially. However, this gain seems to depend on the genomic data used. Even for datasets where the number of markers is smaller than the number of individuals, feature selection may still increase the performance of the genomic prediction. Our approach is implemented in R and is available at https://github.com/FelixHeinrich/GP_with_IFS/ .
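The incremental selection idea in this abstract can be illustrated with a minimal sketch. It is not the authors' R implementation: the GWAS ranking is replaced by an absolute Pearson correlation score, the random forest is replaced by a small k-nearest-neighbour regressor so the example needs only the Python standard library, and all function names and the simulated genotypes are assumptions made for illustration.

```python
import random
from statistics import mean

def pearson(x, y):
    """Pearson correlation; returns 0.0 when either input is constant."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return 0.0 if sxx == 0 or syy == 0 else sxy / (sxx * syy) ** 0.5

def rank_markers(X, y):
    """Order marker indices by |correlation| with the phenotype (stand-in for a GWAS ranking)."""
    n_markers = len(X[0])
    scores = [abs(pearson([row[j] for row in X], y)) for j in range(n_markers)]
    return sorted(range(n_markers), key=lambda j: -scores[j])

def knn_predict(X_train, y_train, x, cols, k=5):
    """Mean phenotype of the k nearest training individuals (stand-in for random forest)."""
    dists = [(sum((row[j] - x[j]) ** 2 for j in cols), yi)
             for row, yi in zip(X_train, y_train)]
    dists.sort(key=lambda t: t[0])  # sort on distance only; ties keep input order
    return mean(yi for _, yi in dists[:k])

def incremental_selection(X_tr, y_tr, X_val, y_val, step=1):
    """Grow the marker set in ranked order and keep the prefix with the best validation accuracy."""
    order = rank_markers(X_tr, y_tr)  # ranking uses training data only, to avoid leakage
    best_acc, best_cols, history = float("-inf"), [], []
    for k in range(step, len(order) + 1, step):
        cols = order[:k]
        preds = [knn_predict(X_tr, y_tr, x, cols) for x in X_val]
        acc = pearson(preds, y_val)  # accuracy = correlation of predicted and observed
        history.append((k, acc))
        if acc > best_acc:
            best_acc, best_cols = acc, cols
    return order, best_cols, best_acc, history

# Simulated data (hypothetical): 120 individuals, 30 markers coded 0/1/2,
# with only markers 0 and 1 affecting the phenotype.
rng = random.Random(42)
N, P = 120, 30
X = [[rng.choice([0, 1, 2]) for _ in range(P)] for _ in range(N)]
y = [3.0 * row[0] + 2.0 * row[1] + rng.gauss(0.0, 0.5) for row in X]

order, best_cols, best_acc, history = incremental_selection(
    X[:80], y[:80], X[80:], y[80:], step=3)
print(len(best_cols), "markers selected")
```

The ranking is computed on the training split only, and the validation split is used solely to choose how many ranked markers to keep.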
Affiliation(s)
- Felix Heinrich
- Breeding Informatics Group, Department of Animal Sciences, Georg-August University, Margarethe von Wrangell-Weg 7, 37075, Göttingen, Germany.
| | - Thomas Martin Lange
- Breeding Informatics Group, Department of Animal Sciences, Georg-August University, Margarethe von Wrangell-Weg 7, 37075, Göttingen, Germany
| | - Magdalena Kircher
- Institute for Animal Breeding and Genetics, University of Veterinary Medicine Hannover, Bünteweg 17p, 30559, Hannover, Germany
| | - Faisal Ramzan
- Institute of Animal and Dairy Sciences, University of Agriculture Faisalabad, Jail Road, 38000, Faisalabad, Pakistan
| | - Armin Otto Schmitt
- Breeding Informatics Group, Department of Animal Sciences, Georg-August University, Margarethe von Wrangell-Weg 7, 37075, Göttingen, Germany
- Center for Integrated Breeding Research (CiBreed), Georg-August University, Albrecht-Thaer-Weg 3, 37075, Göttingen, Germany
| | - Mehmet Gültas
- Center for Integrated Breeding Research (CiBreed), Georg-August University, Albrecht-Thaer-Weg 3, 37075, Göttingen, Germany.
- Faculty of Agriculture, South Westphalia University of Applied Sciences, 59494, Soest, Germany.
| |
|
13
|
Zhang Y, Zhang M, Ye J, Xu Q, Feng Y, Xu S, Hu D, Wei X, Hu P, Yang Y. Integrating genome-wide association study into genomic selection for the prediction of agronomic traits in rice (Oryza sativa L.). MOLECULAR BREEDING: NEW STRATEGIES IN PLANT IMPROVEMENT 2023; 43:81. [PMID: 37965378 PMCID: PMC10641074 DOI: 10.1007/s11032-023-01423-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 07/11/2023] [Accepted: 10/09/2023] [Indexed: 11/16/2023]
Abstract
Accurately identifying varieties with targeted agronomic traits was thought to contribute to genetic selection and accelerate rice breeding progress. Genomic selection (GS) is a promising technique that uses markers covering the whole genome to predict the genomic-estimated breeding values (GEBV), with the ability to select before phenotypes are measured. To choose the appropriate GS models for breeding work, we analyzed the predictability of nine agronomic traits measured from a population of 459 diverse rice varieties. By the comparison of eight representative GS models, we found that the prediction accuracies ranged from 0.407 to 0.896, with reproducing kernel Hilbert space (RKHS) having the highest predictive ability in most traits. Further results demonstrated the predictivity of GS is altered by several factors. Moreover, we assessed the method of integrating genome-wide association study (GWAS) into various GS models. The predictabilities of GS combined peak-associated markers generated from six different GWAS models were significantly different; a recommendation of Mixed Linear Model (MLM)-RKHS was given for the GWAS-GS-integrated prediction. Finally, based on the above result, we experimented with applying the P-values obtained from optimal GWAS models into ridge regression best linear unbiased prediction (rrBLUP), which benefited the low predictive traits in rice. Supplementary Information The online version contains supplementary material available at 10.1007/s11032-023-01423-y.
Affiliation(s)
- Yuanyuan Zhang
- Zhejiang Lab, Hangzhou, 311121 China
- CNRRI-Zhejiang Lab Computational Breeding Joint Laboratory, China National Rice Research Institute, Hangzhou, China
| | - Mengchen Zhang
- Zhejiang Lab, Hangzhou, 311121 China
- CNRRI-Zhejiang Lab Computational Breeding Joint Laboratory, China National Rice Research Institute, Hangzhou, China
- National Nanfan Research Institute (Sanya), Chinese Academy of Agricultural Sciences, Sanya, 572024 China
| | - Junhua Ye
- CNRRI-Zhejiang Lab Computational Breeding Joint Laboratory, China National Rice Research Institute, Hangzhou, China
| | - Qun Xu
- CNRRI-Zhejiang Lab Computational Breeding Joint Laboratory, China National Rice Research Institute, Hangzhou, China
| | - Yue Feng
- CNRRI-Zhejiang Lab Computational Breeding Joint Laboratory, China National Rice Research Institute, Hangzhou, China
- National Nanfan Research Institute (Sanya), Chinese Academy of Agricultural Sciences, Sanya, 572024 China
| | - Siliang Xu
- CNRRI-Zhejiang Lab Computational Breeding Joint Laboratory, China National Rice Research Institute, Hangzhou, China
| | - Dongxiu Hu
- CNRRI-Zhejiang Lab Computational Breeding Joint Laboratory, China National Rice Research Institute, Hangzhou, China
| | - Xinghua Wei
- Zhejiang Lab, Hangzhou, 311121 China
- CNRRI-Zhejiang Lab Computational Breeding Joint Laboratory, China National Rice Research Institute, Hangzhou, China
- National Nanfan Research Institute (Sanya), Chinese Academy of Agricultural Sciences, Sanya, 572024 China
| | - Peisong Hu
- Zhejiang Lab, Hangzhou, 311121 China
- CNRRI-Zhejiang Lab Computational Breeding Joint Laboratory, China National Rice Research Institute, Hangzhou, China
| | - Yaolong Yang
- Zhejiang Lab, Hangzhou, 311121 China
- CNRRI-Zhejiang Lab Computational Breeding Joint Laboratory, China National Rice Research Institute, Hangzhou, China
- National Nanfan Research Institute (Sanya), Chinese Academy of Agricultural Sciences, Sanya, 572024 China
| |
|
14
|
Gebski V, Silva SSM, Byth K, Jenkins A, Keech A. Improving efficiency of fitting Cox proportional hazards models for time-to-event outcomes in genome-wide association studies (GWAS). BIOINFORMATICS ADVANCES 2023; 3:vbad148. [PMID: 37928342 PMCID: PMC10625458 DOI: 10.1093/bioadv/vbad148] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 04/17/2023] [Revised: 10/02/2023] [Accepted: 10/11/2023] [Indexed: 11/07/2023]
Abstract
Summary Technologies identifying single nucleotide polymorphisms (SNPs) in DNA sequencing yield an avalanche of data requiring analysis and interpretation. Standard methods may require many weeks of processing time. The use of statistical methods requiring data sorting, high-dimensional matrix inversions, and replication in subsets of the data on multiple outcomes exacerbates these times. A method is proposed which reduces the computational time in problems with time-to-event outcomes and hundreds of thousands/millions of SNPs, using Cox-Snell residuals after fitting the Cox proportional hazards (PH) model to a fixed set of concomitant variables. This yields coefficients for SNP effect from a Cox-Snell adjusted Poisson model and shows a high concordance to the adjusted PH model. The method is illustrated with a sample of 10 000 SNPs from a genome-wide association study in a diabetic population. The gain in processing efficiency using the proposed method based on Poisson modelling can be as high as 62%. This could result in a saving of over three weeks of processing time if 5 million SNPs require analysis. The method involves only a single predictor variable (SNP), offering a simpler, computationally more stable approach to examining and identifying SNP patterns associated with the outcome(s), allowing for a faster development of genetic signatures. Use of deviance residuals from the PH model to screen SNPs demonstrates a large discordance rate at a 0.2% threshold of concordance. This rate is 15 times larger than that based on the Cox-Snell residuals from the Cox-Snell adjusted Poisson model. Availability and implementation The method is simple to implement as the procedures are available in most statistical packages. The approach involves obtaining Cox-Snell residuals from a PH model, to a binary time-to-event outcome, for factors which need to be common when assessing each SNP. 
Each SNP is then fitted as a predictor to the outcome of interest using a Poisson model with the Cox-Snell as the exposure variable.
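The per-SNP step described here, a Poisson model with the Cox-Snell residual as the exposure, can be sketched as follows. This is a simplified illustration rather than the authors' code: the function name `fit_poisson_exposure` is hypothetical, the Newton-Raphson fit is hand-written to avoid external libraries, and the simulated exposure is simply the observed follow-up time (the cumulative hazard under a unit baseline hazard with no concomitant variables), standing in for residuals that would normally come from a fitted PH model.

```python
import math
import random

def fit_poisson_exposure(events, genotype, exposure, n_iter=25):
    """Newton-Raphson fit of events_i ~ Poisson(exposure_i * exp(a + b * genotype_i)).

    In the approach summarised above, `exposure` would hold the Cox-Snell
    residuals from the covariate-only PH model; `b` is the SNP coefficient.
    """
    a = math.log(sum(events) / sum(exposure))  # start at the overall event rate
    b = 0.0
    for _ in range(n_iter):
        mu = [r * math.exp(a + b * g) for r, g in zip(exposure, genotype)]
        # Score vector (first derivatives of the log-likelihood)
        u_a = sum(d - m for d, m in zip(events, mu))
        u_b = sum(g * (d - m) for d, m, g in zip(events, mu, genotype))
        # Fisher information (2x2, symmetric)
        i_aa = sum(mu)
        i_ab = sum(g * m for g, m in zip(genotype, mu))
        i_bb = sum(g * g * m for g, m in zip(genotype, mu))
        det = i_aa * i_bb - i_ab ** 2
        # Newton step: (a, b) += inverse(information) * score
        a += (i_bb * u_a - i_ab * u_b) / det
        b += (i_aa * u_b - i_ab * u_a) / det
    return a, b

# Simulated data (hypothetical): exponential event times whose hazard is
# exp(TRUE_B * genotype), administratively censored at CENSOR.
rng = random.Random(7)
TRUE_B, CENSOR = 0.5, 2.0
genotype, events, exposure = [], [], []
for _ in range(3000):
    g = rng.choice([0, 1, 2])
    t = rng.expovariate(math.exp(TRUE_B * g))
    genotype.append(g)
    events.append(1 if t <= CENSOR else 0)
    exposure.append(min(t, CENSOR))

a_hat, b_hat = fit_poisson_exposure(events, genotype, exposure)
print(round(a_hat, 2), round(b_hat, 2))
```

Because each fit involves only an intercept and one SNP coefficient, the information matrix is 2x2 and can be inverted in closed form, which is consistent with the abstract's point that a single-predictor model is computationally light.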
Affiliation(s)
- Val Gebski
- NHMRC Clinical Trials Centre, University of Sydney, Camperdown, NSW 1450, Australia
| | - S Sandun M Silva
- NHMRC Clinical Trials Centre, University of Sydney, Camperdown, NSW 1450, Australia
| | - Karen Byth
- NHMRC Clinical Trials Centre, University of Sydney, Camperdown, NSW 1450, Australia
| | - Alicia Jenkins
- NHMRC Clinical Trials Centre, University of Sydney, Camperdown, NSW 1450, Australia
| | - Anthony Keech
- NHMRC Clinical Trials Centre, University of Sydney, Camperdown, NSW 1450, Australia
| |
|
15
|
Lee YS, Oh JD, Lee JY, Shin D. A genomic estimated breeding value-assisted reduction method of single nucleotide polymorphism sets: a novel approach for determining the cutoff thresholds in genome-wide association studies and best linear unbiased prediction. Anim Cells Syst (Seoul) 2023; 27:180-186. [PMID: 37674816 PMCID: PMC10478620 DOI: 10.1080/19768354.2023.2250841] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/25/2023] [Revised: 06/16/2023] [Accepted: 07/20/2023] [Indexed: 09/08/2023] Open
Abstract
Traditionally, the p-value is the criterion for the cutoff threshold to determine significant markers in genome-wide association studies (GWASs). Choosing the best subset of markers for the best linear unbiased prediction (BLUP) for improved prediction ability (PA) has become an interesting issue. However, when dealing with many traits having the same marker information, the p-values themselves cannot be used as an obvious solution for having confidence in GWAS and BLUP. We thus suggest a genomic estimated breeding value-assisted reduction method of the single nucleotide polymorphism (SNP) set (GARS) to address these difficulties. GARS is a BLUP-based SNP set decision presentation. The samples were Landrace pigs and the traits used were back fat thickness (BF) and daily weight gain (DWG). The prediction abilities (PAs) for BF and DWG for the entire SNP set were 0.8 and 0.8, respectively. By using the correlation between genomic estimated breeding values (GEBVs) and phenotypic values, selecting the cutoff threshold in GWAS and the best SNP subsets in BLUP was plausible as defined by the GARS method. 6,000 SNPs in BF and 4,000 SNPs in DWG were considered as adequate thresholds. Gene Ontology (GO) analysis using the GARS results of the BF indicated neuron projection development as the notable GO term, whereas for the DWG, the main GO terms were nervous system development and cell adhesion.
Affiliation(s)
- Young-Sup Lee
- Department of Animal Biotechnology, Jeonbuk National University, Jeonju, Republic of Korea
| | - Jae-Don Oh
- Department of Animal Biotechnology, Jeonbuk National University, Jeonju, Republic of Korea
| | - Jun-Yeong Lee
- School of Life Sciences, BK21 FOUR KNU Creative BioResearch Group, Kyungpook National University, Daegu, Republic of Korea
| | - Donghyun Shin
- Department of Agricultural Convergence Technology, Jeonbuk National University, Jeonju, Republic of Korea
| |
|
16
|
Mohseni N, Ghaniee Zarich M, Afshar S, Hosseini M. Identification of Novel Biomarkers for Response to Preoperative Chemoradiation in Locally Advanced Rectal Cancer with Genetic Algorithm-Based Gene Selection. J Gastrointest Cancer 2023; 54:937-950. [PMID: 36534304 DOI: 10.1007/s12029-022-00873-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 10/05/2022] [Indexed: 12/23/2022]
Abstract
BACKGROUND The conventional treatment for patients with locally advanced colorectal tumors is preoperative chemo-radiotherapy (PCRT) preceding surgery. This treatment strategy has some long-term side effects, and some patients do not respond to it. Therefore, an evaluation of biomarkers that may help predict patients' response to PCRT is essential. METHODS We took advantage of genetic algorithm to search the space of possible combinations of features to choose subsets of genes that would yield convenient performance in differentiating PCRT responders from non-responders using a logistic regression model as our classifier. RESULTS We developed two gene signatures; first, to achieve the maximum prediction accuracy, the algorithm yielded 39 genes, and then, aiming to reduce the feature numbers as much as possible (while maintaining acceptable performance), a 5-gene signature was chosen. The performance of the two gene signatures was (accuracy = 0.97 and 0.81, sensitivity = 0.96 and 0.83, and specificity = 86 and 0.77) using a logistic regression classifier. Through analyzing bias and variance decomposition of the model error, we further investigated the involved genes by discovering and validating another 28-gene signature which possibly points towards two different sub-systems involved in the response of the patients to treatment. CONCLUSIONS Using genetic algorithm as our gene selection method, we have identified two groups of genes that can differentiate PCRT responders from non-responders in patients of the studied dataset with considerable performance. IMPACT After passing standard requirements, our gene signatures may be applicable as a robust and effective PCRT response prediction tool for colorectal cancer patients in clinical settings and may also help future studies aiming to further investigate involved pathways gain a clearer picture for the course of their research.
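The search strategy in the METHODS paragraph, a genetic algorithm exploring feature subsets scored by a classifier, can be sketched in a few lines. The sketch makes several assumptions that are not from the paper: a nearest-centroid classifier replaces logistic regression to keep the example dependency-free, the fitness subtracts a small per-feature penalty to favour compact signatures, and the data, names, and GA settings are all hypothetical.

```python
import random

rng = random.Random(0)
N_FEATURES = 12

def make_sample(label):
    """Simulated expression profile: only features 0 and 1 differ between classes."""
    return [rng.gauss(2.0 * label if j < 2 else 0.0, 1.0) for j in range(N_FEATURES)]

labels = [i % 2 for i in range(80)]
data = [make_sample(lab) for lab in labels]
X_train, y_train = data[:50], labels[:50]
X_val, y_val = data[50:], labels[50:]

def centroid(rows, cols):
    return [sum(r[j] for r in rows) / len(rows) for j in cols]

def fitness(mask):
    """Validation accuracy of a nearest-centroid classifier minus a size penalty."""
    cols = [j for j, bit in enumerate(mask) if bit]
    if not cols:
        return 0.0
    c0 = centroid([x for x, lab in zip(X_train, y_train) if lab == 0], cols)
    c1 = centroid([x for x, lab in zip(X_train, y_train) if lab == 1], cols)
    correct = 0
    for x, lab in zip(X_val, y_val):
        d0 = sum((x[j] - m) ** 2 for j, m in zip(cols, c0))
        d1 = sum((x[j] - m) ** 2 for j, m in zip(cols, c1))
        correct += int((1 if d1 < d0 else 0) == lab)
    return correct / len(X_val) - 0.01 * len(cols)

def tournament(pop, scores, k=3):
    """Return the fittest of k randomly drawn individuals."""
    contenders = rng.sample(range(len(pop)), k)
    return pop[max(contenders, key=lambda i: scores[i])]

def run_ga(pop_size=20, generations=15, p_mut=0.1):
    pop = [[rng.randint(0, 1) for _ in range(N_FEATURES)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        scores = [fitness(ind) for ind in pop]
        elite = pop[max(range(pop_size), key=lambda i: scores[i])]
        history.append(max(scores))
        children = [elite[:]]  # elitism: the best individual survives unchanged
        while len(children) < pop_size:
            a, b = tournament(pop, scores), tournament(pop, scores)
            cut = rng.randint(1, N_FEATURES - 1)  # one-point crossover
            child = a[:cut] + b[cut:]
            child = [1 - bit if rng.random() < p_mut else bit for bit in child]  # bit-flip mutation
            children.append(child)
        pop = children
    scores = [fitness(ind) for ind in pop]
    history.append(max(scores))
    return pop[max(range(pop_size), key=lambda i: scores[i])], history

best_mask, history = run_ga()
print([j for j, bit in enumerate(best_mask) if bit], history[-1])
```

Raising the per-feature penalty pushes the search toward smaller subsets, loosely mirroring the trade-off the abstract describes between the 39-gene and 5-gene signatures.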
Affiliation(s)
- Nima Mohseni
- Department of Biology, Faculty of Science, Lund University, Skåne, Sweden
| | | | - Saeid Afshar
- Research Center for Molecular Medicine, Hamadan University of Medical Sciences, Hamadan, Iran.
| | | |
|
17
|
Mahmoudi A, Butler AE, Banach M, Jamialahmadi T, Sahebkar A. Identification of Potent Small-Molecule PCSK9 Inhibitors Based on Quantitative Structure-Activity Relationship, Pharmacophore Modeling, and Molecular Docking Procedure. Curr Probl Cardiol 2023; 48:101660. [PMID: 36841313 DOI: 10.1016/j.cpcardiol.2023.101660] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/05/2023] [Accepted: 02/17/2023] [Indexed: 02/27/2023]
Abstract
The leading cause of atherosclerotic cardiovascular disease (ASCVD) is elevated low-density lipoprotein cholesterol (LDL-C). Proprotein convertase subtilisin/kexin type 9 (PCSK9) attaches to the domain of LDL receptor (LDLR), diminishing LDL-C influx and LDLR cell surface presentation in hepatocytes, resulting in higher circulating LDL-C levels. PCSK9 dysfunction has been linked to lower levels of plasma LDL-C and a decreased risk of coronary heart disease (CHD). Herein, using virtual screening tools, we aimed to identify a potent small-molecule PCSK9 inhibitor in compounds that are currently being studied in clinical trials. We first performed chemical absorption, distribution, metabolism, excretion, and toxicity (ADMET) filtering of 9800 clinical trial compounds obtained from the ZINC 15 database using Lipinski's rule of 5 and obtained 3853 compounds. Two-dimensional (2D) quantitative structure-activity relationship (QSAR) was initiated by computing molecular descriptors and selecting important descriptors of 23 PCSK9 inhibitors. Multivariate calibration was performed with the partial least square regression (PLS) method with 18 compounds for training to design the QSAR model and 5 compounds for the test set to assess the model. The best latent variables (LV) (LV=6) with the lowest value of Root-Mean-Square Error of Cross-Validation (RMSECV) of 0.48 and leave-one-out cross-validation correlation coefficient (R2CV) = 0.83 were obtained for the QSAR model. The low RMSEC (0.21) with high R²cal (0.966) indicates the probability of fit between the experimental data and the calibration model. Using QSAR analysis of 3853 compounds, 2635 had a pIC50<1 and were considered for pharmacophore screening. The PHASE module (a complete package for pharmacophore modeling) designed the pharmacophore hypothesis through multiple ligands. The top 14 compounds (pIC50>1) were defined as active, whereas 9 (pIC50<1) were considered as an inactive set. 
Three five-point pharmacophore hypotheses achieved the highest score: DHHRR1, DHHRR2, and DHRRR1. The best model, with the highest survival score (5.365), was DHHRR1, comprising 1 hydrogen donor (D), 2 hydrophobic groups (H), and 2 aromatic ring (R) features. We selected the molecules with a fitness score higher than 1.5 (257 compounds) in pharmacophore screening (DHHRR1) for molecular docking screening. Molecular docking indicates that ZINC000051951669, with a binding affinity of -13.2 kcal/mol and 2 H-bonds, has the highest binding to the PCSK9 protein. ZINC000011726230 with a binding energy of -11.4 kcal/mol and 3 H-bonds, ZINC000068248147 with a binding affinity of -10.7 kcal/mol and 1 H-bond, and ZINC000029134440 with a binding affinity of -10.6 kcal/mol and 4 H-bonds were ranked next, respectively. To conclude, the molecules identified as PCSK9 inhibitor candidates, especially ZINC000051951669, may significantly inhibit PCSK9 and should be considered in newly designed trials.
Affiliation(s)
- Ali Mahmoudi
- Student Research Committee, Faculty of Medicine, Mashhad University of Medical Sciences, Mashhad, Iran; Department of Medical Biotechnology and Nanotechnology, Faculty of Medicine, Mashhad University of Medical Sciences, Iran
| | - Alexandra E Butler
- Research Department, Royal College of Surgeons in Ireland Bahrain, Adliya, Bahrain
| | - Maciej Banach
- Department of Preventive Cardiology and Lipidology, Medical University of Lodz (MUL) Lodz, Poland; Cardiovascular Research Centre, University of Zielona Gora, Zielona Gora, Poland; Department of Cardiology and Congenital Diseases of Adults, Polish Mother's Memorial Hospital Research institute (PMMHRI), Lodz, Poland; Ciccarone Center for the Prevention of Cardiovascular Disease, Johns Hopkins University School of Medicine, Baltimore, MD
| | - Tannaz Jamialahmadi
- Applied Biomedical Research Center, Mashhad University of Medical Sciences, Mashhad, Iran; Surgical Oncology Research Center, Mashhad University of Medical Sciences, Mashhad, Iran
| | - Amirhossein Sahebkar
- Applied Biomedical Research Center, Mashhad University of Medical Sciences, Mashhad, Iran; Biotechnology Research Center, Pharmaceutical Technology Institute, Mashhad University of Medical Sciences, Mashhad, Iran; School of Medicine, The University of Western Australia, Perth, Australia; Department of Biotechnology, School of Pharmacy, Mashhad University of Medical Sciences, Mashhad, Iran.
| |
|
18
|
Mowlaei ME, Shi X. FSF-GA: A Feature Selection Framework for Phenotype Prediction Using Genetic Algorithms. Genes (Basel) 2023; 14:genes14051059. [PMID: 37239419 DOI: 10.3390/genes14051059] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/17/2023] [Revised: 05/04/2023] [Accepted: 05/06/2023] [Indexed: 05/28/2023] Open
Abstract
(1) Background: Phenotype prediction is a pivotal task in genetics in order to identify how genetic factors contribute to phenotypic differences. This field has seen extensive research, with numerous methods proposed for predicting phenotypes. Nevertheless, the intricate relationship between genotypes and complex phenotypes, including common diseases, has resulted in an ongoing challenge to accurately decipher the genetic contribution. (2) Results: In this study, we propose a novel feature selection framework for phenotype prediction utilizing a genetic algorithm (FSF-GA) that effectively reduces the feature space to identify genotypes contributing to phenotype prediction. We provide a comprehensive vignette of our method and conduct extensive experiments using a widely used yeast dataset. (3) Conclusions: Our experimental results show that our proposed FSF-GA method delivers phenotype prediction performance comparable to that of baseline methods, while providing features selected for predicting phenotypes. These selected feature sets can be used to interpret the underlying genetic architecture that contributes to phenotypic variation.
Affiliation(s)
- Mohammad Erfan Mowlaei
- Department of Computer and Information Sciences, Temple University, 925 N. 12th Street, Philadelphia, PA 19122, USA
| | - Xinghua Shi
- Department of Computer and Information Sciences, Temple University, 925 N. 12th Street, Philadelphia, PA 19122, USA
| |
|
19
|
Kisiel A, Krzemińska A, Cembrowska-Lech D, Miller T. Data Science and Plant Metabolomics. Metabolites 2023; 13:metabo13030454. [PMID: 36984894 PMCID: PMC10054611 DOI: 10.3390/metabo13030454] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/27/2023] [Revised: 03/16/2023] [Accepted: 03/17/2023] [Indexed: 03/30/2023] Open
Abstract
The study of plant metabolism is one of the most complex tasks, mainly due to the huge amount and structural diversity of metabolites, as well as the fact that they react to changes in the environment and ultimately influence each other. Metabolic profiling is most often carried out using tools that include mass spectrometry (MS), which is one of the most powerful analytical methods. All this means that even when analyzing a single sample, we can obtain thousands of data points. Data science has the potential to revolutionize our understanding of plant metabolism. This review demonstrates that machine learning, network analysis, and statistical modeling are some of the techniques being used to analyze large quantities of complex data, providing insights into plant development, growth, and how plants interact with their environment. These findings could be key to improving crop yields, developing new forms of plant biotechnology, and understanding the relationship between plants and microbes. It is also necessary to consider the constraints that come with data science such as quality and availability of data, model complexity, and the need for deep knowledge of the subject in order to achieve reliable outcomes.
Affiliation(s)
- Anna Kisiel
- Institute of Marine and Environmental Sciences, University of Szczecin, Wąska 13, 71-415 Szczecin, Poland
- Polish Society of Bioinformatics and Data Science BIODATA, Popiełuszki 4c, 71-214 Szczecin, Poland
| | - Adrianna Krzemińska
- Polish Society of Bioinformatics and Data Science BIODATA, Popiełuszki 4c, 71-214 Szczecin, Poland
| | - Danuta Cembrowska-Lech
- Polish Society of Bioinformatics and Data Science BIODATA, Popiełuszki 4c, 71-214 Szczecin, Poland
- Department of Physiology and Biochemistry, Institute of Biology, University of Szczecin, Felczaka 3c, 71-412 Szczecin, Poland
| | - Tymoteusz Miller
- Institute of Marine and Environmental Sciences, University of Szczecin, Wąska 13, 71-415 Szczecin, Poland
- Polish Society of Bioinformatics and Data Science BIODATA, Popiełuszki 4c, 71-214 Szczecin, Poland
| |
|
20
|
Marjit S, Bhattacharyya T, Chatterjee B, Sarkar R. Simulated annealing aided genetic algorithm for gene selection from microarray data. Comput Biol Med 2023; 158:106854. [PMID: 37023541 DOI: 10.1016/j.compbiomed.2023.106854] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/07/2022] [Revised: 02/26/2023] [Accepted: 03/30/2023] [Indexed: 04/03/2023]
Abstract
In recent times, microarray gene expression datasets have gained significant popularity due to their usefulness to identify different types of cancer directly through bio-markers. These datasets possess a high gene-to-sample ratio and high dimensionality, with only a few genes functioning as bio-markers. Consequently, a significant amount of data is redundant, and it is essential to filter out important genes carefully. In this paper, we propose the Simulated Annealing aided Genetic Algorithm (SAGA), a meta-heuristic approach to identify informative genes from high-dimensional datasets. SAGA utilizes a two-way mutation-based Simulated Annealing (SA) as well as Genetic Algorithm (GA) to ensure a good trade-off between exploitation and exploration of the search space, respectively. The naive version of GA often gets stuck in a local optimum and depends on the initial population, leading to premature convergence. To address this, we have blended a clustering-based population generation with SA to distribute the initial population of GA over the entire feature space. To further enhance the performance, we reduce the initial search space by a score-based filter approach called the Mutually Informed Correlation Coefficient (MICC). The proposed method is evaluated on 6 microarray and 6 omics datasets. Comparison of SAGA with contemporary algorithms has shown that SAGA performs much better than its peers. Our code is available at https://github.com/shyammarjit/SAGA.
|
21
|
Farooq M, van Dijk AD, Nijveen H, Mansoor S, de Ridder D. Genomic prediction in plants: opportunities for ensemble machine learning based approaches. F1000Res 2023; 11:802. [PMID: 37035464 PMCID: PMC10080209 DOI: 10.12688/f1000research.122437.2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Accepted: 01/04/2023] [Indexed: 01/12/2023] Open
Abstract
Background: Many studies have demonstrated the utility of machine learning (ML) methods for genomic prediction (GP) of various plant traits, but a clear rationale for choosing ML over conventionally used, often simpler parametric methods, is still lacking. Predictive performance of GP models might depend on a plethora of factors including sample size, number of markers, population structure and genetic architecture. Methods: Here, we investigate which problem and dataset characteristics are related to good performance of ML methods for genomic prediction. We compare the predictive performance of two frequently used ensemble ML methods (Random Forest and Extreme Gradient Boosting) with parametric methods including genomic best linear unbiased prediction (GBLUP), reproducing kernel Hilbert space regression (RKHS), BayesA and BayesB. To explore problem characteristics, we use simulated and real plant traits under different genetic complexity levels determined by the number of Quantitative Trait Loci (QTLs), heritability (h2 and h2e), population structure and linkage disequilibrium between causal nucleotides and other SNPs. Results: Decision tree based ensemble ML methods are a better choice for nonlinear phenotypes and are comparable to Bayesian methods for linear phenotypes in the case of large effect Quantitative Trait Nucleotides (QTNs). Furthermore, we find that ML methods are susceptible to confounding due to population structure but less sensitive to low linkage disequilibrium than linear parametric methods. Conclusions: Overall, this provides insights into the role of ML in GP as well as guidelines for practitioners.
Affiliation(s)
- Muhammad Farooq
- Bioinformatics group, Department of Plant Science, Wageningen University and Research, Wageningen, Gelderland, 6708PB, The Netherlands
- Molecular Virology and Gene Silencing Lab, Agricultural Biotechnology Division, National Institute for Biotechnology and Genetic Engineering (NIBGE), Faisalabad, Punjab, 38000, Pakistan
| | - Aalt D.J. van Dijk
- Bioinformatics group, Department of Plant Science, Wageningen University and Research, Wageningen, Gelderland, 6708PB, The Netherlands
| | - Harm Nijveen
- Bioinformatics group, Department of Plant Science, Wageningen University and Research, Wageningen, Gelderland, 6708PB, The Netherlands
| | - Shahid Mansoor
- Molecular Virology and Gene Silencing Lab, Agricultural Biotechnology Division, National Institute for Biotechnology and Genetic Engineering (NIBGE), Faisalabad, Punjab, 38000, Pakistan
| | - Dick de Ridder
- Bioinformatics group, Department of Plant Science, Wageningen University and Research, Wageningen, Gelderland, 6708PB, The Netherlands
| |
|
22
|
Zhang S, Wang J, Li X, Liang Y. M6A-GSMS: Computational identification of N6-methyladenosine sites with GBDT and stacking learning in multiple species. J Biomol Struct Dyn 2022; 40:12380-12391. [PMID: 34459713 DOI: 10.1080/07391102.2021.1970628] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/24/2022]
Abstract
N6-methyladenosine (m6A) is one of the most abundant forms of RNA methylation modifications currently known. It is involved in a wide range of biological processes, including degradation, stability, and alternative splicing. Therefore, convenient and efficient m6A prediction technologies are urgently needed. In this work, a novel predictor based on GBDT and stacking learning, called M6A-GSMS, is developed to identify m6A sites. To achieve accurate prediction, we explore RNA sequence information from four aspects: correlation, structure, physicochemical properties and pseudo ribonucleic acid composition. After using the GBDT algorithm for feature selection, a stacking model is constructed by combining seven basic classifiers. Compared with other state-of-the-art methods, the results show that M6A-GSMS can obtain excellent performance for identifying the m6A sites. The prediction accuracies for A. thaliana, D. melanogaster, M. musculus, S. cerevisiae and human reach 88.4%, 60.8%, 80.5%, 92.4% and 61.8%, respectively. This method provides an effective prediction for the investigation of m6A sites. In addition, all the datasets and codes are currently available at https://github.com/Wang-Jinyue/M6A-GSMS. Communicated by Ramaswamy H. Sarma.
Affiliation(s)
- Shengli Zhang
  - School of Mathematics and Statistics, Xidian University, Xi'an, P. R. China
- Jinyue Wang
  - School of Mathematics and Statistics, Xidian University, Xi'an, P. R. China
- Xinjie Li
  - School of Mathematics and Statistics, Xidian University, Xi'an, P. R. China
- Yunyun Liang
  - School of Science, Xi'an Polytechnic University, Xi'an, P. R. China
23
Kim M, Bae J, Wang B, Ko H, Lim JS. Feature Selection Method Using Multi-Agent Reinforcement Learning Based on Guide Agents. SENSORS (BASEL, SWITZERLAND) 2022; 23:98. [PMID: 36616694 PMCID: PMC9823489 DOI: 10.3390/s23010098] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/23/2022] [Revised: 12/12/2022] [Accepted: 12/16/2022] [Indexed: 06/17/2023]
Abstract
In this study, we propose a method to automatically find features from a dataset that are effective for classification or prediction, using a new method called multi-agent reinforcement learning and a guide agent. Each feature of the dataset has one of the main and guide agents, and these agents decide whether to select a feature. Main agents select the optimal features, and guide agents present the criteria for judging the main agents' actions. After obtaining the main and guide rewards for the features selected by the agents, the main agent that behaves differently from the guide agent updates their Q-values by calculating the learning reward delivered to the main agents. The behavior comparison helps the main agent decide whether its own behavior is correct, without using other algorithms. After performing this process for each episode, the features are finally selected. The feature selection method proposed in this study uses multiple agents, reducing the number of actions each agent can perform and finding optimal features effectively and quickly. Finally, comparative experimental results on multiple datasets show that the proposed method can select effective features for classification and increase classification accuracy.
Affiliation(s)
- Minwoo Kim
  - Department of Computer Science, Gachon University, Sujeong-gu, Seongnam-si 13557, Gyeonggi-do, Republic of Korea
  - AI Team, 2nd R&D Center, MEZOO Co., Ltd., Gieopdosi-ro 200, Jijeong-myeon, Wonju-si 26354, Gangwon-do, Republic of Korea
- Jinhee Bae
  - Department of Computer Science, University of Southern California, Los Angeles, CA 90007, USA
- Bohyun Wang
  - Department of Computer Science, Gachon University, Sujeong-gu, Seongnam-si 13557, Gyeonggi-do, Republic of Korea
- Hansol Ko
  - Department of Computer Science, Gachon University, Sujeong-gu, Seongnam-si 13557, Gyeonggi-do, Republic of Korea
- Joon S. Lim
  - Department of Computer Science, Gachon University, Sujeong-gu, Seongnam-si 13557, Gyeonggi-do, Republic of Korea
24
Wilkinson MJ, Yamashita R, James ME, Bally ISE, Dillon NL, Ali A, Hardner CM, Ortiz-Barrientos D. The influence of genetic structure on phenotypic diversity in the Australian mango (Mangifera indica) gene pool. Sci Rep 2022; 12:20614. [PMID: 36450793 PMCID: PMC9712640 DOI: 10.1038/s41598-022-24800-7] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 09/05/2022] [Accepted: 11/21/2022] [Indexed: 12/11/2022] Open
Abstract
Genomic selection is a promising breeding technique for tree crops to accelerate the development of new cultivars. However, factors such as genetic structure can create spurious associations between genotype and phenotype due to the shared history between populations with different trait values. Genetic structure can therefore reduce the accuracy of the genotype to phenotype map, a fundamental requirement of genomic selection models. Here, we employed 272 single nucleotide polymorphisms from 208 Mangifera indica accessions to explore whether the genetic structure of the Australian mango gene pool explained variation in trunk circumference, fruit blush colour and intensity. Multiple population genetic analyses indicate the presence of four genetic clusters and show that the most genetically differentiated cluster contains accessions imported from Southeast Asia (mainly those from Thailand). We find that genetic structure was strongly associated with three traits: trunk circumference, fruit blush colour and intensity in M. indica. This suggests that the history of these accessions could drive spurious associations between loci and key mango phenotypes in the Australian mango gene pool. Incorporating such genetic structure in associations between genotype and phenotype can improve the accuracy of genomic selection, which can assist the future development of new cultivars.
Affiliation(s)
- Melanie J Wilkinson
  - School of Biological Sciences, The University of Queensland, Brisbane, QLD, 4072, Australia
  - Australian Research Council Centre of Excellence for Plant Success in Nature and Agriculture, The University of Queensland, Brisbane, QLD, 4072, Australia
- Risa Yamashita
  - Queensland Alliance for Agriculture and Food Innovation, The University of Queensland, Brisbane, QLD, 4072, Australia
- Maddie E James
  - School of Biological Sciences, The University of Queensland, Brisbane, QLD, 4072, Australia
  - Australian Research Council Centre of Excellence for Plant Success in Nature and Agriculture, The University of Queensland, Brisbane, QLD, 4072, Australia
- Ian S E Bally
  - Queensland Department of Agriculture and Fisheries, Mareeba, QLD, 4880, Australia
- Natalie L Dillon
  - Queensland Department of Agriculture and Fisheries, Mareeba, QLD, 4880, Australia
- Asjad Ali
  - Queensland Department of Agriculture and Fisheries, Mareeba, QLD, 4880, Australia
- Craig M Hardner
  - Queensland Alliance for Agriculture and Food Innovation, The University of Queensland, Brisbane, QLD, 4072, Australia
- Daniel Ortiz-Barrientos
  - School of Biological Sciences, The University of Queensland, Brisbane, QLD, 4072, Australia
  - Australian Research Council Centre of Excellence for Plant Success in Nature and Agriculture, The University of Queensland, Brisbane, QLD, 4072, Australia
25
Do open citations give insights on the qualitative peer-review evaluation in research assessments? An analysis of the Italian National Scientific Qualification. Scientometrics 2022. [DOI: 10.1007/s11192-022-04581-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/22/2022]
Abstract
In the past, several works have investigated ways for combining quantitative and qualitative methods in research assessment exercises. Indeed, the Italian National Scientific Qualification (NSQ), i.e. the national assessment exercise which aims at deciding whether a scholar can apply to professorial academic positions as Associate Professor and Full Professor, adopts a quantitative and qualitative evaluation process: it makes use of bibliometrics followed by a peer-review process of candidates' CVs. The NSQ divides academic disciplines into two categories, i.e. citation-based disciplines (CDs) and non-citation-based disciplines (NDs), a division that affects the metrics used for assessing the candidates of that discipline in the first part of the process, which is based on bibliometrics. In this work, we aim at exploring whether citation-based metrics, calculated only considering open bibliographic and citation data, can support the human peer-review of NDs and yield insights on how it is conducted. To understand if and what citation-based (and, possibly, other) metrics provide relevant information, we created a series of machine learning models to replicate the decisions of the NSQ committees. As one of the main outcomes of our study, we noticed that the strength of the citational relationship between the candidate and the commission in charge of assessing his/her CV seems to play a role in the peer-review phase of the NSQ of NDs.
26
Li Z, Li W, Yan W, Zhang R, Xie S. Data-driven learning to identify biomarkers in bipolar disorder. COMPUTER METHODS AND PROGRAMS IN BIOMEDICINE 2022; 226:107112. [PMID: 36156436 DOI: 10.1016/j.cmpb.2022.107112] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/16/2021] [Revised: 06/09/2022] [Accepted: 09/04/2022] [Indexed: 06/16/2023]
Abstract
BACKGROUND AND OBJECTIVE: Bipolar disorder (BD) is one of the primary causes of disability globally and can be easily misdiagnosed as schizophrenia or major depression due to their similar symptoms. Hence, it is of great significance to explore the pathogenesis of BD. Statistical analysis is currently the most common method for exploring the neuropathological mechanisms of psychiatric disorders. However, this method only considers the relationship between groups and does not reflect the individual-level diagnosis. Therefore, we developed machine learning algorithms to measure pathological brain changes in psychiatric disorders. METHODS: An autoencoder and a feature selection method are proposed to identify the abnormal structural patterns of BD in this study. The autoencoder was constructed using structural imaging data from 1113 healthy controls, which aims to define the normal range of anatomical deviations to distinguish healthy individuals from BD patients. The biomarkers of BD were identified by the reconstruction errors in each brain region. The proposed feature selection (FS)-select framework aimed to determine the optimal FS method and identify the most reproducible feature associated with BD. RESULTS: We found that the left orbital region of the middle frontal gyrus had the greatest difference between healthy controls and BD patients using a trained autoencoder. The most reproducible feature was the left orbital region of the middle frontal gyrus by FS-select framework when using the different cross-validation strategies. CONCLUSIONS: A consistent result was obtained from the above two proposed methods wherein a significant difference between healthy controls and BD patients was identified in the left orbital region of the middle frontal gyrus.
Affiliation(s)
- Zhuangzhuang Li
  - College of Telecommunication and Information Engineering, Nanjing University of Posts and Telecommunications, Nanjing 210023, China
- Wenmei Li
  - School of Geographic and Biologic Information, Nanjing University of Posts and Telecommunications, Nanjing 210023, China
- Wei Yan
  - Department of Psychiatry affiliated Nanjing Brain Hospital, Nanjing Medical University, Nanjing 210029, China
- Rongrong Zhang
  - Department of Psychiatry affiliated Nanjing Brain Hospital, Nanjing Medical University, Nanjing 210029, China
- Shiping Xie
  - Department of Psychiatry affiliated Nanjing Brain Hospital, Nanjing Medical University, Nanjing 210029, China
27
A divide-and-conquer approach for genomic prediction in rubber tree using machine learning. Sci Rep 2022; 12:18023. [PMID: 36289298 PMCID: PMC9605989 DOI: 10.1038/s41598-022-20416-z] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 04/19/2022] [Accepted: 09/13/2022] [Indexed: 01/20/2023] Open
Abstract
Rubber tree (Hevea brasiliensis) is the main feedstock for commercial rubber; however, its long vegetative cycle has hindered the development of more productive varieties via breeding programs. With the availability of H. brasiliensis genomic data, several linkage maps with associated quantitative trait loci have been constructed and suggested as a tool for marker-assisted selection. Nonetheless, novel genomic strategies are still needed, and genomic selection (GS) may facilitate rubber tree breeding programs aimed at reducing the required cycles for performance assessment. Even though such a methodology has already been shown to be a promising tool for rubber tree breeding, increased model predictive capabilities and practical application are still needed. Here, we developed a novel machine learning-based approach for predicting rubber tree stem circumference based on molecular markers. Through a divide-and-conquer strategy, we propose a neural network prediction system with two stages: (1) subpopulation prediction and (2) phenotype estimation. This approach yielded higher accuracies than traditional statistical models in a single-environment scenario. By delivering large accuracy improvements, our methodology represents a powerful tool for use in Hevea GS strategies. Therefore, the incorporation of machine learning techniques into rubber tree GS represents an opportunity to build more robust models and optimize Hevea breeding programs.
28
Dutta A, Hasan MK, Ahmad M, Awal MA, Islam MA, Masud M, Meshref H. Early Prediction of Diabetes Using an Ensemble of Machine Learning Models. INTERNATIONAL JOURNAL OF ENVIRONMENTAL RESEARCH AND PUBLIC HEALTH 2022; 19:ijerph191912378. [PMID: 36231678 PMCID: PMC9566114 DOI: 10.3390/ijerph191912378] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 07/28/2022] [Revised: 09/20/2022] [Accepted: 09/24/2022] [Indexed: 05/15/2023]
Abstract
Diabetes is one of the most rapidly spreading diseases in the world, resulting in an array of significant complications, including cardiovascular disease, kidney failure, diabetic retinopathy, and neuropathy, among others, which contribute to an increase in morbidity and mortality rate. If diabetes is diagnosed at an early stage, its severity and underlying risk factors can be significantly reduced. However, there is a shortage of labeled data and the occurrence of outliers or data missingness in clinical datasets that are reliable and effective for diabetes prediction, making it a challenging endeavor. Therefore, we introduce a newly labeled diabetes dataset from a South Asian nation (Bangladesh). In addition, we suggest an automated classification pipeline that includes a weighted ensemble of machine learning (ML) classifiers: Naive Bayes (NB), Random Forest (RF), Decision Tree (DT), XGBoost (XGB), and LightGBM (LGB). Grid search hyperparameter optimization is employed to tune the critical hyperparameters of these ML models. Furthermore, missing value imputation, feature selection, and K-fold cross-validation are included in the framework design. A statistical analysis of variance (ANOVA) test reveals that the performance of diabetes prediction significantly improves when the proposed weighted ensemble (DT + RF + XGB + LGB) is executed with the introduced preprocessing, with the highest accuracy of 0.735 and an area under the ROC curve (AUC) of 0.832. In conjunction with the suggested ensemble model, our statistical imputation and RF-based feature selection techniques produced the best results for early diabetes prediction. Moreover, the presented new dataset will contribute to developing and implementing robust ML models for diabetes prediction utilizing population-level data.
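As an illustration of the weighted-ensemble idea in this abstract, the following minimal Python sketch combines per-model positive-class probabilities by a weighted average (soft voting). The abstract does not say how the ensemble weights or the decision threshold were chosen, so the function name `weighted_soft_vote`, the probabilities, the weights, and the 0.5 threshold are all hypothetical.

```python
from typing import Dict, List


def weighted_soft_vote(probas: Dict[str, List[float]],
                       weights: Dict[str, float]) -> List[float]:
    """Weighted average of per-model positive-class probabilities."""
    total = sum(weights[name] for name in probas)
    n_samples = len(next(iter(probas.values())))
    return [
        sum(weights[name] * probas[name][i] for name in probas) / total
        for i in range(n_samples)
    ]


# Hypothetical probabilities from four classifiers for three patients.
probas = {
    "DT": [0.9, 0.2, 0.6],
    "RF": [0.8, 0.3, 0.4],
    "XGB": [0.7, 0.1, 0.5],
    "LGB": [0.6, 0.2, 0.5],
}
# Hypothetical weights; in practice they might be tuned on validation data.
weights = {"DT": 1.0, "RF": 2.0, "XGB": 3.0, "LGB": 2.0}

combined = weighted_soft_vote(probas, weights)
labels = [int(p >= 0.5) for p in combined]  # assumed 0.5 decision threshold
print(combined, labels)
```

A model with a larger weight pulls the combined probability toward its own estimate, which is why the third patient ends up just below the assumed threshold here.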
Affiliation(s)
- Aishwariya Dutta
  - Department of Biomedical Engineering (BME), Khulna University of Engineering & Technology (KUET), Khulna 9203, Bangladesh
  - Department of Biomedical Engineering (BME), Military Institute of Science and Technology (MIST), Mirpur Cantonment, Dhaka 1216, Bangladesh
- Md. Kamrul Hasan
  - Department of Electrical and Electronic Engineering (EEE), Khulna University of Engineering & Technology (KUET), Khulna 9203, Bangladesh
- Mohiuddin Ahmad
  - Department of Electrical and Electronic Engineering (EEE), Khulna University of Engineering & Technology (KUET), Khulna 9203, Bangladesh
- Md. Abdul Awal
  - School of Information Technology and Electrical Engineering, The University of Queensland, Brisbane, QLD 4072, Australia
  - Electronics and Communication Engineering (ECE) Discipline, Khulna University (KU), Khulna 9208, Bangladesh
  - Correspondence:
- Mehedi Masud
  - Department of Computer Science, College of Computers and Information Technology, Taif University, P.O. Box 11099, Taif 21944, Saudi Arabia
- Hossam Meshref
  - Department of Computer Science, College of Computers and Information Technology, Taif University, P.O. Box 11099, Taif 21944, Saudi Arabia
29
Cho E, Cho S, Kim M, Ediriweera TK, Seo D, Lee SS, Cha J, Jin D, Kim YK, Lee JH. Single nucleotide polymorphism marker combinations for classifying Yeonsan Ogye chicken using a machine learning approach. JOURNAL OF ANIMAL SCIENCE AND TECHNOLOGY 2022; 64:830-841. [PMID: 36287747 PMCID: PMC9574617 DOI: 10.5187/jast.2022.e64] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/13/2022] [Revised: 07/15/2022] [Accepted: 08/01/2022] [Indexed: 11/27/2022]
Abstract
Genetic analysis has great potential as a tool to differentiate between different species and breeds of livestock. In this study, the optimal combinations of single nucleotide polymorphism (SNP) markers for discriminating the Yeonsan Ogye chicken (Gallus gallus domesticus) breed were identified using high-density 600K SNP array data. In 3,904 individuals from 198 chicken breeds, SNP markers specific to the target population were discovered through a case-control genome-wide association study (GWAS) and filtered out based on the linkage disequilibrium blocks. Significant SNP markers were selected by feature selection applying two machine learning algorithms: Random Forest (RF) and AdaBoost (AB). Using a machine learning approach, the 38 (RF) and 43 (AB) optimal SNP marker combinations for the Yeonsan Ogye chicken population demonstrated 100% accuracy. Hence, the GWAS and machine learning models used in this study can be efficiently utilized to identify the optimal combination of markers for discriminating target populations using multiple SNP markers.
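To make the case-control screening step in this abstract concrete, the sketch below scores a single SNP with an allelic chi-square test on a 2x2 table of allele counts. The abstract does not state which association test the authors used, so this test is only one common choice; the function name `allelic_chi2` and the allele counts are hypothetical.

```python
import math
from typing import Tuple


def allelic_chi2(case_alt: int, case_ref: int,
                 ctrl_alt: int, ctrl_ref: int) -> Tuple[float, float]:
    """Pearson chi-square (1 degree of freedom) on a 2x2 allele-count table."""
    a, b, c, d = case_alt, case_ref, ctrl_alt, ctrl_ref
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0, 1.0  # a row or column is empty: no evidence either way
    chi2 = n * (a * d - b * c) ** 2 / denom
    # Survival function of the chi-square distribution with 1 df.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical allele counts for one SNP: target breed (cases) vs. other breeds.
chi2, p_value = allelic_chi2(case_alt=30, case_ref=10, ctrl_alt=20, ctrl_ref=20)
print(chi2, p_value)
```

In a genome-wide scan this statistic would be computed per SNP and the resulting p-values corrected for multiple testing before any markers are passed on to a feature-selection step.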
Affiliation(s)
- Eunjin Cho
  - Department of Bio-AI Convergence, Chungnam National University, Daejeon 34134, Korea
- Sunghyun Cho
  - Research and Development Center, Insilicogen Inc., Yongin 19654, Korea
- Minjun Kim
  - Division of Animal and Dairy Science, Chungnam National University, Daejeon 34134, Korea
- Dongwon Seo
  - Department of Bio-AI Convergence, Chungnam National University, Daejeon 34134, Korea
  - Research Institute TNT Research Company, Jeonju 54810, Korea
- Jihye Cha
  - Animal Genome & Bioinformatics, National Institute of Animal Science, Rural Development Administration, Wanju 55365, Korea
- Daehyeok Jin
  - Animal Genetic Resources Research Center, National Institute of Animal Science, Rural Development Administration, Hamyang 50000, Korea
- Young-Kuk Kim
  - Department of Bio-AI Convergence, Chungnam National University, Daejeon 34134, Korea
- Jun Heon Lee
  - Department of Bio-AI Convergence, Chungnam National University, Daejeon 34134, Korea
  - Division of Animal and Dairy Science, Chungnam National University, Daejeon 34134, Korea
  - Corresponding author: Jun Heon Lee, Department of Bio-AI Convergence, Chungnam National University, Daejeon 34134, Korea. Tel: +82-42-821-5779, E-mail:
30
Multichannel Acoustic Spectroscopy of the Human Body for Inviolable Biometric Authentication. BIOSENSORS 2022; 12:bios12090700. [PMID: 36140085 PMCID: PMC9496529 DOI: 10.3390/bios12090700] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/12/2022] [Revised: 08/26/2022] [Accepted: 08/29/2022] [Indexed: 11/17/2022]
Abstract
Specific features of the human body, such as fingerprint, iris, and face, are extensively used in biometric authentication. Conversely, the internal structure and material features of the body have not been explored extensively in biometrics. Bioacoustics technology is suitable for extracting information about the internal structure and biological and material characteristics of the human body. Herein, we report a biometric authentication method that enables multichannel bioacoustic signal acquisition with a systematic approach to study the effects of selectively distilled frequency features, increasing the number of sensing channels with respect to multiple fingers. The accuracy of identity recognition according to the number of sensing channels and the number of selectively chosen frequency features was evaluated using exhaustive combination searches and forward-feature selection. The technique was applied to test the accuracy of machine learning classification using 5,232 datasets from 54 subjects. By optimizing the scanning frequency and sensing channels, our method achieved an accuracy of 99.62%, which is comparable to existing biometric methods. Overall, the proposed biometric method not only provides an unbreakable, inviolable biometric but also can be applied anywhere in the body and can substantially broaden the use of biometrics by enabling continuous identity recognition on various body parts for biometric identity authentication.
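The forward-feature selection mentioned in this abstract can be sketched as a greedy loop: start from an empty set and repeatedly add the feature that most improves a score, stopping when nothing helps. In the study the score was classification accuracy; in this minimal Python sketch the scoring function `toy_accuracy`, the feature names, and the gain values are hypothetical stand-ins for a real cross-validated classifier.

```python
from typing import Callable, FrozenSet, List, Sequence


def forward_select(features: Sequence[str],
                   score: Callable[[FrozenSet[str]], float],
                   max_features: int) -> List[str]:
    """Greedy forward selection: add the best feature until no gain remains."""
    selected: List[str] = []
    best = float("-inf")
    while len(selected) < max_features:
        candidates = [f for f in features if f not in selected]
        if not candidates:
            break
        trial = {f: score(frozenset(selected + [f])) for f in candidates}
        top = max(trial, key=trial.get)
        if trial[top] <= best:
            break  # no remaining candidate improves the score
        selected.append(top)
        best = trial[top]
    return selected


def toy_accuracy(subset: FrozenSet[str]) -> float:
    """Hypothetical accuracy: additive gains minus a redundancy penalty."""
    gain = {"ch1_low": 0.30, "ch1_high": 0.25, "ch2_low": 0.08}
    acc = 0.5 + sum(gain[f] for f in sorted(subset))
    if {"ch1_low", "ch1_high"} <= subset:
        acc -= 0.30  # assume the two channel-1 features are largely redundant
    return acc


chosen = forward_select(["ch1_low", "ch1_high", "ch2_low"], toy_accuracy, 3)
print(chosen)
```

Greedy search evaluates far fewer subsets than the exhaustive combination search also mentioned in the abstract, at the cost of possibly missing the best subset.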
31
Shen C, Zhang K. Two-stage improved Grey Wolf optimization algorithm for feature selection on high-dimensional classification. COMPLEX INTELL SYST 2022. [DOI: 10.1007/s40747-021-00452-4] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 01/12/2023]
Abstract
In recent years, evolutionary algorithms have shown great advantages in the field of feature selection because of their simplicity and potential global search capability. However, most of the existing feature selection algorithms based on evolutionary computation are wrapper methods, which are computationally expensive, especially for high-dimensional biomedical data. To significantly reduce the computational cost, it is essential to study an effective evaluation method. In this paper, a two-stage improved gray wolf optimization (IGWO) algorithm for feature selection on high-dimensional data is proposed. In the first stage, a multilayer perceptron (MLP) network with group lasso regularization terms is first trained to construct an integer optimization problem using the proposed algorithm for pre-selection of features and optimization of the hidden layer structure. The dataset is compressed using the feature subset obtained in the first stage. In the second stage, a multilayer perceptron network with group lasso regularization terms is retrained using the compressed dataset, and the proposed algorithm is employed to construct the discrete optimization problem for feature selection. Meanwhile, a rapid evaluation strategy is constructed to mitigate the evaluation cost and improve the evaluation efficiency in the feature selection process. The effectiveness of the algorithm was analyzed on ten gene expression datasets. The experimental results show that the proposed algorithm not only removes more than 95.7% of the features in all datasets, but also has better classification accuracy on the test set. In addition, the advantages of the proposed algorithm in terms of time consumption, classification accuracy and feature subset size become more and more prominent as the dimensionality of the feature selection problem increases. This indicates that the proposed algorithm is particularly suitable for solving high-dimensional feature selection problems.
32
Farooq M, van Dijk AD, Nijveen H, Mansoor S, de Ridder D. Genomic prediction in plants: opportunities for ensemble machine learning based approaches. F1000Res 2022; 11:802. [PMID: 37035464 PMCID: PMC10080209 DOI: 10.12688/f1000research.122437.1] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Accepted: 07/08/2022] [Indexed: 12/15/2022] Open
Abstract
Background: Many studies have demonstrated the utility of machine learning (ML) methods for genomic prediction (GP) of various plant traits, but a clear rationale for choosing ML over conventionally used, often simpler parametric methods, is still lacking. Predictive performance of GP models might depend on a plethora of factors including sample size, number of markers, population structure and genetic architecture. Methods: Here, we investigate which problem and dataset characteristics are related to good performance of ML methods for genomic prediction. We compare the predictive performance of two frequently used ensemble ML methods (Random Forest and Extreme Gradient Boosting) with parametric methods including genomic best linear unbiased prediction (GBLUP), reproducing kernel Hilbert space regression (RKHS), BayesA and BayesB. To explore problem characteristics, we use simulated and real plant traits under different genetic complexity levels determined by the number of Quantitative Trait Loci (QTLs), heritability (h2 and h2e), population structure and linkage disequilibrium between causal nucleotides and other SNPs. Results: Decision tree based ensemble ML methods are a better choice for nonlinear phenotypes and are comparable to Bayesian methods for linear phenotypes in the case of large effect Quantitative Trait Nucleotides (QTNs). Furthermore, we find that ML methods are susceptible to confounding due to population structure but less sensitive to low linkage disequilibrium than linear parametric methods. Conclusions: Overall, this provides insights into the role of ML in GP as well as guidelines for practitioners.
Affiliation(s)
- Muhammad Farooq
  - Bioinformatics group, Department of Plant Science, Wageningen University and Research, Wageningen, Gelderland, 6708PB, The Netherlands
  - Molecular Virology and Gene Silencing Lab, Agricultural Biotechnology Division, National Institute for Biotechnology and Genetic Engineering (NIBGE), Faisalabad, Punjab, 38000, Pakistan
- Aalt D.J. van Dijk
  - Bioinformatics group, Department of Plant Science, Wageningen University and Research, Wageningen, Gelderland, 6708PB, The Netherlands
- Harm Nijveen
  - Bioinformatics group, Department of Plant Science, Wageningen University and Research, Wageningen, Gelderland, 6708PB, The Netherlands
- Shahid Mansoor
  - Molecular Virology and Gene Silencing Lab, Agricultural Biotechnology Division, National Institute for Biotechnology and Genetic Engineering (NIBGE), Faisalabad, Punjab, 38000, Pakistan
- Dick de Ridder
  - Bioinformatics group, Department of Plant Science, Wageningen University and Research, Wageningen, Gelderland, 6708PB, The Netherlands
33
Jha AK, Mithun S, Purandare NC, Kumar R, Rangarajan V, Wee L, Dekker A. Radiomics: a quantitative imaging biomarker in precision oncology. Nucl Med Commun 2022; 43:483-493. [PMID: 35131965 DOI: 10.1097/mnm.0000000000001543] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Indexed: 11/25/2022]
Abstract
Cancer treatment is heading towards precision medicine driven by genetic and biochemical markers. Various genetic and biochemical markers are utilized to render personalized treatment in cancer. In the last decade, noninvasive imaging biomarkers have also been developed to assist personalized decision support systems in oncology. One such imaging biomarker, radiomics, is being researched to develop specific digital phenotypes of tumors in cancer. Radiomics is a process to extract high-throughput data from medical images by using advanced mathematical and statistical algorithms. The radiomics process involves several steps: image generation, segmentation of the region of interest (e.g. a tumor), image preprocessing, radiomic feature extraction, feature analysis and selection, and finally prediction model development. The radiomics process explores the heterogeneity, irregularity and size parameters of the tumor to calculate thousands of advanced features. Our study investigates the role of radiomics in precision oncology. Radiomics research has witnessed rapid growth in the last decade, with several published studies showing the potential of radiomics in diagnosis and treatment outcome prediction in oncology. Several radiomics-based prediction models have been developed and reported in the literature to predict various endpoints, i.e., overall survival, progression-free survival and recurrence, in various cancers, including brain tumor, head and neck cancer, lung cancer and several other cancer types. Radiomics-based digital phenotypes have shown promising results in diagnosis and treatment outcome prediction in oncology. In the coming years, radiomics is going to play a significant role in precision oncology.
Collapse
Affiliation(s)
- Ashish Kumar Jha
- Department of Radiation Oncology (Maastro), GROW School for Oncology, Maastricht University Medical Centre+, The Netherlands
- Department of Nuclear Medicine and Molecular Imaging, Tata Memorial Hospital
- Homi Bhabha National Institute (HBNI), Deemed University, Mumbai
| | - Sneha Mithun
- Department of Radiation Oncology (Maastro), GROW School for Oncology, Maastricht University Medical Centre+, The Netherlands
- Department of Nuclear Medicine and Molecular Imaging, Tata Memorial Hospital
- Homi Bhabha National Institute (HBNI), Deemed University, Mumbai
| | - Nilendu C Purandare
- Department of Nuclear Medicine and Molecular Imaging, Tata Memorial Hospital
- Homi Bhabha National Institute (HBNI), Deemed University, Mumbai
| | - Rakesh Kumar
- Department of Nuclear Medicine, All India Institute of Medical Science, New Delhi, India
| | - Venkatesh Rangarajan
- Department of Nuclear Medicine and Molecular Imaging, Tata Memorial Hospital
- Homi Bhabha National Institute (HBNI), Deemed University, Mumbai
| | - Leonard Wee
- Department of Radiation Oncology (Maastro), GROW School for Oncology, Maastricht University Medical Centre+, The Netherlands
| | - Andre Dekker
- Department of Radiation Oncology (Maastro), GROW School for Oncology, Maastricht University Medical Centre+, The Netherlands
| |
|
34
|
Yang Z, Liu X, Li T, Wu D, Wang J, Zhao Y, Han H. A systematic literature review of methods and datasets for anomaly-based network intrusion detection. Comput Secur 2022. [DOI: 10.1016/j.cose.2022.102675] [Citation(s) in RCA: 12] [Impact Index Per Article: 4.0] [Indexed: 11/16/2022]
|
35
|
A2BCF: An Automated ABC-Based Feature Selection Algorithm for Classification Models in an Education Application. APPLIED SCIENCES-BASEL 2022. [DOI: 10.3390/app12073553] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/16/2022]
Abstract
Feature selection is an essential preprocessing step in Machine Learning (ML) that can significantly impact the performance of ML models. It is considered one of the most crucial phases of automated ML (AutoML). Feature selection aims to find the optimal subset of features and remove the noninformative features from the dataset. It also reduces computational time and makes the data more understandable to the learning model. Various heuristic search strategies exist to address combinatorial optimization challenges. This paper develops an Automated Artificial Bee Colony-based algorithm for Feature Selection (A2BCF) to solve a classification problem. The proposed algorithm is evaluated in education science on a binary classification problem, namely predicting undergraduate student success. The modifications made to the original Artificial Bee Colony algorithm make it a well-performing approach.
|
36
|
Sheh A, Artim SC, Burns MA, Molina-Mora JA, Lee MA, Dzink-Fox J, Muthupalani S, Fox JG. Alterations in common marmoset gut microbiome associated with duodenal strictures. Sci Rep 2022; 12:5277. [PMID: 35347206 PMCID: PMC8960757 DOI: 10.1038/s41598-022-09268-9] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 10/07/2021] [Accepted: 03/21/2022] [Indexed: 12/13/2022] Open
Abstract
Chronic gastrointestinal (GI) diseases are the most common diseases in captive common marmosets (Callithrix jacchus). Despite standardized housing, diet and husbandry, a recently described gastrointestinal syndrome characterized by duodenal ulcers and strictures was observed in a subset of marmosets sourced from the New England Primate Research Center. As changes in the gut microbiome have been associated with GI diseases, the gut microbiomes of 52 healthy, non-stricture marmosets (153 samples) were compared to those of 21 captive marmosets diagnosed with a duodenal ulcer/stricture (57 samples). No significant changes were observed using alpha diversity metrics, and while the community structure was significantly different when comparing beta diversity between healthy and stricture cases, the results were inconclusive due to differences observed in the dispersion of both datasets. Differences in the abundance of individual taxa were identified using ANCOM: stricture-associated dysbiosis was characterized by Anaerobiospirillum loss and Clostridium perfringens increases. To identify microbial and serum biomarkers that could help classify stricture cases, we developed models using machine learning algorithms (random forest, classification and regression trees, support vector machines and k-nearest neighbors) to classify microbiome, serum chemistry or complete blood count (CBC) data. Random forest (RF) models were the most accurate models and correctly classified strictures using either 9 ASVs (amplicon sequence variants), 4 serum chemistry tests or 6 CBC tests. Based on the RF model and ANCOM results, C. perfringens was identified as a potential causative agent associated with the development of strictures. Clostridium perfringens was also isolated by microbiological culture in 4 of 9 duodenum samples from marmosets with histologically confirmed strictures. Due to the enrichment of C. perfringens in situ, we analyzed frozen duodenal tissues using both 16S microbiome profiling and RNAseq. Microbiome analysis of the duodenal tissues of 29 marmosets from the MIT colony confirmed an increased abundance of Clostridium in stricture cases. Comparison of the duodenal gene expression from stricture and non-stricture marmosets found enrichment of genes associated with intestinal absorption and with lipid metabolism, localization, and transport in stricture cases. Using machine learning, we identified increased abundance of C. perfringens as a potential causative agent of GI disease and intestinal strictures in marmosets.
Affiliation(s)
- Alexander Sheh
- Division of Comparative Medicine, Massachusetts Institute of Technology, Cambridge, MA, USA.
| | - Stephen C Artim
- Division of Comparative Medicine, Massachusetts Institute of Technology, Cambridge, MA, USA
- Merck Research Laboratories, Merck, South San Francisco, CA, USA
| | - Monika A Burns
- Division of Comparative Medicine, Massachusetts Institute of Technology, Cambridge, MA, USA
| | - Jose Arturo Molina-Mora
- Centro de Investigación en Enfermedades Tropicales (CIET), Universidad de Costa Rica, San José, Costa Rica
| | - Mary Anne Lee
- Division of Comparative Medicine, Massachusetts Institute of Technology, Cambridge, MA, USA
- Department of Biological Sciences, Wellesley College, Wellesley, MA, USA
| | - JoAnn Dzink-Fox
- Division of Comparative Medicine, Massachusetts Institute of Technology, Cambridge, MA, USA
| | | | - James G Fox
- Division of Comparative Medicine, Massachusetts Institute of Technology, Cambridge, MA, USA.
| |
|
37
|
Ferrucci R, Mameli F, Ruggiero F, Reitano M, Miccoli M, Gemignani A, Conversano C, Dini M, Zago S, Piacentini S, Poletti B, Priori A, Orrù G. Alternate fluency in Parkinson’s disease: A machine learning analysis. PLoS One 2022; 17:e0265803. [PMID: 35320291 PMCID: PMC8942276 DOI: 10.1371/journal.pone.0265803] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 09/23/2020] [Accepted: 03/08/2022] [Indexed: 11/18/2022] Open
Abstract
Objective
The aim of the present study was to investigate whether patients with Parkinson’s Disease (PD) had changes in their level of performance in extra-dimensional shifting by implementing a novel analysis method, utilizing the new alternate phonemic/semantic fluency test.
Method
We used machine learning (ML) to develop high-accuracy classifiers that distinguish PD patients with high scores from those with low scores on the alternate fluency test.
Results
The resulting models classified participants with accuracies between 80% and 90%. The predictor that most efficiently classified participants as low or high performers was the semantic fluency test. The optimal cut-off of a decision rule based on this test yielded an accuracy of 86.96%. After the semantic fluency test was removed from the system, the parameter that contributed most to the classification was the phonemic fluency test. The best cut-offs were identified, and the decision rule yielded an overall accuracy of 80.43%. Lastly, when classification was based on the shifting index, the best cut-offs from an optimal single rule yielded an overall accuracy of 83.69%.
Conclusion
We found that ML analysis of semantic and phonemic verbal fluency may be used to identify simple rules with high accuracy and good out of sample generalization, allowing the detection of executive deficits in patients with PD.
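The single-predictor decision rules described in the Results can be illustrated with a short sketch. This is a minimal example under stated assumptions, not the authors' procedure: it scans candidate cut-offs on one test score and keeps the one with the highest accuracy. The function name `best_cutoff`, the scores, and the labels are hypothetical and chosen only for illustration.

```python
def best_cutoff(scores, labels):
    """Return (cutoff, accuracy) for the rule 'score >= cutoff -> high performer'.

    scores: numeric test scores (hypothetical data, not from the study).
    labels: 1 for high performer, 0 for low performer.
    """
    best = (None, -1.0)
    # Candidate cut-offs: every distinct observed score, in ascending order.
    for cutoff in sorted(set(scores)):
        predictions = [1 if s >= cutoff else 0 for s in scores]
        correct = sum(p == y for p, y in zip(predictions, labels))
        accuracy = correct / len(labels)
        # Strict '>' keeps the lowest cut-off when several tie.
        if accuracy > best[1]:
            best = (cutoff, accuracy)
    return best


# Illustrative values only.
scores = [12, 15, 18, 22, 25, 27, 30, 33]
labels = [0, 0, 0, 1, 0, 1, 1, 1]
cutoff, accuracy = best_cutoff(scores, labels)
```

An accuracy obtained this way is measured on the same data used to choose the cut-off, so it tends to be optimistic; an out-of-sample estimate would require applying the chosen cut-off to held-out participants.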
Affiliation(s)
- Roberta Ferrucci
- Department of Health Sciences, Aldo Ravelli Research Center, University of Milan, Milan, Italy
- ASST-Santi Paolo e Carlo Hospital, Milan, Italy
- IRCCS Ca’ Granda Foundation, Policlinico of Milan, Milan, Italy
| | | | | | | | - Mario Miccoli
- Department of Clinical and Experimental Medicine, University of Pisa, Pisa, Italy
| | - Angelo Gemignani
- Department of Surgical, Medical, Molecular & Critical Area Pathology, University of Pisa, Pisa, Italy
| | - Ciro Conversano
- Department of Surgical, Medical, Molecular & Critical Area Pathology, University of Pisa, Pisa, Italy
| | - Michelangelo Dini
- Department of Health Sciences, Aldo Ravelli Research Center, University of Milan, Milan, Italy
| | - Stefano Zago
- IRCCS Ca’ Granda Foundation, Policlinico of Milan, Milan, Italy
| | | | | | - Alberto Priori
- Department of Health Sciences, Aldo Ravelli Research Center, University of Milan, Milan, Italy
- ASST-Santi Paolo e Carlo Hospital, Milan, Italy
| | - Graziella Orrù
- Department of Surgical, Medical, Molecular & Critical Area Pathology, University of Pisa, Pisa, Italy
| |
|
38
|
Melek Manshouri N. Identifying COVID-19 by using spectral analysis of cough recordings: a distinctive classification study. Cogn Neurodyn 2022; 16:239-253. [PMID: 34341676 PMCID: PMC8320312 DOI: 10.1007/s11571-021-09695-w] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Received: 01/31/2021] [Revised: 05/04/2021] [Accepted: 07/01/2021] [Indexed: 12/19/2022] Open
Abstract
Sound signals from the respiratory system are widely regarded as indicators of human health. Early diagnosis of respiratory tract diseases is of great importance because delay can cause irreversible harm to health. The Coronavirus pandemic, which has deeply shaken the world, has underlined the importance of such diagnosis even further. During the pandemic, researchers have focused on differentiating its symptoms from those of similar diseases such as influenza. Among these symptoms, differences in cough sound have played a distinctive role in research. The dataset consisted of clinical data collected under the supervision of doctors in a reliable environment from 16 subjects suspected of COVID-19 with a specific patient demographic. Using the polymerase chain reaction test, the suspected subjects were divided into negative and positive groups. The negative and positive labels represent patients with a non-COVID cough and with a COVID-19 cough, respectively. A 3D plot, or waterfall representation, of the signal frequency spectrum reveals the salient features of the cough data. In this way, COVID-19 coughs can be differentiated from other coughs by applying effective feature extraction and classification techniques. Power spectral density based on the short-time Fourier transform and mel-frequency cepstral coefficients (MFCC) were chosen as the feature extraction methods. Among the classification techniques, the support vector machine (SVM) algorithm was applied to the processed signals to identify and classify COVID-19 coughs. COVID-19 coughs were detected with 95.86% classification accuracy using the radial basis function (RBF) kernel of the SVM together with the MFCC method, with a sensitivity of 98.6% and a specificity of 91.7%.
Affiliation(s)
- Negin Melek Manshouri
- Department of Electrical and Electronics Engineering, Faculty of Engineering, Avrasya University, 61080 Trabzon, Turkey
| |
|
39
|
Zhang Y, Ma Y, Yang X. Multi-label feature selection based on logistic regression and manifold learning. APPL INTELL 2022. [DOI: 10.1007/s10489-021-03008-8] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 11/02/2022]
|
40
|
Xu HZ, Peng XR, Liu YR, Lei X, Yu J. Sleep Quality Modulates the Association between Dynamic Functional Network Connectivity and Cognitive Function in Healthy Older Adults. Neuroscience 2022; 480:131-142. [PMID: 34785273 DOI: 10.1016/j.neuroscience.2021.11.018] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 07/29/2021] [Revised: 11/01/2021] [Accepted: 11/08/2021] [Indexed: 12/12/2022]
Abstract
Aging is associated with changes in sleep, brain activity, and cognitive function, as well as in the associations among these factors; however, the precise nature of these changes has not been elucidated. This study systematically investigated the modulatory effect of sleep on the relationship between brain functional network connectivity (FNC) and cognitive function in older adults. In total, 107 community-dwelling healthy older adults were recruited and assigned to poor sleep and good sleep groups based on the Pittsburgh Sleep Quality Index. The static functional network connectivity (sFNC), the temporal variability of dynamic FNC (dFNC) from variance (dFNC-var), and the dFNC from clustering state (dFNC-state) were calculated. Corresponding cognition-predictive models were constructed for each sleep group. dFNC, but not sFNC, significantly predicted cognitive function in older adults. Specifically, sleep played a modulatory role in the association between dFNC and cognitive function, with sleep-specific variations at both microscopic (i.e., specific edges) and macroscopic levels (i.e., specific states) of dFNC.
Affiliation(s)
- Hong-Zhou Xu
- Faculty of Psychology, Southwest University, Chongqing, China
| | - Xue-Rui Peng
- Faculty of Psychology, Southwest University, Chongqing, China; Lifespan Developmental Neuroscience, Faculty of Psychology, Technische Universität Dresden, Dresden, Germany
| | - Yun-Rui Liu
- Faculty of Psychology, Southwest University, Chongqing, China; Center for Cognitive and Decision Sciences, Faculty of Psychology, University of Basel, Basel, Switzerland
| | - Xu Lei
- Faculty of Psychology, Southwest University, Chongqing, China
| | - Jing Yu
- Faculty of Psychology, Southwest University, Chongqing, China; Key Laboratory of Mental Health, Institute of Psychology, Chinese Academy of Sciences, Beijing, China.
| |
|
41
|
Masoomi Sefiddashti F, Asadpour S, Haddadi H, Ghanavati Nasab S. QSAR analysis of pyrimidine derivatives as VEGFR-2 receptor inhibitors to inhibit cancer using multiple linear regression and artificial neural network. Res Pharm Sci 2021; 16:596-611. [PMID: 34760008 PMCID: PMC8562410 DOI: 10.4103/1735-5362.327506] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 10/16/2020] [Revised: 04/13/2021] [Accepted: 09/22/2021] [Indexed: 11/15/2022] Open
Abstract
BACKGROUND AND PURPOSE In this study, the pharmacological activity of 33 furopyrimidine and thienopyrimidine compounds as anticancer vascular endothelial growth factor receptor 2 (VEGFR-2) inhibitors was investigated. The most important angiogenesis inducer is vascular endothelial growth factor (VEGF), which exerts its activity by binding to two tyrosine kinase receptors, VEGFR-1 and VEGFR-2. Because of the critical role of VEGF in pathological angiogenesis, it is a valuable therapeutic target for anti-angiogenesis therapies. EXPERIMENTAL APPROACH After the descriptors were calculated, 5 descriptors were selected using SPSS software and the stepwise selection method and were used for modeling with multiple linear regression (MLR) and an artificial neural network (ANN). The calibration and test series included 26 and 7 compounds, respectively. FINDINGS/RESULTS Model performance was evaluated using the R2, RMSE, and Q2 statistics. The R2 values of the MLR and ANN models were 0.889 and 0.998, respectively. The ANN model also had a lower RMSE and a higher Q2 than the MLR model. CONCLUSION AND IMPLICATIONS The results were evaluated by different statistical methods, and it was concluded that the nonlinear neural network method is a powerful tool for predicting the pharmacological activity of similar compounds, whereas, because of the complex and nonlinear relationships involved, MLR could not establish a good model with high predictive power.
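The three statistics named in the abstract (R2, RMSE, and Q2) can be sketched for the simplest possible case, a one-descriptor linear model. This is an illustrative sketch only: the abstract does not state which Q2 definition was used, so the leave-one-out form 1 - PRESS/SS_tot below is an assumption, and the function names and data values are hypothetical.

```python
import math


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x (one descriptor)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def r_squared(y, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, y_pred))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


def rmse(y, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((yi - pi) ** 2 for yi, pi in zip(y, y_pred)) / len(y))


def q_squared_loo(x, y):
    """Leave-one-out Q2 (assumed definition): each point is predicted by a
    model fitted to all other points, then scored like R2."""
    preds = []
    for i in range(len(x)):
        intercept, slope = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        preds.append(intercept + slope * x[i])
    return r_squared(y, preds)


# Hypothetical descriptor values and activities, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.1, 1.9, 3.2, 3.9, 5.1]
intercept, slope = fit_line(x, y)
fitted = [intercept + slope * xi for xi in x]
r2, error, q2 = r_squared(y, fitted), rmse(y, fitted), q_squared_loo(x, y)
```

For a least-squares fit, Q2 computed this way cannot exceed R2, which is why a large gap between the two is commonly read as a sign of overfitting.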
Affiliation(s)
| | - Saeid Asadpour
- Department of Chemistry, Faculty of Sciences, Shahrekord University, Shahrekord, I.R. Iran
| | - Hedayat Haddadi
- Department of Chemistry, Faculty of Sciences, Shahrekord University, Shahrekord, I.R. Iran
| | - Shima Ghanavati Nasab
- Department of Chemistry, Faculty of Sciences, Shahrekord University, Shahrekord, I.R. Iran
| |
|
42
|
Sajjadian M, Lam RW, Milev R, Rotzinger S, Frey BN, Soares CN, Parikh SV, Foster JA, Turecki G, Müller DJ, Strother SC, Farzan F, Kennedy SH, Uher R. Machine learning in the prediction of depression treatment outcomes: a systematic review and meta-analysis. Psychol Med 2021; 51:2742-2751. [PMID: 35575607 DOI: 10.1017/s0033291721003871] [Citation(s) in RCA: 48] [Impact Index Per Article: 12.0] [Indexed: 12/12/2022]
Abstract
BACKGROUND Multiple treatments are effective for major depressive disorder (MDD), but the outcomes of each treatment vary broadly among individuals. Accurate prediction of outcomes is needed to help select a treatment that is likely to work for a given person. We aim to examine the performance of machine learning methods in delivering replicable predictions of treatment outcomes. METHODS Of 7732 non-duplicate records identified through literature search, we retained 59 eligible reports and extracted data on sample, treatment, predictors, machine learning method, and treatment outcome prediction. A minimum sample size of 100 and an adequate validation method were used to identify adequate-quality studies. The effects of study features on prediction accuracy were tested with mixed-effects models. Fifty-four of the studies provided accuracy estimates or other estimates that allowed calculation of balanced accuracy of predicting outcomes of treatment. RESULTS Eight adequate-quality studies reported a mean accuracy of 0.63 [95% confidence interval (CI) 0.56-0.71], which was significantly lower than a mean accuracy of 0.75 (95% CI 0.72-0.78) in the other 46 studies. Among the adequate-quality studies, accuracies were higher when predicting treatment resistance (0.69) and lower when predicting remission (0.60) or response (0.56). The choice of machine learning method, feature selection, and the ratio of features to individuals were not associated with reported accuracy. CONCLUSIONS The negative relationship between study quality and prediction accuracy, combined with a lack of independent replication, invites caution when evaluating the potential of machine learning applications for personalizing the treatment of depression.
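Balanced accuracy, the metric the review calculated across studies, is the mean of sensitivity and specificity. The sketch below shows why it can differ sharply from plain accuracy when outcome classes are imbalanced; the function names and the confusion-matrix counts are hypothetical and are not taken from any of the reviewed studies.

```python
def accuracy(tp, fn, tn, fp):
    """Fraction of all cases classified correctly."""
    return (tp + tn) / (tp + fn + tn + fp)


def balanced_accuracy(tp, fn, tn, fp):
    """Mean of sensitivity (true positive rate) and specificity (true negative rate)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


# Hypothetical imbalanced sample: 20 responders, 80 non-responders.
counts = dict(tp=5, fn=15, tn=78, fp=2)
plain = accuracy(**counts)              # dominated by the majority class
balanced = balanced_accuracy(**counts)  # penalizes the missed responders
```

With these counts the plain accuracy is 0.83 while the balanced accuracy is about 0.61, which illustrates why a class-balanced metric is the safer basis for comparing studies with different outcome rates.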
Affiliation(s)
- Mehri Sajjadian
- Department of Psychiatry, Dalhousie University, Halifax, NS, Canada
| | - Raymond W Lam
- Department of Psychiatry, University of British Columbia, Vancouver, BC, Canada
| | - Roumen Milev
- Department of Psychiatry and Psychology, Queen's University, Providence Care Hospital, Kingston, ON, Canada
| | - Susan Rotzinger
- Department of Psychiatry, University of Toronto, Toronto, ON, Canada
- Department of Psychiatry, St. Michael's Hospital, University of Toronto, Toronto, Ontario, Canada
| | - Benicio N Frey
- Department of Psychiatry and Behavioural Neurosciences, McMaster University, Hamilton, ON, Canada
- Mood Disorders Program and Women's Health Concerns Clinic, St. Joseph's Healthcare Hamilton, Hamilton, ON, Canada
| | - Claudio N Soares
- Department of Psychiatry, Queen's University School of Medicine, Kingston, ON, Canada
| | - Sagar V Parikh
- Department of Psychiatry, University of Michigan, Ann Arbor, MI, USA
| | - Jane A Foster
- Department of Psychiatry & Behavioural Neurosciences, St. Joseph's Healthcare, Hamilton, ON, Canada
| | - Gustavo Turecki
- Department of Psychiatry, Douglas Institute, McGill University, Montreal, QC, Canada
| | - Daniel J Müller
- Campbell Family Mental Health Research Institute, Center for Addiction and Mental Health, Toronto, ON, Canada
- Department of Psychiatry, University of Toronto, Toronto, ON, Canada
| | - Stephen C Strother
- Baycrest and Department of Medical Biophysics, Rotman Research Center, University of Toronto, Toronto, ON, Canada
| | - Faranak Farzan
- eBrain Lab, School of Mechatronic Systems Engineering, Simon Fraser University, Surrey, BC, Canada
| | - Sidney H Kennedy
- Department of Psychiatry, University of Toronto, Toronto, ON, Canada
- Department of Psychiatry, St. Michael's Hospital, University of Toronto, Toronto, Ontario, Canada
- Department of Psychiatry, University Health Network, Toronto, ON, Canada
- Krembil Research Centre, University Health Network, University of Toronto, Toronto, ON, Canada
| | - Rudolf Uher
- Department of Psychiatry, Dalhousie University, Halifax, NS, Canada
| |
|
43
|
Ly QV, Nguyen XC, Lê NC, Truong TD, Hoang THT, Park TJ, Maqbool T, Pyo J, Cho KH, Lee KS, Hur J. Application of Machine Learning for eutrophication analysis and algal bloom prediction in an urban river: A 10-year study of the Han River, South Korea. THE SCIENCE OF THE TOTAL ENVIRONMENT 2021; 797:149040. [PMID: 34311376 DOI: 10.1016/j.scitotenv.2021.149040] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 04/20/2021] [Revised: 06/29/2021] [Accepted: 07/10/2021] [Indexed: 06/13/2023]
Abstract
The increasing release of nutrients to aquatic environments has led to great concern regarding eutrophication and the risk of unwanted algal blooms. Based on observational data of 20 water quality parameters measured monthly at 40 stations from 2011 to 2020, this study applied different Machine Learning (ML) algorithms to identify the best option for algal bloom prediction in the Han River, a large river in South Korea. Eight ML algorithms were grouped into statistical learning, regression, and deep learning families and were then compared for their suitability to predict the chlorophyll-derived trophic index (TSI-Chla). The ML algorithms helped identify the most important water quality parameters contributing to algal bloom prediction. The ML results confirmed that eutrophication and algal proliferation were governed by the complex interplay between nutrients (nitrogen and phosphorus), organic contaminants, and environmental factors. Of the models tested, the adaptive neuro-fuzzy inference system (ANFIS) performed best, with consistently superior predictions both quantitatively (i.e., via regression) and qualitatively (i.e., via classification), as evidenced by the lowest mean absolute error (MAE) of 0.09 and the highest F1-score, Recall and Precision of 0.97, 0.98 and 0.96, respectively. In a further step, a representative web application was constructed to help general users predict the trophic status of the Han River. This study demonstrated that ML techniques are not only promising for highly accurate water quality modeling of urban rivers but can also reduce the time and labor required for experiments by decreasing the number of water quality parameters that need to be monitored, while providing further insights into the driving factors of water quality deterioration. They ultimately help devise proactive strategies for sustainable water management.
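The evaluation metrics reported above (MAE for the regression task; Precision, Recall, and F1-score for the classification task) can be sketched as follows. This is a minimal illustration that assumes a binary classification setting, which the abstract does not confirm; the function names and all data values are hypothetical.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between observed and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall, and F1 for one class treated as 'positive'."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    # Guard against empty denominators by reporting 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical trophic-index values (regression) and bloom labels (classification).
mae = mean_absolute_error([2.0, 3.5, 4.0], [2.5, 3.0, 4.0])
precision, recall, f1 = precision_recall_f1(
    [1, 1, 1, 0, 0, 1, 0, 1],
    [1, 1, 0, 0, 1, 1, 0, 1],
)
```

F1 is the harmonic mean of precision and recall, so it stays low unless both are high; this makes it a useful single summary when bloom events are much rarer than non-bloom months.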
Affiliation(s)
- Quang Viet Ly
- Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen 518055, Guangdong, China
| | - Xuan Cuong Nguyen
- Laboratory of Energy and Environmental Science, Institute of Research and Development, Duy Tan University, Da Nang 550000, Vietnam; Faculty of Environmental and Chemical Engineering, Duy Tan University, Da Nang 550000, Vietnam
| | - Ngoc C Lê
- School of Applied Mathematics and Informatics, Hanoi University of Science and Technology, Hanoi 100000, Vietnam
| | - Tien-Dung Truong
- School of Applied Mathematics and Informatics, Hanoi University of Science and Technology, Hanoi 100000, Vietnam
| | - Thu-Huong T Hoang
- School of Environmental Science and Technology, Hanoi University of Science and Technology, Hanoi 100000, Vietnam.
| | - Tae Jun Park
- Department of Environment and Energy, Sejong University, Seoul 05006, South Korea
| | - Tahir Maqbool
- Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen 518055, Guangdong, China
| | - JongCheol Pyo
- Center for Environmental Data Strategy, Korea Environment Institute, Sejong 30147, South Korea
| | - Kyung Hwa Cho
- School of Urban and Environmental Engineering, Ulsan National Institute of Science and Technology, 50 UNIST-gil, Eonyang-eup, Ulju-gun, Ulsan 44919, South Korea
| | - Kwang-Sik Lee
- Korea Basic Science Institute, Yeongudanji-ro 162, Cheongwon-gu, Cheongju, Chungcheongbuk-do 28119, South Korea
| | - Jin Hur
- Department of Environment and Energy, Sejong University, Seoul 05006, South Korea.
| |
|
44
|
Differentiation of Cystic Fibrosis-Related Pathogens by Volatile Organic Compound Analysis with Secondary Electrospray Ionization Mass Spectrometry. Metabolites 2021; 11:metabo11110773. [PMID: 34822431 PMCID: PMC8617967 DOI: 10.3390/metabo11110773] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.8] [Received: 10/15/2021] [Revised: 11/05/2021] [Accepted: 11/08/2021] [Indexed: 12/02/2022] Open
Abstract
Identifying and differentiating bacteria based on their emitted volatile organic compounds (VOCs) opens vast opportunities for rapid diagnostics. Secondary electrospray ionization high-resolution mass spectrometry (SESI-HRMS) is an ideal technique for VOC-biomarker discovery because of its speed, sensitivity towards polar molecules and compound characterization possibilities. Here, an in vitro SESI-HRMS workflow to find biomarkers for cystic fibrosis (CF)-related pathogens P. aeruginosa, S. pneumoniae, S. aureus, H. influenzae, E. coli and S. maltophilia is described. From 180 headspace samples, the six pathogens are distinguishable in the first three principal components, and predictive analysis with a support vector machine algorithm using leave-one-out cross-validation yielded perfect accuracy scores for differentiating between the groups. Additionally, 94 distinctive features were found by recursive feature elimination and further characterized by SESI-MS/MS, which yielded 33 putatively identified biomarkers. In conclusion, the six pathogens can be distinguished in vitro based on their VOC profiles as well as the herein reported putative biomarkers. In the future, these putative biomarkers might be helpful for pathogen detection in vivo based on breath samples from patients with CF.
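Leave-one-out cross-validation, the validation scheme named in the abstract, holds out each sample in turn, trains on the rest, and scores the prediction for the held-out sample. The sketch below shows the loop itself. To stay dependency-free it swaps in a simple nearest-centroid classifier in place of the support vector machine used in the study, so it illustrates the validation scheme only; all names and feature values are hypothetical.

```python
def nearest_centroid_predict(train_x, train_y, sample):
    """Predict the label whose class centroid is closest to the sample.

    Stand-in classifier for illustration; the study used an SVM.
    """
    centroids = {}
    for label in set(train_y):
        rows = [x for x, y in zip(train_x, train_y) if y == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]

    def squared_distance(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    return min(centroids, key=lambda lab: squared_distance(centroids[lab], sample))


def leave_one_out_accuracy(features, labels):
    """Hold out each sample once, train on the others, and count correct predictions."""
    correct = 0
    for i in range(len(features)):
        train_x = features[:i] + features[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if nearest_centroid_predict(train_x, train_y, features[i]) == labels[i]:
            correct += 1
    return correct / len(features)


# Hypothetical two-feature intensities for two well-separated groups.
features = [[1.0, 1.1], [0.9, 1.0], [1.2, 0.8], [5.0, 5.1], [4.8, 5.2], [5.1, 4.9]]
labels = ["A", "A", "A", "B", "B", "B"]
loo_accuracy = leave_one_out_accuracy(features, labels)
```

Any feature selection step, such as recursive feature elimination, would need to be repeated inside each fold on the training samples only; selecting features on the full dataset first would leak information from the held-out sample into the model.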
|
45
|
Dalvie S, Chatzinakos C, Al Zoubi O, Georgiadis F, PGC-PTSD Systems Biology workgroup, Lancashire L, Daskalakis NP. From genetics to systems biology of stress-related mental disorders. Neurobiol Stress 2021; 15:100393. [PMID: 34584908 PMCID: PMC8456113 DOI: 10.1016/j.ynstr.2021.100393] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 03/04/2021] [Revised: 07/22/2021] [Accepted: 09/08/2021] [Indexed: 01/20/2023] Open
Abstract
Many individuals will be exposed to some form of traumatic stress in their lifetime which, in turn, increases the likelihood of developing stress-related disorders such as post-traumatic stress disorder (PTSD), major depressive disorder (MDD) and anxiety disorders (ANX). The development of these disorders is also influenced by genetics, with heritability estimates ranging between ∼30% and 70%. In this review, we provide an overview of the findings of genome-wide association studies for PTSD, depression and ANX, and we observe a clear genetic overlap between these three diagnostic categories. We go on to highlight the results from transcriptomic and epigenomic studies, and, given the multifactorial nature of stress-related disorders, we provide an overview of the gene-environment studies that have been conducted to date. Finally, we discuss systems biology approaches that are now seeing wider utility in determining a more holistic view of these complex disorders.
Affiliation(s)
- Shareefa Dalvie
- South African Medical Research Council (SAMRC), Unit on Risk & Resilience in Mental Disorders, Department of Psychiatry and Neuroscience Institute, University of Cape Town, Cape Town, South Africa
- South African Medical Research Council (SAMRC), Unit on Child & Adolescent Health, Department of Paediatrics and Child Health, University of Cape Town, Cape Town, South Africa
| | - Chris Chatzinakos
- Department of Psychiatry, McLean Hospital, Harvard Medical School, Belmont, USA
- Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, USA
| | - Obada Al Zoubi
- Department of Psychiatry, McLean Hospital, Harvard Medical School, Belmont, USA
- Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, USA
| | - Foivos Georgiadis
- Department of Psychiatry, McLean Hospital, Harvard Medical School, Belmont, USA
- Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, USA
| | | | - Lee Lancashire
- Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, USA
- Department of Data Science, Cohen Veterans Bioscience, New York, USA
| | - Nikolaos P. Daskalakis
- Department of Psychiatry, McLean Hospital, Harvard Medical School, Belmont, USA
- Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, USA
| |
|
46
|
Monaco A, Pantaleo E, Amoroso N, Lacalamita A, Lo Giudice C, Fonzino A, Fosso B, Picardi E, Tangaro S, Pesole G, Bellotti R. A primer on machine learning techniques for genomic applications. Comput Struct Biotechnol J 2021; 19:4345-4359. [PMID: 34429852 PMCID: PMC8365460 DOI: 10.1016/j.csbj.2021.07.021] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 05/07/2021] [Revised: 07/23/2021] [Accepted: 07/23/2021] [Indexed: 11/28/2022] Open
Abstract
High-throughput sequencing technologies have enabled the study of complex biological aspects at single-nucleotide resolution, opening the big data era. The analysis of large volumes of heterogeneous "omic" data, however, requires novel and efficient computational algorithms based on the paradigm of Artificial Intelligence. In the present review, we introduce and describe the most common machine learning methodologies and, more recently, deep learning methodologies applied to a variety of genomics tasks, emphasizing their capabilities, strengths and limitations in simple and intuitive language. We highlight the power of the machine learning approach in handling big data by means of a real-life example, and underline how the described methods could be relevant in all cases in which large amounts of multimodal genomic data are available.
Affiliation(s)
- Alfonso Monaco
- Istituto Nazionale di Fisica Nucleare (INFN), Sezione di Bari, Via A. Orabona 4, 70125 Bari, Italy
| | - Ester Pantaleo
- Dipartimento Interateneo di Fisica "M. Merlin", Università degli Studi di Bari "Aldo Moro", Via G. Amendola 173, 70125 Bari, Italy
| | - Nicola Amoroso
- Istituto Nazionale di Fisica Nucleare (INFN), Sezione di Bari, Via A. Orabona 4, 70125 Bari, Italy
- Dipartimento di Farmacia - Scienze del Farmaco, Università degli Studi di Bari "Aldo Moro", Via A. Orabona 4, 70125 Bari, Italy
| | - Antonio Lacalamita
- National Institute of Gastroenterology "S. de Bellis", Research Hospital, 70013 Castellana Grotte (Bari), Italy
| | - Claudio Lo Giudice
- Dipartimento di Bioscienze, Biotecnologie e Biofarmaceutica, Università degli Studi di Bari "Aldo Moro", Via A. Orabona 4, 70125 Bari, Italy
| | - Adriano Fonzino
- Dipartimento di Bioscienze, Biotecnologie e Biofarmaceutica, Università degli Studi di Bari "Aldo Moro", Via A. Orabona 4, 70125 Bari, Italy
| | - Bruno Fosso
- Istituto di Biomembrane, Bioenergetica e Biotecnologie Molecolari, Consiglio Nazionale delle Ricerche, Via G. Amendola 122/O, 70126 Bari, Italy
| | - Ernesto Picardi
- Dipartimento di Bioscienze, Biotecnologie e Biofarmaceutica, Università degli Studi di Bari "Aldo Moro", Via A. Orabona 4, 70125 Bari, Italy
- Istituto di Biomembrane, Bioenergetica e Biotecnologie Molecolari, Consiglio Nazionale delle Ricerche, Via G. Amendola 122/O, 70126 Bari, Italy
| | - Sabina Tangaro
- Istituto Nazionale di Fisica Nucleare (INFN), Sezione di Bari, Via A. Orabona 4, 70125 Bari, Italy
- Dipartimento di Scienze del Suolo, della Pianta e degli Alimenti, Università degli Studi di Bari "Aldo Moro", Bari, Via G. Amendola 165, 70125 Bari, Italy
| | - Graziano Pesole
- Dipartimento di Bioscienze, Biotecnologie e Biofarmaceutica, Università degli Studi di Bari "Aldo Moro", Via A. Orabona 4, 70125 Bari, Italy
- Istituto di Biomembrane, Bioenergetica e Biotecnologie Molecolari, Consiglio Nazionale delle Ricerche, Via G. Amendola 122/O, 70126 Bari, Italy
| | - Roberto Bellotti
- Istituto Nazionale di Fisica Nucleare (INFN), Sezione di Bari, Via A. Orabona 4, 70125 Bari, Italy
- Dipartimento Interateneo di Fisica "M. Merlin", Università degli Studi di Bari "Aldo Moro", Via G. Amendola 173, 70125 Bari, Italy
|
47
|
Two-stage improved Grey Wolf optimization algorithm for feature selection on high-dimensional classification. COMPLEX INTELL SYST 2021. [DOI: 10.1007/s40747-021-00452-4] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Indexed: 10/20/2022]
Abstract
In recent years, evolutionary algorithms have shown great advantages in the field of feature selection because of their simplicity and potential global search capability. However, most of the existing feature selection algorithms based on evolutionary computation are wrapper methods, which are computationally expensive, especially for high-dimensional biomedical data. To significantly reduce the computational cost, it is essential to study an effective evaluation method. In this paper, a two-stage improved gray wolf optimization (IGWO) algorithm for feature selection on high-dimensional data is proposed. In the first stage, a multilayer perceptron (MLP) network with group lasso regularization terms is first trained to construct an integer optimization problem using the proposed algorithm for pre-selection of features and optimization of the hidden layer structure. The dataset is compressed using the feature subset obtained in the first stage. In the second stage, a multilayer perceptron network with group lasso regularization terms is retrained using the compressed dataset, and the proposed algorithm is employed to construct the discrete optimization problem for feature selection. Meanwhile, a rapid evaluation strategy is constructed to mitigate the evaluation cost and improve the evaluation efficiency in the feature selection process. The effectiveness of the algorithm was analyzed on ten gene expression datasets. The experimental results show that the proposed algorithm not only removes almost more than 95.7% of the features in all datasets, but also has better classification accuracy on the test set. In addition, the advantages of the proposed algorithm in terms of time consumption, classification accuracy and feature subset size become more and more prominent as the dimensionality of the feature selection problem increases. This indicates that the proposed algorithm is particularly suitable for solving high-dimensional feature selection problems.
|
48
|
Morgenstern JD, Rosella LC, Costa AP, de Souza RJ, Anderson LN. Perspective: Big Data and Machine Learning Could Help Advance Nutritional Epidemiology. Adv Nutr 2021; 12:621-631. [PMID: 33606879 PMCID: PMC8166570 DOI: 10.1093/advances/nmaa183] [Citation(s) in RCA: 31] [Impact Index Per Article: 7.8] [Received: 06/09/2020] [Revised: 11/04/2020] [Accepted: 12/29/2020] [Indexed: 01/09/2023]
Abstract
The field of nutritional epidemiology faces challenges posed by measurement error, diet as a complex exposure, and residual confounding. The objective of this perspective article is to highlight how developments in big data and machine learning can help address these challenges. New methods of collecting 24-h dietary recalls and recording diet could enable larger samples and more repeated measures to increase statistical power and measurement precision. In addition, use of machine learning to automatically classify pictures of food could become a useful complementary method to help improve precision and validity of dietary measurements. Diet is complex due to thousands of different foods that are consumed in varying proportions, fluctuating quantities over time, and differing combinations. Current dietary pattern methods may not integrate sufficient dietary variation, and most traditional modeling approaches have limited incorporation of interactions and nonlinearity. Machine learning could help better model diet as a complex exposure with nonadditive and nonlinear associations. Last, novel big data sources could help avoid unmeasured confounding by offering more covariates, including both omics and features derived from unstructured data with machine learning methods. These opportunities notwithstanding, application of big data and machine learning must be approached cautiously to ensure quality of dietary measurements, avoid overfitting, and confirm accurate interpretations. Greater use of machine learning and big data would also require substantial investments in training, collaborations, and computing infrastructure. Overall, we propose that judicious application of big data and machine learning in nutrition science could offer new means of dietary measurement, more tools to model the complexity of diet and its relations with diseases, and additional potential ways of addressing confounding.
Affiliation(s)
- Jason D Morgenstern
- Department of Health Research Methods, Evidence, and Impact, McMaster University, Hamilton, Ontario, Canada
| | - Laura C Rosella
- Dalla Lana School of Public Health, University of Toronto, Toronto, Ontario, Canada
- Vector Institute, Toronto, Ontario, Canada
| | - Andrew P Costa
- Department of Health Research Methods, Evidence, and Impact, McMaster University, Hamilton, Ontario, Canada
| | - Russell J de Souza
- Department of Health Research Methods, Evidence, and Impact, McMaster University, Hamilton, Ontario, Canada
- Population Health Research Institute, Hamilton Health Sciences, Hamilton, Ontario, Canada
| | - Laura N Anderson
- Department of Health Research Methods, Evidence, and Impact, McMaster University, Hamilton, Ontario, Canada
|
49
|
Weight Feedback-Based Harmonic MDG-Ensemble Model for Prediction of Traffic Accident Severity. APPLIED SCIENCES-BASEL 2021. [DOI: 10.3390/app11115072] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/16/2022]
Abstract
Traffic accidents are emerging as a serious social problem in modern society, but if the severity of an accident is quickly grasped, countermeasures can be organized efficiently. To this end, the method proposed in this paper derives the MDG (Mean Decrease Gini) coefficient between variables to assess the severity of traffic accidents. Single models are designed to use the coefficient and the independent variables to determine and predict accident severity. The generated single models are fused using a weighted-voting-based bagging ensemble to account for various characteristics and avoid overfitting. The variables used for predicting accidents are classified as dependent or independent, and the variables that affect the severity of traffic accidents are predicted using the characteristics of causal relationships. Independent variables are classified as categorical and numerical variables. For this reason, a problem arises when the variation among dependent variables is imbalanced. Therefore, a harmonic average is applied to the weights to maintain the variables' balance and determine the average rate of change. Through this, it is possible to establish objective criteria for determining the severity of traffic accidents, thereby improving reliability.
|
50
|
Reproducible Evaluation of Diffusion MRI Features for Automatic Classification of Patients with Alzheimer's Disease. Neuroinformatics 2021; 19:57-78. [PMID: 32524428 DOI: 10.1007/s12021-020-09469-5] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Indexed: 10/24/2022]
Abstract
Diffusion MRI is the modality of choice to study alterations of white matter. In past years, various works have used diffusion MRI for automatic classification of Alzheimer's disease. However, classification performance obtained with different approaches is difficult to compare because of variations in components such as input data, participant selection, image preprocessing, feature extraction, feature rescaling (FR), feature selection (FS) and cross-validation (CV) procedures. Moreover, these studies are also difficult to reproduce because these different components are not readily available. In a previous work (Samper-González et al. 2018), we proposed an open-source framework for the reproducible evaluation of AD classification from T1-weighted (T1w) MRI and PET data. In the present paper, we first extend this framework to diffusion MRI data. Specifically, we add conversion of diffusion MRI ADNI data into the BIDS standard and pipelines for diffusion MRI preprocessing and feature extraction. We then apply the framework to compare different components. First, FS has a positive impact on classification results: the highest balanced accuracy (BA) improved from 0.76 to 0.82 for the task CN vs AD. Secondly, voxel-wise features generally give better performance than regional features. Fractional anisotropy (FA) and mean diffusivity (MD) provided comparable results for voxel-wise features. Moreover, we observe that the poor performance obtained in tasks involving MCI was potentially caused by the small data samples, rather than by the data imbalance. Furthermore, no extensive classification difference exists for different degrees of smoothing and registration methods. Besides, we demonstrate that using non-nested validation of FS leads to unreliable and over-optimistic results: 5% up to 40% relative increase in BA. Lastly, with proper FR and FS, the performance of diffusion MRI features is comparable to that of T1w MRI.
All the code of the framework and the experiments is publicly available: general-purpose tools have been integrated into the Clinica software package (www.clinica.run) and the paper-specific code is available at: https://github.com/aramis-lab/AD-ML.
|