1
Li Z, Li J, Li S, Wang Y, Wang J. Acute Myeloid Leukemia Genome Characterization Study and Subtype Classification Employing Feature Selection and Bayesian Networks. Biomedicines 2025; 13:1067. [PMID: 40426895 DOI: 10.3390/biomedicines13051067] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/18/2025] [Revised: 04/24/2025] [Accepted: 04/25/2025] [Indexed: 05/29/2025] Open
Abstract
Background: The precise diagnosis and classification of acute myeloid leukemia (AML) have important implications for clinical management and medical research. Methods: We investigated the expression of protein-coding genes in blood samples from AML patients and controls using The Cancer Genome Atlas (TCGA) and Genotype-Tissue Expression (GTEx) databases. Subsequently, we applied the least absolute shrinkage and selection operator (LASSO) feature selection method to select the optimal gene subset for distinguishing AML patients from controls, as well as a particular FAB subtype from the other AML subtypes. Results: Using the LASSO method, we identified a subset of 101 genes that could effectively distinguish between AML patients and control individuals; these genes included 70 up-regulated and 31 down-regulated genes in AML. Functional annotation and pathway analysis indicated the involvement of these genes in RNA-related pathways, which was also consistent with the epigenetic changes observed in AML. Survival analysis revealed that several genes are correlated with overall survival in AML patients. Additionally, LASSO-based gene subset analysis successfully revealed differences between certain AML subtypes, providing valuable insights into subtype-specific molecular mechanisms and differentiation therapy. Conclusions: This study demonstrated the application of machine learning in genomic data analysis for identifying gene subsets relevant to AML diagnosis and classification, which could aid in improving the understanding of the molecular landscape of AML. The identification of survival-related genes and subtype-specific markers may lead to the identification of novel targets for personalized medicine in the treatment of AML.
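To illustrate how an L1 penalty produces a gene subset, the following is a minimal sketch only, not the authors' pipeline. It assumes a small synthetic expression matrix rather than TCGA/GTEx data, fits a linear LASSO to a ±1 class label by cyclic coordinate descent (the abstract does not say which LASSO variant or software was used), and all function and variable names are hypothetical.

```python
import random


def soft_threshold(value, penalty):
    """Shrink value toward zero by penalty; exactly zero inside [-penalty, penalty]."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


def lasso_coordinate_descent(X, y, penalty, sweeps=200):
    """Minimise (1/2n)*||y - Xw||^2 + penalty*||w||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    residual = list(y)  # equals y - Xw while w is all zeros
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(sweeps):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # a constant-zero feature can never be selected
            # Correlation of feature j with the residual that excludes its own contribution.
            rho = sum(X[i][j] * (residual[i] + X[i][j] * w[j]) for i in range(n)) / n
            new_w = soft_threshold(rho, penalty) / col_sq[j]
            delta = new_w - w[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                w[j] = new_w
    return w


# Synthetic data (an assumption for illustration): 60 samples, 8 "genes".
# Gene 0 is shifted up and gene 1 shifted down in class +1; the rest are noise.
random.seed(0)
n_samples, n_genes = 60, 8
labels = [1.0 if i % 2 == 0 else -1.0 for i in range(n_samples)]
X = []
for label in labels:
    row = [random.gauss(0.0, 1.0) for _ in range(n_genes)]
    row[0] += 2.0 * label
    row[1] -= 2.0 * label
    X.append(row)

weights = lasso_coordinate_descent(X, labels, penalty=0.3)
selected = [j for j, w in enumerate(weights) if w != 0.0]
print("selected gene indices:", selected)
```

Genes with a non-zero weight form the selected subset, and the sign of each weight indicates whether the gene is higher or lower in the +1 class. In practice the penalty is usually tuned by cross-validation rather than fixed by hand as it is here.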
Affiliation(s)
- Zhenzhen Li
- Research & Development Institute of Northwestern Polytechnical University in Shenzhen, Shenzhen 518057, China
- Xi'an Key Laboratory of Stem Cell and Regenerative Medicine, Institute of Medical Research, Northwestern Polytechnical University, Xi'an 710072, China
- Jingwen Li
- Research & Development Institute of Northwestern Polytechnical University in Shenzhen, Shenzhen 518057, China
- Xi'an Key Laboratory of Stem Cell and Regenerative Medicine, Institute of Medical Research, Northwestern Polytechnical University, Xi'an 710072, China
- Sifan Li
- Research & Development Institute of Northwestern Polytechnical University in Shenzhen, Shenzhen 518057, China
- Xi'an Key Laboratory of Stem Cell and Regenerative Medicine, Institute of Medical Research, Northwestern Polytechnical University, Xi'an 710072, China
- Yangyang Wang
- School of Physics and Electronic Information, Yan'an University, Yan'an 716000, China
- Jihan Wang
- Yan'an Medical College, Yan'an University, Yan'an 716000, China
2
Idrisoglu A, Moraes ALD, Cheddad A, Anderberg P, Jakobsson A, Berglund JS. Vowel segmentation impact on machine learning classification for chronic obstructive pulmonary disease. Sci Rep 2025; 15:9930. [PMID: 40121302 PMCID: PMC11929820 DOI: 10.1038/s41598-025-95320-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/25/2024] [Accepted: 03/20/2025] [Indexed: 03/25/2025] Open
Abstract
Vowel-based voice analysis is gaining attention as a potential non-invasive tool for COPD classification, offering insights into phonatory function. The growing need for voice data has necessitated the adoption of various techniques, including segmentation, to augment existing datasets for training comprehensive Machine Learning (ML) models. This study aims to investigate the possible effects of segmenting the utterance of the vowel "a" on the performance of the ML classifiers CatBoost (CB), Random Forest (RF), and Support Vector Machine (SVM). This research involves training individual ML models using three distinct dataset constructions: full-sequence, segment-wise, and group-wise, derived from utterances of the vowel "a" comprising 1058 recordings from 48 participants. This approach comprehensively analyzes how each data categorization impacts the model's performance and results. A nested cross-validation (nCV) approach was implemented with grid search for hyperparameter optimization. This rigorous methodology was employed to minimize overfitting risks and maximize model performance. Compared to the full-sequence dataset, the findings indicate that the second segment yielded higher results within the four-segment category. Specifically, the CB model achieved superior accuracy, attaining 97.8% and 84.6% on the validation and test sets, respectively. The same category for the CB model also demonstrated the best balance regarding true positive rate (TPR) and true negative rate (TNR), making it the most clinically effective choice. These findings suggest that time-sensitive properties in vowel production are important for COPD classification and that segmentation can aid in capturing these properties. Despite these promising results, the dataset size and demographic homogeneity limit generalizability, highlighting areas for future research. Trial registration: The study is registered on clinicaltrials.gov with ID: NCT06160674.
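The nested cross-validation with grid search mentioned in this abstract can be sketched as follows. This is an illustrative sketch under stated assumptions, not the study's code: it uses synthetic two-dimensional data, a plain k-nearest-neighbour classifier in place of CatBoost, RF, or SVM, and hypothetical function names. The point is only the structure: the hyperparameter is chosen inside each outer training split, so the outer test fold never influences tuning.

```python
import random


def knn_predict(train, query, k):
    """Majority vote (labels 0/1) among the k nearest training points."""
    nearest = sorted(
        train, key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], query))
    )[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > len(nearest) else 0


def accuracy(train, test, k):
    """Fraction of test points classified correctly by k-NN fitted on train."""
    hits = sum(knn_predict(train, x, k) == y for x, y in test)
    return hits / len(test)


def k_folds(data, n_folds):
    """Yield (train, held_out) pairs; fold i holds out every n_folds-th item from i."""
    for i in range(n_folds):
        held_out = data[i::n_folds]
        train = [item for j, item in enumerate(data) if j % n_folds != i]
        yield train, held_out


def nested_cv(data, grid, outer=4, inner=3):
    """Return one (best_k, outer_test_accuracy) pair per outer fold."""
    outer_scores = []
    for outer_train, outer_test in k_folds(data, outer):
        # Inner loop: grid search that sees only the outer training portion.
        def inner_score(k):
            folds = k_folds(outer_train, inner)
            return sum(accuracy(tr, va, k) for tr, va in folds) / inner

        best_k = max(grid, key=inner_score)
        # Outer loop: evaluate the tuned model on data unseen during tuning.
        outer_scores.append((best_k, accuracy(outer_train, outer_test, best_k)))
    return outer_scores


# Synthetic two-class data (an assumption for illustration): 40 points per class.
random.seed(1)
data = [
    ([random.gauss(c * 2.0, 1.0), random.gauss(c * 2.0, 1.0)], c)
    for c in (0, 1)
    for _ in range(40)
]
random.shuffle(data)

results = nested_cv(data, grid=[1, 3, 5, 7])
mean_accuracy = sum(score for _, score in results) / len(results)
print(results, mean_accuracy)
```

When several recordings or segments come from the same participant, as in the dataset described above, folds are commonly formed per participant so that one person's recordings do not appear on both sides of a split; the sketch omits this, and the abstract does not state how the folds were built.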
Affiliation(s)
- Alper Idrisoglu
- Department of Health, Blekinge Institute of Technology, 371 41, Karlskrona, Sweden
- Abbas Cheddad
- Department of Health, Blekinge Institute of Technology, 371 41, Karlskrona, Sweden
- Institute of Computer Science, University of Tartu, Narva mnt 18, 51009, Tartu, Estonia
- Peter Anderberg
- Department of Health, Blekinge Institute of Technology, 371 41, Karlskrona, Sweden
3
Brinkac LM, Richetelli N, Davoren JM, Bever RA, Hicklin RA. DNAmix 2021: Laboratory policies, procedures, and casework scenarios summary and dataset. Data Brief 2023; 48:109150. [PMID: 37128591 PMCID: PMC10147962 DOI: 10.1016/j.dib.2023.109150] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/27/2023] [Revised: 04/05/2023] [Accepted: 04/06/2023] [Indexed: 05/03/2023] Open
Abstract
DNAmix 2021 is a large-scale study conducted to evaluate the extent of consistency and variation among forensic laboratories in the interpretation of DNA mixtures, and to assess the effects of various potential sources of variability. This study utilized a multi-phasic approach designed to collect information about participating laboratories, laboratory policies, and their standard operating procedures (SOPs). It also characterizes the degree of variation in assessments of suitability and number of contributors as well as in comparisons and statistical analyses of DNA mixture profiles. This paper specifically details the study design and the data collected in the first two phases of the study: the Policies & Procedures (P&P) Questionnaire and the Casework Scenarios Questionnaire (CSQ). We report on the variation in policies and SOPs for 86 forensic laboratories, including information about their DNA workflows, systems, and type of statistics reported. We also provide details regarding various case-scenario-specific decisions and the nature of mixture casework for 83 forensic laboratories. The data discussed in this article provide insight into the state of the field for forensic DNA mixture interpretation policies and SOPs at the time of the study (2021-2022).
4
B J, Hosahatti R, Koti PS, Devappa VH, Ngangkham U, Devanna P, Yadav MK, Mishra KK, Aditya JP, Boraiah PK, Gaber A, Hossain A. Phenotypic and Genotypic screening of fifty-two rice (Oryza sativa L.) genotypes for desirable cultivars against blast disease. PLoS One 2023; 18:e0280762. [PMID: 36897889 PMCID: PMC10004593 DOI: 10.1371/journal.pone.0280762] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/30/2022] [Accepted: 01/08/2023] [Indexed: 03/11/2023] Open
Abstract
Magnaporthe oryzae, the rice blast fungus, is one of the most dangerous rice pathogens, causing considerable crop losses around the world. To explore sources of rice blast resistance, we initially performed a large-scale screening of 277 rice accessions. In parallel with field evaluations, fifty-two rice accessions were genotyped for 25 major blast resistance genes utilizing functional/gene-based markers based on their reactivity against rice blast disease. According to the phenotypic examination, 29 (58%) and 22 (42%) entries were found to be highly resistant, 18 (36%) and 29 (57%) showed moderate resistance, and 5 (6%) and 1 (1%) were highly susceptible to leaf and neck blast, respectively. The genetic frequency of 25 major blast resistance genes ranged from 32 to 60%, with two genotypes having a maximum of 16 R-genes each. The 52 rice accessions were divided into two groups based on cluster and population structure analysis. The highly resistant and moderately resistant accessions were separated into different groups by the principal coordinate analysis. According to the analysis of molecular variance, the maximum diversity was found within the population, while the minimum diversity was found between the populations. Two markers (RM5647 and K39512), which correspond to the blast-resistant genes Pi36 and Pik, respectively, showed a significant association with neck blast disease, whereas three markers (Pi2-i, Pita3, and k2167), which correspond to the blast-resistant genes Pi2, Pita/Pita2, and Pikm, respectively, showed a significant association with leaf blast disease. The associated R-genes might be utilized in rice breeding programmes through marker-assisted breeding, and the identified resistant rice accessions could be used as prospective donors for the production of new resistant varieties in India and around the world.
Affiliation(s)
- Jeevan B
- ICAR-Vivekananda Parvatiya Krishi Anusandhan Sansthan, Almora, Uttarakhand, India
- Prasanna S Koti
- The University of Trans-Disciplinary Health Sciences and Technology, Jarakabande Kaval, Bengaluru, Karnataka, India
- Umakanta Ngangkham
- ICAR-Research Complex for North-Eastern Hill Region, Manipur centre, Imphal, Manipur, India
- Pramesh Devanna
- Rice Pathology Laboratory, AICRIP, Gangavathi, University of Agricultural Sciences, Raichur, Karnataka, India
- Manoj Kumar Yadav
- ICAR-Indian Agricultural Research Institute, Regional Station, Karnal, Haryana, India
- Krishna Kant Mishra
- ICAR-Vivekananda Parvatiya Krishi Anusandhan Sansthan, Almora, Uttarakhand, India
- Jay Prakash Aditya
- ICAR-Vivekananda Parvatiya Krishi Anusandhan Sansthan, Almora, Uttarakhand, India
- Palanna Kaki Boraiah
- Project Coordinating Unit, ICAR-AICRP on Small Millets, UAS, GKVK, Bengaluru, Karnataka, India
- Ahmed Gaber
- Department of Biology, College of Science, Taif University, Taif, Saudi Arabia
- Akbar Hossain
- Department of Agronomy, Bangladesh Wheat and Maize Research Institute, Dinajpur, Bangladesh
5
Pudjihartono N, Fadason T, Kempa-Liehr AW, O'Sullivan JM. A Review of Feature Selection Methods for Machine Learning-Based Disease Risk Prediction. Frontiers in Bioinformatics 2022; 2:927312. [PMID: 36304293 PMCID: PMC9580915 DOI: 10.3389/fbinf.2022.927312] [Citation(s) in RCA: 168] [Impact Index Per Article: 56.0] [Received: 04/24/2022] [Accepted: 06/03/2022] [Indexed: 01/14/2023] Open
Abstract
Machine learning has shown utility in detecting patterns within large, unstructured, and complex datasets. One of the promising applications of machine learning is in precision medicine, where disease risk is predicted using patient genetic data. However, creating an accurate prediction model based on genotype data remains challenging due to the so-called “curse of dimensionality” (i.e., extensively larger number of features compared to the number of samples). Therefore, the generalizability of machine learning models benefits from feature selection, which aims to extract only the most “informative” features and remove noisy “non-informative,” irrelevant and redundant features. In this article, we provide a general overview of the different feature selection methods, their advantages, disadvantages, and use cases, focusing on the detection of relevant features (i.e., SNPs) for disease risk prediction.
Affiliation(s)
- Tayaza Fadason
- Liggins Institute, University of Auckland, Auckland, New Zealand
- Maurice Wilkins Centre for Molecular Biodiscovery, Auckland, New Zealand
- Andreas W. Kempa-Liehr
- Department of Engineering Science, The University of Auckland, Auckland, New Zealand
- *Correspondence: Andreas W. Kempa-Liehr; Justin M. O'Sullivan
- Justin M. O'Sullivan
- Liggins Institute, University of Auckland, Auckland, New Zealand
- Maurice Wilkins Centre for Molecular Biodiscovery, Auckland, New Zealand
- MRC Lifecourse Epidemiology Unit, University of Southampton, Southampton, United Kingdom
- Singapore Institute for Clinical Sciences, Agency for Science, Technology and Research (A*STAR), Singapore, Singapore
- Australian Parkinson’s Mission, Garvan Institute of Medical Research, Sydney, NSW, Australia
- *Correspondence: Andreas W. Kempa-Liehr; Justin M. O'Sullivan
6
Use of a graph neural network to the weighted gene co-expression network analysis of Korean native cattle. Sci Rep 2022; 12:9854. [PMID: 35701465 PMCID: PMC9197844 DOI: 10.1038/s41598-022-13796-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/09/2021] [Accepted: 05/27/2022] [Indexed: 11/25/2022] Open
Abstract
In the general framework of the weighted gene co-expression network analysis (WGCNA), a hierarchical clustering algorithm is commonly used for module definition. However, hierarchical clustering depends strongly on the topological overlap measure. In other words, this algorithm may assign two genes with low topological overlap to different modules even though their expression patterns are similar. Here, a novel gene module clustering algorithm for WGCNA is proposed. We develop a gene module clustering network (gmcNet), which simultaneously addresses single-level expression and the topological overlap measure. The proposed gmcNet includes a “co-expression pattern recognizer” (CEPR) and a “module classifier”. The CEPR incorporates expression features of single genes into the topological features of co-expressed ones. Given this CEPR-embedded feature, the module classifier computes module assignment probabilities. We validated gmcNet performance using 4,976 genes from 20 native Korean cattle. We observed that the CEPR generates more robust features than single-level expression or the topological overlap measure. Given the CEPR-embedded feature, gmcNet achieved the best performance in terms of modularity (0.261) and the differentially expressed signal (27.739) compared with the other clustering methods tested. Furthermore, gmcNet detected some interesting biological functionalities for carcass weight, backfat thickness, intramuscular fat, and beef tenderness of Korean native cattle. Therefore, gmcNet is a useful framework for WGCNA module clustering.
7
Zou S, Tang Y, Xu Y, Ji J, Lu Y, Wang H, Li Q, Tang D. TuRLK1, a leucine-rich repeat receptor-like kinase, is indispensable for stripe rust resistance of YrU1 and confers broad resistance to multiple pathogens. BMC Plant Biology 2022; 22:280. [PMID: 35676630 PMCID: PMC9175386 DOI: 10.1186/s12870-022-03679-6] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 03/17/2022] [Accepted: 06/03/2022] [Indexed: 05/26/2023]
Abstract
BACKGROUND: YrU1 is a nucleotide-binding site (NBS) and leucine-rich repeat (LRR) protein (NLR) with additional ankyrin-repeat and WRKY domains, and it confers effective resistance to the stripe rust fungus Puccinia striiformis f. sp. tritici (Pst). YrU1 was recently positionally cloned in Triticum urartu, the progenitor species of the A genome of bread wheat. However, the molecular mechanism and components involved in YrU1-mediated resistance are not clear. RESULTS: In this study, RNA-seq analysis in T. urartu accession PI428309 showed that the transcript level of TuRLK1, which encodes a novel leucine-rich repeat receptor-like kinase, was up-regulated after inoculation with Pst in the presence of YrU1. TuRLK1 contained only a small number of LRR motifs and was localized in the plasma membrane. Transient expression of TuRLK1 induced a hypersensitive cell death response in N. benthamiana leaves. Silencing of TuRLK1 using the barley stripe mosaic virus (BSMV)-induced gene silencing (VIGS) system in PI428309, which contains YrU1, compromised the resistance against stripe rust caused by Pst CY33, indicating that TuRLK1 was required for YrU1-activated plant immunity. Furthermore, overexpression of TuRLK1 could enhance powdery mildew resistance in bread wheat and Arabidopsis thaliana after inoculation with the corresponding pathogens. CONCLUSIONS: Our study indicates that TuRLK1 is required for the immune response mediated by the unique NLR protein YrU1, and likely plays an important role in disease resistance to other pathogens.
Affiliation(s)
- Shenghao Zou
- State Key Laboratory of Ecological Control of Fujian-Taiwan Crop Pests, Key Laboratory of Ministry of Education for Genetics, Breeding and Multiple Utilization of Crops, Plant Immunity Center, Fujian Agriculture and Forestry University, Fuzhou, 350002, China
- Yansheng Tang
- State Key Laboratory of Ecological Control of Fujian-Taiwan Crop Pests, Key Laboratory of Ministry of Education for Genetics, Breeding and Multiple Utilization of Crops, Plant Immunity Center, Fujian Agriculture and Forestry University, Fuzhou, 350002, China
- Yang Xu
- State Key Laboratory of Ecological Control of Fujian-Taiwan Crop Pests, Key Laboratory of Ministry of Education for Genetics, Breeding and Multiple Utilization of Crops, Plant Immunity Center, Fujian Agriculture and Forestry University, Fuzhou, 350002, China
- Jiahao Ji
- State Key Laboratory of Ecological Control of Fujian-Taiwan Crop Pests, Key Laboratory of Ministry of Education for Genetics, Breeding and Multiple Utilization of Crops, Plant Immunity Center, Fujian Agriculture and Forestry University, Fuzhou, 350002, China
- Yuanyuan Lu
- State Key Laboratory of Ecological Control of Fujian-Taiwan Crop Pests, Key Laboratory of Ministry of Education for Genetics, Breeding and Multiple Utilization of Crops, Plant Immunity Center, Fujian Agriculture and Forestry University, Fuzhou, 350002, China
- Huanming Wang
- State Key Laboratory of Ecological Control of Fujian-Taiwan Crop Pests, Key Laboratory of Ministry of Education for Genetics, Breeding and Multiple Utilization of Crops, Plant Immunity Center, Fujian Agriculture and Forestry University, Fuzhou, 350002, China
- Qianqian Li
- State Key Laboratory of Ecological Control of Fujian-Taiwan Crop Pests, Key Laboratory of Ministry of Education for Genetics, Breeding and Multiple Utilization of Crops, Plant Immunity Center, Fujian Agriculture and Forestry University, Fuzhou, 350002, China
- Dingzhong Tang
- State Key Laboratory of Ecological Control of Fujian-Taiwan Crop Pests, Key Laboratory of Ministry of Education for Genetics, Breeding and Multiple Utilization of Crops, Plant Immunity Center, Fujian Agriculture and Forestry University, Fuzhou, 350002, China
8
Application of Systems Engineering Principles and Techniques in Biological Big Data Analytics: A Review. Processes (Basel) 2020. [DOI: 10.3390/pr8080951] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Indexed: 12/18/2022] Open
Abstract
In the past few decades, we have witnessed tremendous advancements in biology, life sciences and healthcare. These advancements are due in no small part to the big data made available by various high-throughput technologies, the ever-advancing computing power, and the algorithmic advancements in machine learning. Specifically, big data analytics such as statistical and machine learning has become an essential tool in these rapidly developing fields. As a result, the subject has drawn increased attention, and many review papers have been published on it in just the past few years. Unlike existing reviews, this work focuses on the application of systems engineering principles and techniques in addressing some of the common challenges in big data analytics for biological, biomedical and healthcare applications. Specifically, this review focuses on the following three key areas in biological big data analytics where systems engineering principles and techniques have been playing important roles: the principle of parsimony in addressing overfitting, the dynamic analysis of biological data, and the role of domain knowledge in biological data analytics.
9
Konstantinidis AΟ, Pardali D, Adamama-Moraitou KK, Gazouli M, Dovas CI, Legaki E, Brellou GD, Savvas I, Jergens AE, Rallis TS, Allenspach K. Colonic mucosal and serum expression of microRNAs in canine large intestinal inflammatory bowel disease. BMC Vet Res 2020; 16:69. [PMID: 32087719 PMCID: PMC7035774 DOI: 10.1186/s12917-020-02287-6] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.8] [Received: 06/02/2019] [Accepted: 02/13/2020] [Indexed: 12/15/2022] Open
Abstract
BACKGROUND: Canine inflammatory bowel disease (IBD) is a group of chronic gastrointestinal (GI) disorders of still largely unknown etiology. Canine IBD diagnosis is time-consuming and costly as other diseases with similar signs should be initially excluded. In human IBD, microRNA (miR) expression changes have been reported in GI mucosa and blood. Thus, there is a possibility that miRs may provide insight into disease pathogenesis, diagnosis and even treatment of canine IBD. The aim of this study was to determine the colonic mucosal and serum relative expression of a miRs panel in dogs with large intestinal IBD and healthy control dogs. RESULTS: Compared to healthy control dogs, dogs with large intestinal IBD showed significantly increased relative expression of miR-16, miR-21, miR-122 and miR-147 in the colonic mucosa and serum, while the relative expression of miR-185, miR-192 and miR-223 was significantly decreased. Relative expression of miR-146a was significantly increased only in the serum of dogs with large intestinal IBD. Furthermore, serum miR-192 and miR-223 relative expression correlated to disease activity and endoscopic score, respectively. CONCLUSION: Our data suggest the existence of dysregulated miRs expression patterns in canine IBD and support the potential future use of serum miRs as useful noninvasive biomarkers.
Affiliation(s)
- Alexandros Ο Konstantinidis
- Companion Animal Clinic (Medicine Unit), School of Veterinary Medicine, Faculty of Health Sciences, Aristotle University of Thessaloniki, Thessaloniki, Greece
- Dimitra Pardali
- Diagnostic Laboratory, School of Veterinary Medicine, Faculty of Health Sciences, Aristotle University of Thessaloniki, Thessaloniki, Greece
- Katerina K Adamama-Moraitou
- Companion Animal Clinic (Medicine Unit), School of Veterinary Medicine, Faculty of Health Sciences, Aristotle University of Thessaloniki, Thessaloniki, Greece
- Maria Gazouli
- Laboratory of Biology, School of Medicine, National and Kapodistrian University of Athens, Athens, Greece
- Chrysostomos I Dovas
- Diagnostic Laboratory, School of Veterinary Medicine, Faculty of Health Sciences, Aristotle University of Thessaloniki, Thessaloniki, Greece
- Evangelia Legaki
- Laboratory of Biology, School of Medicine, National and Kapodistrian University of Athens, Athens, Greece
- Georgia D Brellou
- Laboratory of Pathology, School of Veterinary Medicine, Faculty of Health Sciences, Aristotle University of Thessaloniki, Thessaloniki, Greece
- Ioannis Savvas
- Companion Animal Clinic (Anesthesia and Intensive Care Unit), School of Veterinary Medicine, Faculty of Health Sciences, Aristotle University of Thessaloniki, Thessaloniki, Greece
- Albert E Jergens
- Departments of Veterinary Clinical Sciences, Iowa State University College of Veterinary Medicine, Ames, IA, USA
- Timoleon S Rallis
- Companion Animal Clinic (Medicine Unit), School of Veterinary Medicine, Faculty of Health Sciences, Aristotle University of Thessaloniki, Thessaloniki, Greece
- Karin Allenspach
- Departments of Veterinary Clinical Sciences, Iowa State University College of Veterinary Medicine, Ames, IA, USA
10
Gaudillo J, Rodriguez JJR, Nazareno A, Baltazar LR, Vilela J, Bulalacao R, Domingo M, Albia J. Machine learning approach to single nucleotide polymorphism-based asthma prediction. PLoS One 2019; 14:e0225574. [PMID: 31800601 PMCID: PMC6892549 DOI: 10.1371/journal.pone.0225574] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.2] [Received: 12/04/2018] [Accepted: 11/07/2019] [Indexed: 12/31/2022] Open
Abstract
Machine learning (ML) is poised as a transformational approach uniquely positioned to discover the hidden biological interactions for better prediction and diagnosis of complex diseases. In this work, we integrated ML-based models for feature selection and classification to quantify the risk of individual susceptibility to asthma using single nucleotide polymorphisms (SNPs). Random forest (RF) and recursive feature elimination (RFE) algorithms were implemented to identify the SNPs with high implication to asthma. K-nearest neighbor (kNN) and support vector machine (SVM) algorithms were trained to classify whether the identified SNPs were associated with non-asthmatic or asthmatic samples. The feature selection step showed that RF outperformed RFE and that the feature importance score derived from RF was consistently high for a subset of SNPs, indicating the robustness of RF in selecting relevant features associated with asthma. Model comparison showed that the integration of RF-SVM obtained the highest model performance with an accuracy, precision, and sensitivity of 62.5%, 65.3%, and 69%, respectively, when compared to the baseline, RF-kNN, and an external MeanDiff-kNN model. Furthermore, results show that the occurrence of asthma can be predicted with an Area under the Curve (AUC) of 0.62 and 0.64 for RF-SVM and RF-kNN models, respectively. This study demonstrates the integration of ML models to augment traditional methods in predicting genetic predisposition to multifactorial diseases such as asthma.
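For readers comparing the accuracy, precision, and sensitivity figures quoted above, the sketch below shows how these quantities follow from a binary confusion matrix. The counts are invented for illustration and are not taken from the study; the function name is hypothetical.

```python
def classification_metrics(tp, fp, fn, tn):
    """Standard binary metrics from confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,     # share of all samples classified correctly
        "precision": tp / (tp + fp),       # share of predicted positives that are true
        "sensitivity": tp / (tp + fn),     # true positive rate (TPR, recall)
        "specificity": tn / (tn + fp),     # true negative rate (TNR)
    }


# Illustrative counts only: 30 positive and 30 negative samples.
metrics = classification_metrics(tp=18, fp=6, fn=12, tn=24)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Because accuracy mixes both classes, a model can score well on it while missing many positives, which is why sensitivity and precision are reported alongside it.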
Affiliation(s)
- Joverlyn Gaudillo
- Institute of Mathematical Sciences and Physics, University of the Philippines Los Baños, Philippines
- Computational Interdisciplinary Research Laboratories (CINTERLabs), University of the Philippines Los Baños, Philippines
- Jae Joseph Russell Rodriguez
- Genetics and Molecular Biology Division, Institute of Biological Sciences, University of the Philippines Los Baños, Philippines
- Allen Nazareno
- Institute of Mathematical Sciences and Physics, University of the Philippines Los Baños, Philippines
- Lei Rigi Baltazar
- Institute of Mathematical Sciences and Physics, University of the Philippines Los Baños, Philippines
- Computational Interdisciplinary Research Laboratories (CINTERLabs), University of the Philippines Los Baños, Philippines
- Julianne Vilela
- Philippine Genome Center Program for Agriculture, Office of the Vice Chancellor for Research and Extension, University of the Philippines Los Baños, Philippines
- Rommel Bulalacao
- Domingo Artificial Intelligence Research Center, Los Baños, Philippines
- Mario Domingo
- Domingo Artificial Intelligence Research Center, Los Baños, Philippines
- Jason Albia
- Institute of Mathematical Sciences and Physics, University of the Philippines Los Baños, Philippines
- Computational Interdisciplinary Research Laboratories (CINTERLabs), University of the Philippines Los Baños, Philippines
11
König IR. Presidential address: Six open questions to genetic epidemiologists. Genet Epidemiol 2019; 43:242-249. [PMID: 30659680 PMCID: PMC6590280 DOI: 10.1002/gepi.22191] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Received: 11/02/2018] [Revised: 12/18/2018] [Accepted: 01/06/2019] [Indexed: 01/03/2023]
Abstract
Given the rapid pace with which genomics and other ‐omics disciplines are evolving, it is sometimes necessary to shift down a gear to consider more general scientific questions. In this line, in my presidential address I formulate six questions for genetic epidemiologists to ponder on. These cover the areas of reproducibility, statistical significance, chance findings, precision medicine and related fields such as bioinformatics and data science. Possible hints at responses are presented to foster our further discussion of these topics.
Affiliation(s)
- Inke R König
- Institut für Medizinische Biometrie und Statistik, Universität zu Lübeck, Universitätsklinikum Schleswig-Holstein, Lübeck, Germany
12
Blangero J, Teslovich TM, Sim X, Almeida MA, Jun G, Dyer TD, Johnson M, Peralta JM, Manning A, Wood AR, Fuchsberger C, Kent JW, Aguilar DA, Below JE, Farook VS, Arya R, Fowler S, Blackwell TW, Puppala S, Kumar S, Glahn DC, Moses EK, Curran JE, Thameem F, Jenkinson CP, DeFronzo RA, Lehman DM, Hanis C, Abecasis G, Boehnke M, Göring H, Duggirala R, Almasy L. Omics-squared: human genomic, transcriptomic and phenotypic data for genetic analysis workshop 19. BMC Proc 2016; 10:71-77. [PMID: 27980614 PMCID: PMC5133484 DOI: 10.1186/s12919-016-0008-y] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.6] [Indexed: 11/10/2022] Open
Abstract
Background: The Genetic Analysis Workshops (GAW) are a forum for development, testing, and comparison of statistical genetic methods and software. Each contribution to the workshop includes an application to a specified data set. Here we describe the data distributed for GAW19, which focused on analysis of human genomic and transcriptomic data. Methods: GAW19 data were donated by the T2D-GENES Consortium and the San Antonio Family Heart Study and included whole genome and exome sequences for odd-numbered autosomes, measures of gene expression, systolic and diastolic blood pressures, and related covariates in two Mexican American samples. These two samples were a collection of 20 large families with whole genome sequence and transcriptomic data and a set of 1943 unrelated individuals with exome sequence. For each sample, simulated phenotypes were constructed based on the real sequence data. ‘Functional’ genes and variants for the simulations were chosen based on observed correlations between gene expression and blood pressure. The simulations focused primarily on additive genetic models but also included a genotype-by-medication interaction. A total of 245 genes were designated as ‘functional’ in the simulations, with a few genes of large effect and most genes explaining <1% of the trait variation. An additional phenotype, Q1, was simulated to be correlated among related individuals, based on theoretical or empirical kinship matrices, but was not associated with any sequence variants. Two hundred replicates of the phenotypes were simulated. The GAW19 data are an expansion of the data used at GAW18, which included the family-based whole genome sequence, blood pressure, and simulated phenotypes, but not the gene expression data or the set of 1943 unrelated individuals with exome sequence.
Affiliation(s)
- John Blangero
- South Texas Diabetes and Obesity Institute, University of Texas Rio Grande Valley, Harlingen, TX 78550 USA
- Tanya M Teslovich
- Department of Biostatistics, Center for Statistical Genetics, University of Michigan, Ann Arbor, MI 48109 USA
- Xueling Sim
- Department of Biostatistics, Center for Statistical Genetics, University of Michigan, Ann Arbor, MI 48109 USA
- Marcio A Almeida
- South Texas Diabetes and Obesity Institute, University of Texas Rio Grande Valley, Harlingen, TX 78550 USA
- Goo Jun
- Department of Biostatistics, Center for Statistical Genetics, University of Michigan, Ann Arbor, MI 48109 USA; Department of Epidemiology, Human Genetics and Environmental Sciences, University of Texas Health Science Center at Houston, Houston, TX 77030 USA
- Thomas D Dyer
- South Texas Diabetes and Obesity Institute, University of Texas Rio Grande Valley, Harlingen, TX 78550 USA
- Matthew Johnson
- South Texas Diabetes and Obesity Institute, University of Texas Rio Grande Valley, Harlingen, TX 78550 USA
- Juan M Peralta
- South Texas Diabetes and Obesity Institute, University of Texas Rio Grande Valley, Harlingen, TX 78550 USA
- Alisa Manning
- Department of Genetics, Massachusetts General Hospital, Boston, MA 02114 USA
- Andrew R Wood
- Genetics of Complex Traits, Peninsula College of Medicine and Dentistry, University of Exeter, Exeter, UK
- Christian Fuchsberger
- Department of Biostatistics, Center for Statistical Genetics, University of Michigan, Ann Arbor, MI 48109 USA
- Jack W Kent
- Department of Genetics, Texas Biomedical Research Institute, 7620 NW Loop 410, San Antonio, TX 78227 USA
- David A Aguilar
- Cardiovascular Division, Baylor College of Medicine, Houston, TX 77030 USA
- Jennifer E Below
- Department of Epidemiology, Human Genetics and Environmental Sciences, University of Texas Health Science Center at Houston, Houston, TX 77030 USA
- Vidya S Farook
- South Texas Diabetes and Obesity Institute, University of Texas Rio Grande Valley, Harlingen, TX 78550 USA
- Rector Arya
- South Texas Diabetes and Obesity Institute, University of Texas Rio Grande Valley, Harlingen, TX 78550 USA
- Sharon Fowler
- Division of Clinical Epidemiology, Department of Medicine, University of Texas Health Science Center at San Antonio, San Antonio, TX 78229 USA
- Tom W Blackwell
- Department of Biostatistics, Center for Statistical Genetics, University of Michigan, Ann Arbor, MI 48109 USA
- Sobha Puppala
- Department of Genetics, Texas Biomedical Research Institute, 7620 NW Loop 410, San Antonio, TX 78227 USA
- Satish Kumar
- South Texas Diabetes and Obesity Institute, University of Texas Rio Grande Valley, Harlingen, TX 78550 USA
- David C Glahn
- Department of Psychiatry, Yale University, New Haven, CT 06106 USA
- Eric K Moses
- Centre for Genetic Origins of Health and Disease, University of Western Australia, Crawley, Australia
- Joanne E Curran
- South Texas Diabetes and Obesity Institute, University of Texas Rio Grande Valley, Harlingen, TX 78550 USA
- Farook Thameem
- Department of Biochemistry, Faculty of Medicine, Kuwait University, Safat, Kuwait City, 13110 Kuwait
- Christopher P Jenkinson
- South Texas Diabetes and Obesity Institute, University of Texas Rio Grande Valley, Harlingen, TX 78550 USA
- Ralph A DeFronzo
- Texas Diabetes Institute, University of Texas Health Science Center at San Antonio, San Antonio, TX 78229 USA
- Donna M Lehman
- Division of Clinical Epidemiology, Department of Medicine, University of Texas Health Science Center at San Antonio, San Antonio, TX 78229 USA
- Craig Hanis
- Department of Epidemiology, Human Genetics and Environmental Sciences, University of Texas Health Science Center at Houston, Houston, TX 77030 USA
- Goncalo Abecasis
- Department of Biostatistics, Center for Statistical Genetics, University of Michigan, Ann Arbor, MI 48109 USA
- Michael Boehnke
- Department of Biostatistics, Center for Statistical Genetics, University of Michigan, Ann Arbor, MI 48109 USA
- Harald Göring
- South Texas Diabetes and Obesity Institute, University of Texas Rio Grande Valley, Harlingen, TX 78550 USA
- Ravindranath Duggirala
- South Texas Diabetes and Obesity Institute, University of Texas Rio Grande Valley, Harlingen, TX 78550 USA
- Laura Almasy
- South Texas Diabetes and Obesity Institute, University of Texas Rio Grande Valley, Harlingen, TX 78550 USA; Department of Genetics, University of Pennsylvania, Philadelphia, PA 19104 USA
|
13
|
Engelman CD, Greenwood CMT, Bailey JN, Cantor RM, Kent JW, König IR, Bermejo JL, Melton PE, Santorico SA, Schillert A, Wijsman EM, MacCluer JW, Almasy L. Genetic Analysis Workshop 19: methods and strategies for analyzing human sequence and gene expression data in extended families and unrelated individuals. BMC Proc 2016; 10:67-70. [PMID: 27980613 PMCID: PMC5133501 DOI: 10.1186/s12919-016-0007-z] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Indexed: 11/10/2022]
Abstract
Genetic Analysis Workshop 19 provided a platform for developing and evaluating statistical methods to analyze whole-genome sequence and gene expression data from a pedigree-based sample, as well as whole-exome sequence data from a large cohort of unrelated individuals. In this article we present an overview of the data sets, the GAW experience, and summaries of the contributions arranged into nine methodological themes.
Affiliation(s)
- Corinne D. Engelman
- Department of Population Health Sciences, School of Medicine and Public Health, University of Wisconsin, 610 Walnut Street, 707 WARF, Madison, WI 53726 USA
- Celia M. T. Greenwood
- Lady Davis Institute for Medical Research, Jewish General Hospital, 3755 Côte Ste. Catherine, Montreal, QC H3T 1E2 Canada
- Julia N. Bailey
- Department of Epidemiology, University of California Los Angeles Fielding School of Public Health, Box 951772, Los Angeles, CA 90095 USA
- Rita M. Cantor
- Department of Human Genetics, David Geffen School of Medicine at UCLA, 695 Charles E. Young Dr, South, Los Angeles, CA 90024-7088 USA
- Jack W. Kent
- Department of Genetics, Texas Biomedical Research Institute, PO Box 760549, San Antonio, TX 78245-0549 USA
- Inke R. König
- Institut für Medizinische Biometrie und Statistik, Universität zu Lübeck, Universitätsklinikum Schleswig-Holstein, Campus Lübeck, Lübeck, Germany
- Justo Lorenzo Bermejo
- Statistical Genetics Group, Institute of Medical Biometry and Informatics, University of Heidelberg, Im Neuenheimer Feld 305, 69120 Heidelberg, Germany
- Phillip E. Melton
- Centre for Genetic Origins of Health and Disease, University of Western Australia, Perth, WA Australia
- Stephanie A. Santorico
- Department of Mathematical & Statistical Sciences, University of Colorado-Denver, PO Box 173364, Denver, CO 80204 USA
- Arne Schillert
- Institut für Medizinische Biometrie und Statistik, Universität zu Lübeck, Ratzeburger Allee 160, 23562 Lübeck, Germany
- Ellen M. Wijsman
- Department of Medicine, Department of Biostatistics, Division of Medical Genetics, University of Washington, Seattle, WA 98195 USA
- Jean W. MacCluer
- Department of Genetics, Texas Biomedical Research Institute, PO Box 760549, San Antonio, TX 78245-0549 USA
- Laura Almasy
- South Texas Diabetes and Obesity Institute, University of Texas Rio Grande Valley, San Antonio, TX 78229 USA
|