1
Gu B, Desai RJ, Lin KJ, Yang J. Probabilistic medical predictions of large language models. NPJ Digit Med 2024; 7:367. [PMID: 39702641 DOI: 10.1038/s41746-024-01366-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/11/2024] [Accepted: 12/02/2024] [Indexed: 12/21/2024] Open
Abstract
Large Language Models (LLMs) have shown promise in clinical applications through prompt engineering, allowing flexible clinical predictions. However, they struggle to produce reliable prediction probabilities, which are crucial for transparency and decision-making. While explicit prompts can lead LLMs to generate probability estimates, their numerical reasoning limitations raise concerns about reliability. We compared explicit probabilities from text generation to implicit probabilities derived from the likelihood of predicting the correct label token. Across six advanced open-source LLMs and five medical datasets, explicit probabilities consistently underperformed implicit probabilities in discrimination, precision, and recall. This discrepancy is more pronounced with smaller LLMs and imbalanced datasets, highlighting the need for cautious interpretation, improved probability estimation methods, and further research for clinical use of LLMs.
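The implicit probability described in this abstract can be illustrated with a minimal sketch. The abstract only says the probability is derived from the likelihood of the label token; renormalizing over the candidate label tokens with a softmax, the function name `implicit_probability`, and the example logits are all assumptions made here for illustration, not the authors' implementation.

```python
import math

def implicit_probability(label_logits, positive_label):
    """Softmax restricted to the candidate label tokens (an assumed normalization).

    label_logits: dict mapping each label token to its next-token logit.
    Returns the normalized probability of positive_label.
    """
    # Subtract the maximum logit for numerical stability before exponentiating.
    m = max(label_logits.values())
    exp_scores = {tok: math.exp(v - m) for tok, v in label_logits.items()}
    total = sum(exp_scores.values())
    return exp_scores[positive_label] / total

# Hypothetical logits for the two label tokens of a binary clinical question.
logits = {"yes": 2.0, "no": 0.0}
p_yes = implicit_probability(logits, "yes")
print(f"implicit P(yes) = {p_yes:.3f}")
```

An explicit probability, by contrast, would be a number parsed from the generated text (for example a reply such as "80%"), which is the quantity the study reports as less reliable.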
Affiliation(s)
- Bowen Gu
  - Division of Pharmacoepidemiology and Pharmacoeconomics, Department of Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
- Rishi J Desai
  - Division of Pharmacoepidemiology and Pharmacoeconomics, Department of Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
- Kueiyu Joshua Lin
  - Division of Pharmacoepidemiology and Pharmacoeconomics, Department of Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
- Jie Yang
  - Division of Pharmacoepidemiology and Pharmacoeconomics, Department of Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
  - Harvard Data Science Initiative, Harvard University, Cambridge, MA, USA
  - Broad Institute of MIT and Harvard, Cambridge, MA, USA
  - Kempner Institute for the Study of Natural and Artificial Intelligence, Harvard University, Cambridge, MA, USA
2
Hoyos W, Hoyos K, Ruiz R, Aguilar J. An explainable analysis of diabetes mellitus using statistical and artificial intelligence techniques. BMC Med Inform Decis Mak 2024; 24:383. [PMID: 39695649 DOI: 10.1186/s12911-024-02810-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/12/2023] [Accepted: 12/06/2024] [Indexed: 12/20/2024] Open
Abstract
BACKGROUND Diabetes mellitus (DM) is a chronic disease prevalent worldwide, requiring a multifaceted analytical approach to improve early detection and subsequent mitigation of morbidity and mortality rates. This research aimed to develop an explainable analysis of DM by combining sociodemographic and clinical data with statistical and artificial intelligence (AI) techniques. METHODS Leveraging a small dataset that includes sociodemographic and clinical profiles of diabetic and non-diabetic individuals, we employed a diverse set of statistical and AI models for predictive purposes and assessment of DM risk factors. The statistical tests used were Student's t-test and Chi-square, while the AI techniques were fuzzy cognitive maps (FCM), artificial neural networks (ANN), support vector machines (SVM), and XGBoost. RESULTS Our statistical models facilitated an in-depth exploration of variable associations, while the resulting AI models demonstrated exceptional efficacy in DM classification. In particular, the XGBoost model showed superior performance in accuracy, sensitivity and specificity with values of 1 for each of these metrics. On the other hand, the FCM stood out for its explainability capabilities by allowing an analysis of the variables involved in the prediction using scenario-based simulations. CONCLUSIONS An integrated analysis of DM using a variety of methodologies is critical for timely detection of the disease and informed clinical decision-making.
Affiliation(s)
- William Hoyos
  - Grupo de Investigación ISI, Universidad Cooperativa de Colombia, Montería, Colombia
  - Grupo de Investigación en I+D+i en TIC, Universidad EAFIT, Medellín, Colombia
  - GIMBIC, Universidad de Córdoba, Montería, Colombia
- Kenia Hoyos
  - Laboratorio Clínico Humano, Clínica Salud Social, Sincelejo, Colombia
- Rander Ruiz
  - Grupo de Investigación Interdisciplinario del Bajo Cauca y Sur de Córdoba, Universidad de Antioquia, Medellín, Colombia
- Jose Aguilar
  - Grupo de Investigación en I+D+i en TIC, Universidad EAFIT, Medellín, Colombia
  - CEMISID, Universidad de Los Andes, Merida, Venezuela
  - IMDEA Networks Institute, Madrid, Spain
3
Chuwdhury GS, Guo Y, Chiang CL, Lam KO, Kam NW, Liu Z, Dai W. ImmuneMirror: A machine learning-based integrative pipeline and web server for neoantigen prediction. Brief Bioinform 2024; 25:bbae024. [PMID: 38343325 PMCID: PMC10859690 DOI: 10.1093/bib/bbae024] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/08/2023] [Revised: 12/05/2023] [Accepted: 01/16/2024] [Indexed: 02/15/2024] Open
Abstract
Neoantigens are derived from somatic mutations in the tumors but are absent in normal tissues. Emerging evidence suggests that neoantigens can stimulate tumor-specific T-cell-mediated antitumor immune responses, and therefore are potential immunotherapeutic targets. We developed ImmuneMirror as a stand-alone open-source pipeline and a web server incorporating a balanced random forest model for neoantigen prediction and prioritization. The prediction model was trained and tested using known immunogenic neopeptides collected from 19 published studies. The area under the curve of our trained model was 0.87 based on the testing data. We applied ImmuneMirror to the whole-exome sequencing and RNA sequencing data obtained from gastrointestinal tract cancers including 805 tumors from colorectal cancer (CRC), esophageal squamous cell carcinoma (ESCC) and hepatocellular carcinoma patients. We discovered a subgroup of microsatellite instability-high (MSI-H) CRC patients with a low neoantigen load but a high tumor mutation burden (> 10 mutations per Mbp). Although the efficacy of PD-1 blockade has been demonstrated in advanced MSI-H patients, almost half of such patients do not respond well. Our study identified a subset of MSI-H patients who may not benefit from this treatment with lower neoantigen load for major histocompatibility complex I (P < 0.0001) and II (P = 0.0008) molecules, respectively. Additionally, the neopeptide YMCNSSCMGV-TP53G245V, derived from a hotspot mutation restricted by HLA-A02, was identified as a potential actionable target in ESCC. This is so far the largest study to comprehensively evaluate neoantigen prediction models using experimentally validated neopeptides. Our results demonstrate the reliability and effectiveness of ImmuneMirror for neoantigen prediction.
Affiliation(s)
- Gulam Sarwar Chuwdhury
  - Department of Clinical Oncology, Center of Cancer Medicine, School of Clinical Medicine, Li Ka Shing Faculty of Medicine, University of Hong Kong, Hong Kong (SAR), P. R. China
- Yunshan Guo
  - Department of Biostatistics, Yale School of Public Health, New Haven, Connecticut, USA
- Chi-Leung Chiang
  - Department of Clinical Oncology, Center of Cancer Medicine, School of Clinical Medicine, Li Ka Shing Faculty of Medicine, University of Hong Kong, Hong Kong (SAR), P. R. China
- Ka-On Lam
  - Department of Clinical Oncology, Center of Cancer Medicine, School of Clinical Medicine, Li Ka Shing Faculty of Medicine, University of Hong Kong, Hong Kong (SAR), P. R. China
- Ngar-Woon Kam
  - Department of Clinical Oncology, Center of Cancer Medicine, School of Clinical Medicine, Li Ka Shing Faculty of Medicine, University of Hong Kong, Hong Kong (SAR), P. R. China
  - Laboratory for Synthetic Chemistry and Chemical Biology Limited, Hong Kong Science Park, Shatin, Hong Kong
- Zhonghua Liu
  - Department of Biostatistics, Columbia University, New York, NY, USA
- Wei Dai
  - Department of Clinical Oncology, Center of Cancer Medicine, School of Clinical Medicine, Li Ka Shing Faculty of Medicine, University of Hong Kong, Hong Kong (SAR), P. R. China
  - University of Hong Kong-Shenzhen Hospital, Shenzhen, P. R. China
4
Son Y, Chung J. Risk Factor Analysis of Cryopreserved Autologous Bone Flap Resorption in Adult Patients Undergoing Cranioplasty with Volumetry Measurement Using Conventional Statistics and Machine-Learning Technique. J Korean Neurosurg Soc 2024; 67:103-114. [PMID: 37709548 PMCID: PMC10788544 DOI: 10.3340/jkns.2023.0143] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 07/06/2023] [Revised: 08/29/2023] [Accepted: 09/13/2023] [Indexed: 09/16/2023] Open
Abstract
OBJECTIVE Decompressive craniectomy (DC) with duroplasty is one of the common surgical treatments for life-threatening increased intracranial pressure (ICP). Once ICP is controlled, cranioplasty (CP) with reinsertion of the cryopreserved autologous bone flap or a synthetic implant is considered for protection and esthetics. Despite the risk of autologous bone flap resorption (BFR), the cryopreserved autologous bone flap remains an important material for CP because of its cost-effectiveness. In this article, we performed conventional statistical analysis and a machine learning technique to understand the risk factors for BFR. METHODS Patients aged >18 years who underwent autologous bone CP between January 2015 and December 2021 were reviewed. Demographic data, medical records, and volumetric measurements of the autologous bone flap from 94 patients were collected. BFR was defined with an absolute quantitative method (BFR-A) and a relative quantitative method (BFR%). Conventional statistical analysis and random forest with hyper-ensemble approach (RF with HEA) were performed, and overlapped partial dependence plots (PDP) were generated. RESULTS Conventional statistical analysis showed that only the initial autologous bone flap volume was statistically significant for BFR-A. RF with HEA showed that the initial autologous bone flap volume, the interval between DC and CP, and bone quality were the factors contributing most to BFR-A, while trauma, bone quality, and initial autologous bone flap volume were the factors contributing most to BFR%. Overlapped PDPs of the initial autologous bone flap volume on BFR-A crossed at approximately 60 mL, and a relatively clear separation was found between the non-BFR and BFR groups. Therefore, an initial autologous bone flap volume of over 60 mL could be a possible risk factor for BFR. CONCLUSION From the present study, BFR in patients who underwent CP with autologous bone flap might be inevitable. However, the degree of BFR may differ from one patient to another. Therefore, considering artificial bone flaps as implants for patients with large DC could be reasonable. Still, the risk factors for BFR are not clearly understood. Therefore, chronological analysis and pathophysiologic studies are needed.
Affiliation(s)
- Yohan Son
  - Department of Neurosurgery, Dankook University Hospital, Cheonan, Korea
- Jaewoo Chung
  - Department of Neurosurgery, Dankook University Hospital, Cheonan, Korea
  - Department of Neurosurgery, College of Medicine, Dankook University, Cheonan, Korea
5
Chung J, Cheong JH, Kim JM, Lee DH, Yi HJ, Choi KS, Ahn JS, Park JC, Park W. Is Fetal-Type Posterior Cerebral Artery a Risk Factor for Recurrence in Coiled Internal Carotid Artery-Incorporating Posterior Communicating Artery Aneurysms? Analysis of Conventional Statistics, Computational Fluid Dynamics, and Random Forest With Hyper-Ensemble Approach. Neurosurgery 2023; 93:611-621. [PMID: 37057916 DOI: 10.1227/neu.0000000000002458] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 10/12/2022] [Accepted: 01/20/2023] [Indexed: 04/15/2023] Open
Abstract
BACKGROUND The fetal-type posterior cerebral artery (FPCA) has been regarded as a risk factor for recurrence in coiled internal carotid artery-incorporating posterior communicating artery (ICA-PCoA) aneurysms. However, this has not been proven in previous studies. OBJECTIVE To reveal the impact of FPCA on the recurrence of ICA-PCoA aneurysms using conventional statistical analysis, computational fluid dynamics (CFD) simulation, and random forest with hyper-ensemble approach (RF with HEA). METHODS Vascular parameters and clinical information from patients who underwent coil embolization of ICA-PCoA aneurysms from January 2011 to December 2016 were obtained. Conventional statistical analysis was applied to a total of 95 cases obtained from patients with a follow-up of more than 6 months. For CFD simulation, 3 sets of three-dimensional models were used to understand the hemodynamic characteristics of various FPCAs. The RF with HEA was applied to reinforce the clinical data analysis. RESULTS The conventional statistical analysis fails to reveal that FPCA is a risk factor. CFD analysis shows that the diameter of FPCA alone is less likely to be a risk factor. The RF with HEA shows that the impact of FPCA is also minor compared with that of the packing density in the recurrence of coiled ICA-PCoA aneurysms. CONCLUSION The gathered results of all 3 analyses provide clearer evidence that FPCA is not a risk factor for coiled ICA-PCoA aneurysms. Hence, we may conclude that FPCA itself is unlikely to be the major risk factor in the recurrence of coiled ICA-PCoA aneurysms.
Affiliation(s)
- Jaewoo Chung
  - Department of Neurosurgery, Dankook University, Cheonan, Republic of Korea
- Jin Hwan Cheong
  - Department of Neurosurgery, Hanyang University Guri Hospital, Guri, Republic of Korea
- Jae Min Kim
  - Department of Neurosurgery, Hanyang University Guri Hospital, Guri, Republic of Korea
- Deok Hee Lee
  - Department of Radiology, University of Ulsan College of Medicine, Asan Medical Center, Seoul, Republic of Korea
- Hyeong-Joong Yi
  - Department of Neurosurgery, Hanyang University Medical Center, Seoul, Republic of Korea
- Kyu-Sun Choi
  - Department of Neurosurgery, Hanyang University Medical Center, Seoul, Republic of Korea
- Jae Sung Ahn
  - Department of Neurosurgery, University of Ulsan College of Medicine, Asan Medical Center, Seoul, Republic of Korea
- Jung Cheol Park
  - Department of Neurosurgery, University of Ulsan College of Medicine, Asan Medical Center, Seoul, Republic of Korea
- Wonhyoung Park
  - Department of Neurosurgery, University of Ulsan College of Medicine, Asan Medical Center, Seoul, Republic of Korea
6
Kosolwattana T, Liu C, Hu R, Han S, Chen H, Lin Y. A self-inspected adaptive SMOTE algorithm (SASMOTE) for highly imbalanced data classification in healthcare. BioData Min 2023; 16:15. [PMID: 37098549 PMCID: PMC10131309 DOI: 10.1186/s13040-023-00330-4] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 05/12/2022] [Accepted: 03/09/2023] [Indexed: 04/27/2023] Open
Abstract
In many healthcare applications, datasets for classification may be highly imbalanced due to the rare occurrence of target events such as disease onset. The SMOTE (Synthetic Minority Over-sampling Technique) algorithm has been developed as an effective resampling method for imbalanced data classification by oversampling samples from the minority class. However, samples generated by SMOTE may be ambiguous, low-quality and non-separable from the majority class. To enhance the quality of generated samples, we proposed a novel self-inspected adaptive SMOTE (SASMOTE) model that leverages an adaptive nearest neighborhood selection algorithm to identify the "visible" nearest neighbors, which are used to generate samples likely to fall into the minority class. To further enhance the quality of the generated samples, an uncertainty elimination via self-inspection approach is introduced in the proposed SASMOTE model. Its objective is to filter out the generated samples that are highly uncertain and inseparable from the majority class. The effectiveness of the proposed algorithm is compared with existing SMOTE-based algorithms and demonstrated through two real-world case studies in healthcare, including risk gene discovery and fatal congenital heart disease prediction. By generating higher-quality synthetic samples, the proposed algorithm helps achieve better prediction performance (in terms of F1 score) on average compared to the other methods, which is promising for enhancing the usability of machine learning models on highly imbalanced healthcare data.
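For orientation, the baseline SMOTE interpolation step that this abstract builds on can be sketched as follows. This is plain SMOTE only: it does not implement SASMOTE's adaptive "visible" neighbor selection or its self-inspection filter, and the function name `smote_sample`, the toy points, and the choice of `k` are illustrative assumptions.

```python
import random

def smote_sample(x, minority, k=2, rng=random):
    """Create one synthetic minority point by interpolating x toward a random
    one of its k nearest minority-class neighbours (classic SMOTE step)."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    # k nearest minority neighbours of x, excluding x itself.
    neighbours = sorted(
        (p for p in minority if p is not x),
        key=lambda p: sq_dist(x, p),
    )[:k]
    nb = rng.choice(neighbours)
    u = rng.random()  # interpolation weight in [0, 1)
    return [xi + u * (ni - xi) for xi, ni in zip(x, nb)]

# Toy minority class in two dimensions (hypothetical data).
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
rng = random.Random(0)  # seeded for repeatability
synthetic = smote_sample(minority[0], minority, k=2, rng=rng)
print(synthetic)
```

Each synthetic point lies on the segment between a minority sample and one of its neighbors, which is why such points can land in regions dominated by the majority class; that is the weakness the abstract says SASMOTE targets.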
Affiliation(s)
- Chenang Liu
  - School of Industrial Engineering & Management, Oklahoma State University, Stillwater, USA
- Renjie Hu
  - Department of Information and Logistics Technology, University of Houston, Houston, USA
- Shizhong Han
  - Department of Psychiatry and Behavioral Sciences, Johns Hopkins School of Medicine, Baltimore, USA
  - Lieber Institute for Brain Development, Baltimore, USA
- Hua Chen
  - Department of Pharmaceutical Health Outcomes and Policy, University of Houston, Houston, USA
- Ying Lin
  - Department of Industrial Engineering, University of Houston, Houston, USA
7
Phase-specific signatures of wound fibroblasts and matrix patterns define cancer-associated fibroblast subtypes. Matrix Biol 2023; 119:19-56. [PMID: 36914141 DOI: 10.1016/j.matbio.2023.03.003] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 11/18/2022] [Revised: 01/23/2023] [Accepted: 03/02/2023] [Indexed: 03/13/2023]
Abstract
Healing wounds and cancers present remarkable cellular and molecular parallels, but the specific roles of the healing phases are largely unknown. We developed a bioinformatics pipeline to identify genes and pathways that define distinct phases across the time-course of healing. Their comparison to cancer transcriptomes revealed that a resolution phase wound signature is associated with increased severity in skin cancer and enriches for extracellular matrix-related pathways. Comparisons of transcriptomes of early- and late-phase wound fibroblasts vs skin cancer-associated fibroblasts (CAFs) identified an "early wound" CAF subtype, which localizes to the inner tumor stroma and expresses collagen-related genes that are controlled by the RUNX2 transcription factor. A "late wound" CAF subtype localizes to the outer tumor stroma and expresses elastin-related genes. Matrix imaging of primary melanoma tissue microarrays validated these matrix signatures and identified collagen- vs elastin-rich niches within the tumor microenvironment, whose spatial organization predicts survival and recurrence. These results identify wound-regulated genes and matrix patterns with prognostic potential in skin cancer.
8
Kim WP, Kim HJ, Pack SP, Lim JH, Cho CH, Lee HJ. Machine Learning-Based Prediction of Attention-Deficit/Hyperactivity Disorder and Sleep Problems With Wearable Data in Children. JAMA Netw Open 2023; 6:e233502. [PMID: 36930149 PMCID: PMC10024208 DOI: 10.1001/jamanetworkopen.2023.3502] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 03/18/2023] Open
Abstract
IMPORTANCE Early detection of attention-deficit/hyperactivity disorder (ADHD) and sleep problems is paramount for children's mental health. Interview-based diagnostic approaches have drawbacks, necessitating the development of an evaluation method that uses digital phenotypes in daily life. OBJECTIVE To evaluate the predictive performance of machine learning (ML) models by setting the data obtained from personal digital devices comprising training features (ie, wearable data) and diagnostic results of ADHD and sleep problems by the Kiddie Schedule for Affective Disorders and Schizophrenia Present and Lifetime Version for Diagnostic and Statistical Manual of Mental Disorders, 5th edition (K-SADS) as a prediction class from the Adolescent Brain Cognitive Development (ABCD) study. DESIGN, SETTING, AND PARTICIPANTS In this diagnostic study, wearable data and K-SADS data were collected at 21 sites in the US in the ABCD study (release 3.0, November 2, 2020, analyzed October 11, 2021). Screening data from 6571 patients and 21 days of wearable data from 5725 patients collected at the 2-year follow-up were used, and circadian rhythm-based features were generated for each participant. A total of 12 348 wearable data for ADHD and 39 160 for sleep problems were merged for developing ML models. MAIN OUTCOMES AND MEASURES The average performance of the ML models was measured using an area under the receiver operating characteristics curve (AUC), sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV). In addition, the Shapley Additive Explanations value was used to calculate the importance of features. RESULTS The final population consisted of 79 children with ADHD problems (mean [SD] age, 144.5 [8.1] months; 55 [69.6%] males) vs 1011 controls and 68 with sleep problems (mean [SD] age, 143.5 [7.5] months; 38 [55.9%] males) vs 3346 controls. The ML models showed reasonable predictive performance for ADHD (AUC, 0.798; sensitivity, 0.756; specificity, 0.716; PPV, 0.159; and NPV, 0.976) and sleep problems (AUC, 0.737; sensitivity, 0.743; specificity, 0.632; PPV, 0.036; and NPV, 0.992). CONCLUSIONS AND RELEVANCE In this diagnostic study, an ML method for early detection or screening using digital phenotypes in children's daily lives was developed. The results support facilitating early detection in children; however, additional follow-up studies can improve its performance.
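The pairing of a low PPV with a high NPV in these results follows from the low prevalence of cases, which a short Bayes-rule sketch makes concrete. The function names are illustrative, and the prevalence is taken here as the pooled case fraction (79 of 1090 for ADHD), which is an assumption: the abstract reports averaged model performance, so this back-of-envelope calculation is not expected to reproduce the published PPV and NPV exactly.

```python
def ppv(sensitivity, specificity, prevalence):
    """Positive predictive value from Bayes' rule."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)

def npv(sensitivity, specificity, prevalence):
    """Negative predictive value from Bayes' rule."""
    true_neg = specificity * (1.0 - prevalence)
    false_neg = (1.0 - sensitivity) * prevalence
    return true_neg / (true_neg + false_neg)

# Sensitivity and specificity as reported for ADHD in the abstract;
# prevalence assumed to be the pooled case fraction.
prevalence_adhd = 79 / (79 + 1011)
adhd_ppv = ppv(0.756, 0.716, prevalence_adhd)
adhd_npv = npv(0.756, 0.716, prevalence_adhd)
print(f"PPV ~ {adhd_ppv:.3f}, NPV ~ {adhd_npv:.3f}")
```

With cases making up roughly 7% of the sample, false positives from the much larger control group outnumber true positives even at moderate specificity, so most positive predictions are wrong while almost all negative predictions are right.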
Affiliation(s)
- Won-Pyo Kim
  - LumanLab Inc, R&D Center, Seoul, South Korea
- Hyun-Jin Kim
  - Department of Psychiatry, Chungnam National University Sejong Hospital, Sejong, South Korea
- Seung Pil Pack
  - Department of Biotechnology and Bioinformatics, Korea University, Sejong, South Korea
- Chul-Hyun Cho
  - Department of Psychiatry, Korea University College of Medicine, Seoul, South Korea
  - Department of Biomedical Informatics, Korea University College of Medicine, Seoul, South Korea
  - Chronobiology Institute, Korea University, Seoul, South Korea
- Heon-Jeong Lee
  - Department of Psychiatry, Korea University College of Medicine, Seoul, South Korea
  - Chronobiology Institute, Korea University, Seoul, South Korea
9
Schubach M, Nazaretyan L, Kircher M. The Regulatory Mendelian Mutation score for GRCh38. Gigascience 2022; 12:giad024. [PMID: 37083939 PMCID: PMC10120424 DOI: 10.1093/gigascience/giad024] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 09/02/2022] [Revised: 01/10/2023] [Accepted: 03/21/2023] [Indexed: 04/22/2023] Open
Abstract
BACKGROUND Genome sequencing efforts for individuals with rare Mendelian disease have increased the research focus on the noncoding genome and the clinical need for methods that prioritize potentially disease causal noncoding variants. Some tools for assessment of variant pathogenicity as well as annotations are not available for the current human genome build (GRCh38), for which the adoption in databases, software, and pipelines was slow. RESULTS Here, we present an updated version of the Regulatory Mendelian Mutation (ReMM) score, retrained on features and variants derived from the GRCh38 genome build. Like its GRCh37 version, it achieves good performance on its highly imbalanced data. To improve accessibility and provide users with a toolbox to score their variant files and look up scores in the genome, we developed a website and API for easy score lookup. CONCLUSIONS Scores of the GRCh38 genome build are highly correlated to the prior release with a performance increase due to the better coverage of features. For prioritization of noncoding mutations in imbalanced datasets, the ReMM score performed much better than other variation scores. Prescored whole-genome files of GRCh37 and GRCh38 genome builds are cited in the article and the website; UCSC genome browser tracks, and an API are available at https://remm.bihealth.org.
Affiliation(s)
- Max Schubach
  - Exploratory Diagnostic Sciences, Berlin Institute of Health at Charité–Universitätsmedizin Berlin, 10117 Berlin, Germany
- Lusiné Nazaretyan
  - Exploratory Diagnostic Sciences, Berlin Institute of Health at Charité–Universitätsmedizin Berlin, 10117 Berlin, Germany
- Martin Kircher
  - Exploratory Diagnostic Sciences, Berlin Institute of Health at Charité–Universitätsmedizin Berlin, 10117 Berlin, Germany
  - Institute of Human Genetics, University Medical Center Schleswig-Holstein, University of Lübeck, 23562 Lübeck, Germany
10
Cappelletti L, Petrini A, Gliozzo J, Casiraghi E, Schubach M, Kircher M, Valentini G. Boosting tissue-specific prediction of active cis-regulatory regions through deep learning and Bayesian optimization techniques. BMC Bioinformatics 2022; 23:154. [PMID: 36510125 PMCID: PMC9743524 DOI: 10.1186/s12859-022-04582-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/17/2022] [Accepted: 01/20/2022] [Indexed: 12/14/2022] Open
Abstract
BACKGROUND Cis-regulatory regions (CRRs) are non-coding regions of the DNA that finely control the spatio-temporal pattern of transcription; they are involved in a wide range of pivotal processes such as the development of specific cell-lines/tissues and the dynamic cell response to physiological stimuli. Recent studies showed that genetic variants occurring in CRRs are strongly correlated with pathogenicity or deleteriousness. Considering the central role of CRRs in the regulation of physiological and pathological conditions, the correct identification of CRRs and of their tissue-specific activity status through Machine Learning methods plays a major role in dissecting the impact of genetic variants on human diseases. Unfortunately, the problem is still open, though some promising results have already been reported by (deep) machine-learning based methods that predict active promoters and enhancers in specific tissues or cell lines by encoding epigenetic or spectral features directly extracted from DNA sequences. RESULTS We present the experiments we performed to compare two Deep Neural Networks, a Feed-Forward Neural Network model working on epigenomic features, and a Convolutional Neural Network model working only on genomic sequence, targeted to the identification of enhancer- and promoter-activity in specific cell lines. While performing experiments to understand how the experimental setup influences the prediction performance of the methods, we particularly focused on (1) automatic model selection performed by Bayesian optimization and (2) exploring different data rebalancing setups for reducing negative unbalancing effects. CONCLUSIONS Results show that (1) automatic model selection by Bayesian optimization improves the quality of the learner; (2) data rebalancing considerably impacts the prediction performance of the models; test set rebalancing may provide over-optimistic results, and should therefore be cautiously applied; (3) despite working only on sequence data, convolutional models obtain performance close to that of feed-forward models working on epigenomic information, which suggests that sequence data also carries informative content for CRR-activity prediction. We therefore suggest combining both models/data types in future work.
Affiliation(s)
- Luca Cappelletti
  - AnacletoLab, Dipartimento di Informatica, Università degli Studi di Milano, Milan, Italy
- Alessandro Petrini
  - AnacletoLab, Dipartimento di Informatica, Università degli Studi di Milano, Milan, Italy
- Jessica Gliozzo
  - AnacletoLab, Dipartimento di Informatica, Università degli Studi di Milano, Milan, Italy
- Elena Casiraghi
  - AnacletoLab, Dipartimento di Informatica, Università degli Studi di Milano, Milan, Italy
- Max Schubach
  - Berlin Institute of Health at Charité, Universitätsmedizin Berlin, Berlin, Germany
- Martin Kircher
  - Berlin Institute of Health at Charité, Universitätsmedizin Berlin, Berlin, Germany
- Giorgio Valentini
  - AnacletoLab, Dipartimento di Informatica, Università degli Studi di Milano, Milan, Italy
  - European Laboratory for Learning and Intelligent Systems (ELLIS), Berlin, Germany
  - CINI National Laboratory of Artificial Intelligence and Intelligent Systems (AIIS), Rome, Italy
  - Data Science Research Center, Università degli Studi di Milano, Milan, Italy
11
Manduchi E, Romano JD, Moore JH. The promise of automated machine learning for the genetic analysis of complex traits. Hum Genet 2022; 141:1529-1544. [PMID: 34713318 PMCID: PMC9360157 DOI: 10.1007/s00439-021-02393-x] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 04/21/2021] [Accepted: 10/22/2021] [Indexed: 12/24/2022]
Abstract
The genetic analysis of complex traits has been dominated by parametric statistical methods due to their theoretical properties, ease of use, computational efficiency, and intuitive interpretation. However, there are likely to be patterns arising from complex genetic architectures which are more easily detected and modeled using machine learning methods. Unfortunately, selecting the right machine learning algorithm and tuning its hyperparameters can be daunting for experts and non-experts alike. The goal of automated machine learning (AutoML) is to let a computer algorithm identify the right algorithms and hyperparameters thus taking the guesswork out of the optimization process. We review the promises and challenges of AutoML for the genetic analysis of complex traits and give an overview of several approaches and some example applications to omics data. It is our hope that this review will motivate studies to develop and evaluate novel AutoML methods and software in the genetics and genomics space. The promise of AutoML is to enable anyone, regardless of training or expertise, to apply machine learning as part of their genetic analysis strategy.
Affiliation(s)
- Elisabetta Manduchi
  - Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania, Philadelphia, PA, 19104, USA
  - Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, PA, 19104, USA
- Joseph D Romano
  - Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania, Philadelphia, PA, 19104, USA
- Jason H Moore
  - Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania, Philadelphia, PA, 19104, USA
  - Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, PA, 19104, USA
12
|
A Romero RA, Y Deypalan MN, Mehrotra S, Jungao JT, Sheils NE, Manduchi E, Moore JH. Benchmarking AutoML frameworks for disease prediction using medical claims. BioData Min 2022; 15:15. [PMID: 35883154 PMCID: PMC9327416 DOI: 10.1186/s13040-022-00300-2] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 12/07/2021] [Accepted: 06/27/2022] [Indexed: 11/10/2022] Open
Abstract
Objectives: Ascertain and compare the performances of Automated Machine Learning (AutoML) tools on large, highly imbalanced healthcare datasets. Materials and Methods: We generated a large dataset using historical de-identified administrative claims including demographic information and flags for disease codes in four different time windows prior to 2019. We then trained three AutoML tools on this dataset to predict six different disease outcomes in 2019 and evaluated model performances on several metrics. Results: The AutoML tools showed improvement from the baseline random forest model but did not differ significantly from each other. All models recorded low area under the precision-recall curve and failed to predict true positives while keeping the true negative rate high. Model performance was not directly related to prevalence. We provide a specific use-case to illustrate how to select a threshold that gives the best balance between true and false positive rates, as this is an important consideration in medical applications. Discussion: Healthcare datasets present several challenges for AutoML tools, including large sample size, high imbalance, and limitations in the available features. Improvements in scalability, combinations of imbalance-learning resampling and ensemble approaches, and curated feature selection are possible next steps to achieve better performance. Conclusion: Among the three explored, no AutoML tool consistently outperforms the rest in terms of predictive performance. The performances of the models in this study suggest that there may be room for improvement in handling medical claims data. Finally, selection of the optimal prediction threshold should be guided by the specific practical application. Supplementary Information: The online version contains supplementary material available at 10.1186/s13040-022-00300-2.
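As a hedged illustration of the threshold-selection point in the abstract above, the following minimal sketch picks the cut-off that maximizes Youden's J (true positive rate minus false positive rate). This criterion is an assumption chosen for illustration; the abstract does not say which rule the authors used, and the function name `best_threshold` and the toy labels and scores are hypothetical.

```python
def best_threshold(y_true, scores):
    """Return (threshold, tpr, fpr) maximizing Youden's J = TPR - FPR.

    Illustrative only: a sample is predicted positive when score >= threshold.
    Assumes y_true contains at least one 0 and one 1.
    """
    positives = sum(y_true)
    negatives = len(y_true) - positives
    best = None
    for t in sorted(set(scores)):
        tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= t)
        fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= t)
        tpr = tp / positives
        fpr = fp / negatives
        j = tpr - fpr
        # Keep the first threshold that achieves the highest J.
        if best is None or j > best[0]:
            best = (j, t, tpr, fpr)
    return best[1], best[2], best[3]


# Hypothetical imbalanced toy data: 8 negatives, 2 positives.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.05, 0.10, 0.15, 0.20, 0.30, 0.35, 0.40, 0.60, 0.55, 0.80]
threshold, tpr, fpr = best_threshold(y_true, scores)
print(threshold, tpr, fpr)
```

In a clinical setting the relative cost of false negatives and false positives may justify a different criterion, which is consistent with the abstract's remark that the threshold should be guided by the practical application.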
Affiliation(s)
- Elisabetta Manduchi
- Department of Computational Biomedicine, Cedars-Sinai Medical Center, 700 N. San Vicente Blvd., Pacific Design Center Suite G540, West Hollywood, 90069, CA, USA
- Jason H Moore
- Department of Computational Biomedicine, Cedars-Sinai Medical Center, 700 N. San Vicente Blvd., Pacific Design Center Suite G540, West Hollywood, 90069, CA, USA.
|
13
|
Cretu I, Tindale A, Abbod M, Khir AW, Mason MJ, Balachandran W, Meng H. Techniques to aid prediction of pacing dependence at 30 days in patients requiring pacemaker implantation after cardiac surgery. Annual International Conference of the IEEE Engineering in Medicine and Biology Society 2022; 2022:2647-2650. [PMID: 36085840 DOI: 10.1109/embc48229.2022.9871616] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/15/2023]
Abstract
Permanent pacemaker (PPM) implantation occurs in up to 5% of patients after cardiac surgery but there is little consensus on how long to wait between surgery and PPM insertion. Predicting the likelihood of a patient being pacing dependent 30 days after implant can aid with this timing decision and avoid unnecessary observation time waiting for intrinsic conduction to recover. In this paper, we introduce a new approach for the prediction of PPM dependency at 30 days after implant in patients who have undergone recent cardiac surgery. The aim is to create an automatic detection model able to support clinicians in the decision-making process. We first applied Synthetic Minority Oversampling Technique (SMOTE) and Bayesian Networks (BN) to the dataset, to balance the inherently imbalanced data and create additional synthetic data, respectively. The six resultant datasets were then used to train four different classifiers to predict pacing dependence at 30 days, all using the same testing set. The Bagged Trees classifier achieved the best results, reaching an area under the receiver operating characteristic curve (AUC) of 90% in the train phase, and 83% in the test phase. The overall classification performance was clearly enhanced when using SMOTE and synthetic data created with BN to create a combined and balanced dataset. This technique could be of great use in answering clinical questions where the original dataset is imbalanced.
|
14
|
New Developments and Possibilities in Reanalysis and Reinterpretation of Whole Exome Sequencing Datasets for Unsolved Rare Diseases Using Machine Learning Approaches. Int J Mol Sci 2022; 23:ijms23126792. [PMID: 35743235 PMCID: PMC9224427 DOI: 10.3390/ijms23126792] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 04/30/2022] [Revised: 06/13/2022] [Accepted: 06/15/2022] [Indexed: 11/21/2022] Open
Abstract
Rare diseases impact the lives of 300 million people in the world. Rapid advances in bioinformatics and genomic technologies have enabled the discovery of causes of 20–30% of rare diseases. However, most rare diseases have remained as unsolved enigmas to date. Newer tools and availability of high throughput sequencing data have enabled the reanalysis of previously undiagnosed patients. In this review, we have systematically compiled the latest developments in the discovery of the genetic causes of rare diseases using machine learning methods. Importantly, we have detailed methods available to reanalyze existing whole exome sequencing data of unsolved rare diseases. We have identified different reanalysis methodologies to solve problems associated with sequence alterations/mutations, variation re-annotation, protein stability, splice isoform malfunctions and oligogenic analysis. In addition, we give an overview of new developments in the field of rare disease research using whole genome sequencing data and other omics.
|
15
|
Vadapalli S, Abdelhalim H, Zeeshan S, Ahmed Z. Artificial intelligence and machine learning approaches using gene expression and variant data for personalized medicine. Brief Bioinform 2022; 23:6590150. [PMID: 35595537 DOI: 10.1093/bib/bbac191] [Citation(s) in RCA: 37] [Impact Index Per Article: 12.3] [Received: 03/08/2022] [Revised: 04/02/2022] [Accepted: 04/26/2022] [Indexed: 12/16/2022] Open
Abstract
Precision medicine uses genetic, environmental and lifestyle factors to more accurately diagnose and treat disease in specific groups of patients, and it is considered one of the most promising medical efforts of our time. The use of genetics is arguably the most data-rich and complex component of precision medicine. The grand challenge today is the successful assimilation of genetics into precision medicine that translates across different ancestries, diverse diseases and other distinct populations, which will require clever use of artificial intelligence (AI) and machine learning (ML) methods. Our goal here was to review and compare scientific objectives, methodologies, datasets, data sources, ethics and gaps of AI/ML approaches used in genomics and precision medicine. We selected high-quality literature published within the last 5 years that was indexed and available through PubMed Central. Our scope was narrowed to articles that reported application of AI/ML algorithms for statistical and predictive analyses using whole genome and/or whole exome sequencing for gene variants, and RNA-seq and microarrays for gene expression. We did not limit our search to specific diseases or data sources. Based on the scope of our review and comparative analysis criteria, we identified 32 different AI/ML approaches applied in various genomics studies and report widely adopted AI/ML algorithms for predictive diagnostics across several diseases.
Affiliation(s)
- Sreya Vadapalli
- Rutgers Institute for Health, Health Care Policy and Aging Research, Rutgers University, 112 Paterson St, New Brunswick, NJ, USA
- Habiba Abdelhalim
- Rutgers Institute for Health, Health Care Policy and Aging Research, Rutgers University, 112 Paterson St, New Brunswick, NJ, USA
- Saman Zeeshan
- Rutgers Cancer Institute of New Jersey, Rutgers University, 195 Little Albany St, New Brunswick, NJ, USA
- Zeeshan Ahmed
- Rutgers Institute for Health, Health Care Policy and Aging Research, Rutgers University, 112 Paterson St, New Brunswick, NJ, USA; Department of Medicine, Robert Wood Johnson Medical School, Rutgers Biomedical and Health Sciences, 125 Paterson St, New Brunswick, NJ, USA
|
16
|
A Preliminary Study to Classify Corn Silage for High or Low Mycotoxin Contamination by Using near Infrared Spectroscopy. Toxins (Basel) 2022; 14:toxins14050323. [PMID: 35622570 PMCID: PMC9146547 DOI: 10.3390/toxins14050323] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 03/23/2022] [Revised: 04/21/2022] [Accepted: 04/29/2022] [Indexed: 12/30/2022] Open
Abstract
Mycotoxins should be monitored in order to properly evaluate corn silage safety quality. In the present study, corn silage samples (n = 115) were collected in a survey, characterized for concentrations of mycotoxins, and scanned by a NIR spectrometer. Random Forest classification models for NIR calibration were developed by applying different cut-offs to classify samples for concentration (i.e., μg/kg dry matter) or count (i.e., n) of (i) total detectable mycotoxins; (ii) regulated and emerging Fusarium toxins; (iii) emerging Fusarium toxins; (iv) Fumonisins and their metabolites; and (v) Penicillium toxins. An over- and under-sampling re-balancing technique was applied and performed 100 times. The best predictive model for total sum and count (i.e., accuracy mean ± standard deviation) was obtained by applying cut-offs of 10,000 µg/kg DM (i.e., 96.0 ± 2.7%) or 34 (i.e., 97.1 ± 1.8%), respectively. Regulated and emerging Fusarium mycotoxins achieved accuracies slightly less than 90%. For the Penicillium mycotoxin contamination category, an accuracy of 95.1 ± 2.8% was obtained by using a cut-off limit of 350 µg/kg DM as a total sum or 98.6 ± 1.3% for a cut-off limit of five as mycotoxin count. In conclusion, this work was a preliminary study to discriminate corn silage for high or low mycotoxin contamination by using NIR spectroscopy.
|
17
|
A New Algorithm for Multivariate Genome Wide Association Studies Based on Differential Evolution and Extreme Learning Machines. Mathematics 2022. [DOI: 10.3390/math10071024] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 12/13/2022]
Abstract
Genome-wide association studies (GWAS) are observational studies of a large set of genetic variants, whose aim is to find those that are linked to a certain trait or illness. Due to the multivariate nature of these kinds of studies, machine learning methodologies have already been applied to them, showing good performance. This work presents a new methodology for GWAS that makes use of extreme learning machines and differential evolution. The proposed methodology was tested with the help of the genetic information (370,750 single-nucleotide polymorphisms) of 2049 individuals, 1076 of whom suffer from colorectal cancer. The possible relationship of 10 different pathways with this illness was tested. The results achieved showed that the proposed methodology is suitable for detecting relevant pathways for the trait under analysis with a lower computational cost than other machine learning methodologies previously proposed.
|
18
|
Andrades R, Recamonde-Mendoza M. Machine learning methods for prediction of cancer driver genes: a survey paper. Brief Bioinform 2022; 23:6551145. [PMID: 35323900 DOI: 10.1093/bib/bbac062] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 10/06/2021] [Revised: 02/06/2022] [Accepted: 02/08/2022] [Indexed: 12/21/2022] Open
Abstract
Identifying the genes and mutations that drive the emergence of tumors is a critical step to improving our understanding of cancer and identifying new directions for disease diagnosis and treatment. Despite the large volume of genomics data, the precise detection of driver mutations and their carrying genes, known as cancer driver genes, from the millions of possible somatic mutations remains a challenge. Computational methods play an increasingly important role in discovering genomic patterns associated with cancer drivers and developing predictive models to identify these elements. Machine learning (ML), including deep learning, has been the engine behind many of these efforts and provides excellent opportunities for tackling remaining gaps in the field. Thus, this survey aims to perform a comprehensive analysis of ML-based computational approaches to identify cancer driver mutations and genes, providing an integrated, panoramic view of the broad data and algorithmic landscape within this scientific problem. We discuss how the interactions among data types and ML algorithms have been explored in previous solutions and outline current analytical limitations that deserve further attention from the scientific community. We hope that by helping readers become more familiar with significant developments in the field brought by ML, we may inspire new researchers to address open problems and advance our knowledge towards cancer driver discovery.
Affiliation(s)
- Renan Andrades
- Institute of Informatics, Universidade Federal do Rio Grande do Sul, Porto Alegre/RS, Brazil; Bioinformatics Core, Hospital de Clínicas de Porto Alegre, Porto Alegre/RS, Brazil
- Mariana Recamonde-Mendoza
- Institute of Informatics, Universidade Federal do Rio Grande do Sul, Porto Alegre/RS, Brazil; Bioinformatics Core, Hospital de Clínicas de Porto Alegre, Porto Alegre/RS, Brazil
|
19
|
Predicting Children with ADHD Using Behavioral Activity: A Machine Learning Analysis. Applied Sciences-Basel 2022. [DOI: 10.3390/app12052737] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Indexed: 11/16/2022]
Abstract
Attention deficit hyperactivity disorder (ADHD) is one of childhood’s most frequent neurobehavioral disorders. The purpose of this study is to: (i) extract the most prominent risk factors for children with ADHD; and (ii) propose a machine learning (ML)-based approach to classify children as either having ADHD or healthy. We extracted the data of 45,779 children aged 3–17 years from the 2018–2019 National Survey of Children’s Health (NSCH, 2018–2019). A total of 5218 (11.4%) children had ADHD, and the rest of the children were healthy. Since the class label is highly imbalanced, we adopted a combination of oversampling and undersampling approaches to make a balanced class label. We adopted logistic regression (LR) to extract the significant factors for children with ADHD based on p-values (<0.05). Eight ML-based classifiers, namely random forest (RF), Naïve Bayes (NB), decision tree (DT), XGBoost, k-nearest neighbors (KNN), multilayer perceptron (MLP), support vector machine (SVM), and 1-dimensional convolutional neural network (1D CNN), were adopted for the prediction of children with ADHD. The average age of the children with ADHD was 12.4 ± 3.4 years. Our findings showed that the RF-based classifier provided the highest classification accuracy of 85.5%, sensitivity of 84.4%, specificity of 86.4%, and an AUC of 0.94. This study illustrated that an LR with RF-based system could provide excellent accuracy for classifying and predicting children with ADHD. This system will be helpful for early detection and diagnosis of ADHD.
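The abstract above mentions combining oversampling and undersampling to balance the class label without giving details. A minimal sketch of one common variant, random undersampling of the majority class plus random oversampling (with replacement) of the minority class to a shared target size, could look as follows. The function name `rebalance`, the target size, and the toy data mirroring the 11.4% prevalence are all assumptions for illustration, not the authors' procedure.

```python
import random


def rebalance(majority, minority, target, seed=0):
    """Randomly resample both classes to `target` samples each.

    Illustrative only; requires len(minority) <= target <= len(majority).
    """
    rng = random.Random(seed)
    # Undersample the majority class without replacement.
    under = rng.sample(majority, target)
    # Keep every minority sample, then draw duplicates with replacement.
    over = minority + rng.choices(minority, k=target - len(minority))
    return under, over


# Hypothetical toy data: 1000 records with roughly 11.4% in the minority class.
majority = [("healthy", i) for i in range(886)]
minority = [("adhd", i) for i in range(114)]
under, over = rebalance(majority, minority, target=500)
print(len(under), len(over))
```

Resampling of this kind is normally applied to the training split only, so that evaluation metrics are computed on data with the original class distribution.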
|
20
|
Abstract
The scale of genetic, epigenomic, transcriptomic, cheminformatic and proteomic data available today, coupled with easy-to-use machine learning (ML) toolkits, has propelled the application of supervised learning in genomics research. However, the assumptions behind the statistical models and performance evaluations in ML software frequently are not met in biological systems. In this Review, we illustrate the impact of several common pitfalls encountered when applying supervised ML in genomics. We explore how the structure of genomics data can bias performance evaluations and predictions. To address the challenges associated with applying cutting-edge ML methods to genomics, we describe solutions and appropriate use cases where ML modelling shows great potential.
|
21
|
Artificial Intelligence and Cardiovascular Genetics. Life (Basel) 2022; 12:life12020279. [PMID: 35207566 PMCID: PMC8875522 DOI: 10.3390/life12020279] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 12/30/2021] [Revised: 01/26/2022] [Accepted: 02/09/2022] [Indexed: 12/13/2022] Open
Abstract
Polygenic diseases, which are genetic disorders caused by the combined action of multiple genes, pose unique and significant challenges for the diagnosis and management of affected patients. A major goal of cardiovascular medicine has been to understand how genetic variation leads to the clinical heterogeneity seen in polygenic cardiovascular diseases (CVDs). Recent advances and emerging technologies in artificial intelligence (AI), coupled with the ever-increasing availability of next generation sequencing (NGS) technologies, now provide researchers with unprecedented possibilities for dynamic and complex biological genomic analyses. Combining these technologies may lead to a deeper understanding of heterogeneous polygenic CVDs, better prognostic guidance, and, ultimately, greater personalized medicine. Advances will likely be achieved through increasingly frequent and robust genomic characterization of patients, as well as the integration of genomic data with other clinical data, such as cardiac imaging, coronary angiography, and clinical biomarkers. This review discusses the current opportunities and limitations of genomics; provides a brief overview of AI; and identifies the current applications, limitations, and future directions of AI in genomics.
|
22
|
Bhat HS, Reeves ME, Goldman‐Mellor S. Equity‐Weighted Bootstrapping: Examples and Analysis. Stat (Int Stat Inst) 2022. [DOI: 10.1002/sta4.456] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/07/2022]
Affiliation(s)
- Harish S. Bhat
- Applied Mathematics, University of California, Merced, CA, USA
|
23
|
Li R, Li L, Xu Y, Yang J. Machine learning meets omics: applications and perspectives. Brief Bioinform 2021; 23:6425809. [PMID: 34791021 DOI: 10.1093/bib/bbab460] [Citation(s) in RCA: 60] [Impact Index Per Article: 15.0] [Received: 08/23/2021] [Revised: 09/29/2021] [Accepted: 10/07/2021] [Indexed: 02/07/2023] Open
Abstract
The innovation of biotechnologies has allowed the accumulation of omics data at an alarming rate, thus introducing the era of 'big data'. Extracting inherent valuable knowledge from various omics data remains a daunting problem in bioinformatics. Better solutions often require more innovative methods for efficient handling and effective results. Recent advancements in integrated analysis and computational modeling of multi-omics data have helped address such needs in an increasingly harmonious manner. The development and application of machine learning have largely advanced our insights into biology and biomedicine and greatly promoted the development of therapeutic strategies, especially for precision medicine. Here, we present a comprehensive survey and discussion on what happened, is happening and will happen when machine learning meets omics. Specifically, we describe how artificial intelligence can be applied to omics studies and review recent advancements at the interface between machine learning and the ever-widening range of omics including genomics, transcriptomics, proteomics, metabolomics, radiomics, as well as those at the single-cell resolution. We also discuss and provide a synthesis of ideas, new insights, current challenges and perspectives of machine learning in omics.
Affiliation(s)
- Rufeng Li
- Department of Cell Biology and Genetics, School of Basic Medical Sciences, Xi'an Jiaotong University Health Science Center, Xi'an 710061, P. R. China
- Lixin Li
- Department of Cell Biology and Genetics, School of Basic Medical Sciences, Xi'an Jiaotong University Health Science Center, Xi'an 710061, P. R. China
- Yungang Xu
- School of Electronics and Information, Northwestern Polytechnical University, Xi'an, 710129, China
- Juan Yang
- Department of Cell Biology and Genetics, School of Basic Medical Sciences, Xi'an Jiaotong University Health Science Center, Xi'an 710061, P. R. China; Key Laboratory of Environment and Genes Related to Diseases (Xi'an Jiaotong University), Ministry of Education of China, Xi'an 710061, P. R. China
|
24
|
Connecting MHC-I-binding motifs with HLA alleles via deep learning. Commun Biol 2021; 4:1194. [PMID: 34663927 PMCID: PMC8523706 DOI: 10.1038/s42003-021-02716-8] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 02/14/2021] [Accepted: 09/24/2021] [Indexed: 12/17/2022] Open
Abstract
The selection of peptides presented by MHC molecules is crucial for antigen discovery. Previously, several predictors have shown impressive performance on binding affinity. However, the decisive MHC residues and their relation to the selection of binding peptides are still unrevealed. Here, we connected HLA alleles with binding motifs via our deep learning-based framework, MHCfovea. MHCfovea expanded the knowledge of MHC-I-binding motifs from 150 to 13,008 alleles. After clustering N-terminal and C-terminal sub-motifs on both observed and unobserved alleles, MHCfovea calculated the hyper-motifs and the corresponding allele signatures on the important positions to disclose the relation between binding motifs and MHC-I sequences. MHCfovea delivered 32 pairs of hyper-motifs and allele signatures (HLA-A: 13, HLA-B: 12, and HLA-C: 7). The paired hyper-motifs and allele signatures disclosed the critical polymorphic residues that determine the binding preference, which are believed to be valuable for antigen discovery and vaccine design when allele specificity is concerned.
Ko-Han Lee et al. develop MHCfovea, a machine-learning method for predicting peptide-binding by MHC molecules and inferring peptide motifs and MHC allele signatures. They demonstrate that MHCfovea is capable of detecting meaningful hyper-motifs and allele signatures, making it a useful resource for the community.
|
25
|
Mosquera-Lopez C, Wan E, Shastry M, Folsom J, Leitschuh J, Condon J, Rajhbeharrysingh U, Hildebrand A, Cameron M, Jacobs PG. Automated Detection of Real-World Falls: Modeled From People With Multiple Sclerosis. IEEE J Biomed Health Inform 2021; 25:1975-1984. [PMID: 33245698 DOI: 10.1109/jbhi.2020.3041035] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Indexed: 11/07/2022]
Abstract
Falls are a major health problem with one in three people over the age of 65 falling each year, oftentimes causing hip fractures, disability, reduced mobility, hospitalization and death. A major limitation in fall detection algorithm development is an absence of real-world falls data. Fall detection algorithms are typically trained on simulated fall data that contain a well-balanced number of examples of falls and activities of daily living. However, real-world falls occur infrequently, making them difficult to capture and causing severe data imbalance. People with multiple sclerosis (MS) fall frequently, and their risk of falling increases with disease progression. Because of their high fall incidence, people with MS provide an ideal model for studying falls. This paper describes the development of a context-aware fall detection system based on inertial sensors and time of flight sensors that is robust to imbalance, which is trained and evaluated on real-world falls in people with MS. The algorithm uses an auto-encoder that detects fall candidates using reconstruction error of accelerometer signals followed by a hyper-ensemble of balanced random forests trained using both acceleration and movement features. On a clinical dataset obtained from 25 people with MS monitored over eight weeks during free-living conditions, 54 falls were observed and our system achieved a sensitivity of 92.14%, and false-positive rate of 0.65 false alarms per day.
|
26
|
Suleiman M, Abu-Aqil G, Sharaha U, Riesenberg K, Sagi O, Lapidot I, Huleihel M, Salman A. Rapid detection of Klebsiella pneumoniae producing extended spectrum β lactamase enzymes by infrared microspectroscopy and machine learning algorithms. Analyst 2021; 146:1421-1429. [PMID: 33406182 DOI: 10.1039/d0an02182b] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Indexed: 12/24/2022]
Abstract
Antimicrobial drugs have played an indispensable role in decreasing morbidity and mortality associated with infectious diseases. However, the resistance of bacteria to a broad spectrum of commonly-used antibiotics has grown to the point of being a global health-care problem. One of the most important classes of multi-drug resistant bacteria is Extended Spectrum Beta-Lactamase-producing (ESBL+) bacteria. This increase in bacterial resistance to antibiotics is mainly due to the long time (about 48 h) that it takes to obtain lab results of detecting ESBL-producing bacteria. Thus, rapid detection of ESBL+ bacteria is highly important for efficient treatment of bacterial infections. In this study, we evaluated the potential of infrared microspectroscopy in tandem with machine learning algorithms for rapid detection of ESBL-producing Klebsiella pneumoniae (K. pneumoniae) obtained from samples of patients with urinary tract infections. 285 ESBL+ and 365 ESBL− K. pneumoniae samples, gathered from cultured colonies, were examined. Our results show that it is possible to determine that K. pneumoniae is ESBL+ with ∼89% accuracy, ∼88% sensitivity and ∼89% specificity, in a time span of ∼20 minutes following the initial culture.
Affiliation(s)
- Manal Suleiman
- Department of Microbiology, Immunology and Genetics, Faculty of Health Sciences, Ben-Gurion University of the Negev, Beer-Sheva 84105, Israel.
- George Abu-Aqil
- Department of Microbiology, Immunology and Genetics, Faculty of Health Sciences, Ben-Gurion University of the Negev, Beer-Sheva 84105, Israel.
- Uraib Sharaha
- Department of Microbiology, Immunology and Genetics, Faculty of Health Sciences, Ben-Gurion University of the Negev, Beer-Sheva 84105, Israel.
- Orli Sagi
- Director of Microbiology Laboratory, Soroka University Medical Center, Beer-Sheva 84105, Israel
- Itshak Lapidot
- Department of Electrical and Electronics Engineering, ACLP-Afeka Center for Language Processing, Afeka Tel-Aviv Academic College of Engineering, Tel-Aviv 69107, Israel
- Mahmoud Huleihel
- Department of Microbiology, Immunology and Genetics, Faculty of Health Sciences, Ben-Gurion University of the Negev, Beer-Sheva 84105, Israel.
- Ahmad Salman
- Department of Physics, SCE - Shamoon College of Engineering, Beer-Sheva 84100, Israel.
|
27
|
Dai X, Fu G, Zhao S, Zeng Y. Statistical Learning Methods Applicable to Genome-Wide Association Studies on Unbalanced Case-Control Disease Data. Genes (Basel) 2021; 12:genes12050736. [PMID: 34068248 PMCID: PMC8153154 DOI: 10.3390/genes12050736] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/23/2021] [Revised: 05/01/2021] [Accepted: 05/10/2021] [Indexed: 11/30/2022] Open
Abstract
Although imbalance between case and control groups is prevalent in genome-wide association studies (GWAS), it is often overlooked. This imbalance is becoming more significant and urgent as the rapid growth of biobanks and electronic health records has enabled the collection of thousands of phenotypes from large cohorts, in particular for diseases with low prevalence. The unbalanced binary traits pose serious challenges to traditional statistical methods in terms of both genomic selection and disease prediction. For example, the well-established linear mixed models (LMM) yield inflated type I error rates in the presence of unbalanced case-control ratios. In this article, we review multiple statistical approaches that have been developed to overcome the inaccuracy caused by the unbalanced case-control ratio, commenting on the advantages and limitations of each approach. In addition, we explore the potential for applying several powerful and popular state-of-the-art machine-learning approaches, which have not been applied to the GWAS field yet. This review paves the way for better analysis and understanding of the unbalanced case-control disease data in GWAS.
|
28
|
Coates JTT, Pirovano G, El Naqa I. Radiomic and radiogenomic modeling for radiotherapy: strategies, pitfalls, and challenges. J Med Imaging (Bellingham) 2021; 8:031902. [PMID: 33768134 PMCID: PMC7985651 DOI: 10.1117/1.jmi.8.3.031902] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 09/01/2020] [Accepted: 01/12/2021] [Indexed: 12/14/2022] Open
Abstract
The power of predictive modeling for radiotherapy outcomes has historically been limited by an inability to adequately capture patient-specific variabilities; however, next-generation platforms together with imaging technologies and powerful bioinformatic tools have facilitated strategies and provided optimism. Integrating clinical, biological, imaging, and treatment-specific data for more accurate prediction of tumor control probabilities or risk of radiation-induced side effects is a high-dimensional problem whose solution could have widespread benefits to a diverse patient population; we discuss technical approaches toward this objective. Increasing interest in the above is specifically reflected by the emergence of two nascent fields, which are distinct but complementary: radiogenomics, which broadly seeks to integrate biological risk factors together with treatment and diagnostic information to generate individualized patient risk profiles, and radiomics, which further leverages large-scale imaging correlates and extracted features for the same purpose. We review classical analytical and data-driven approaches for outcomes prediction that serve as antecedents to both radiomic and radiogenomic strategies. Discussion then focuses on uses of conventional and deep machine learning in radiomics. We further consider promising strategies for the harmonization of high-dimensional, heterogeneous multiomics datasets (panomics) and techniques for nonparametric validation of best-fit models. Strategies to overcome common pitfalls that are unique to data-intensive radiomics are also discussed.
Affiliation(s)
- James T. T. Coates
  - Massachusetts General Hospital & Harvard Medical School, Center for Cancer Research, Boston, Massachusetts, United States
- Giacomo Pirovano
  - Memorial Sloan Kettering Cancer Center, Department of Radiology, New York, New York, United States
- Issam El Naqa
  - Moffitt Cancer Center and Research Institute, Department of Machine Learning, Tampa, Florida, United States
|
29
|
Sun Z, Yin H, Chen H, Chen T, Cui L, Yang F. Disease Prediction via Graph Neural Networks. IEEE J Biomed Health Inform 2021; 25:818-826. [PMID: 32749976 DOI: 10.1109/jbhi.2020.3004143] [Citation(s) in RCA: 23] [Impact Index Per Article: 5.8] [Indexed: 11/08/2022]
Abstract
With the increasingly available electronic medical records (EMRs), disease prediction has recently gained immense research attention, where an accurate classifier needs to be trained to map the input prediction signals (e.g., symptoms, patient demographics, etc.) to the estimated diseases for each patient. However, existing machine learning-based solutions heavily rely on abundant manually labeled EMR training data to ensure satisfactory prediction results, impeding their performance in the existence of rare diseases that are subject to severe data scarcity. For each rare disease, the limited EMR data can hardly offer sufficient information for a model to correctly distinguish its identity from other diseases with similar clinical symptoms. Furthermore, most existing disease prediction approaches are based on the sequential EMRs collected for every patient and are unable to handle new patients without historical EMRs, reducing their real-life practicality. In this paper, we introduce an innovative model based on Graph Neural Networks (GNNs) for disease prediction, which utilizes external knowledge bases to augment the insufficient EMR data, and learns highly representative node embeddings for patients, diseases and symptoms from the medical concept graph and patient record graph respectively constructed from the medical knowledge base and EMRs. By aggregating information from directly connected neighbor nodes, the proposed neural graph encoder can effectively generate embeddings that capture knowledge from both data sources, and is able to inductively infer the embeddings for a new patient based on the symptoms reported in her/his EMRs to allow for accurate prediction on both general diseases and rare diseases. Extensive experiments on a real-world EMR dataset have demonstrated the state-of-the-art performance of our proposed model.
|
30
|
Prasad A, Bhargava H, Gupta A, Shukla N, Rajagopal S, Gupta S, Sharma A, Valadi J, Nigam V, Suravajhala P. Next Generation Sequencing. Adv Bioinformatics 2021. [DOI: 10.1007/978-981-33-6191-1_14] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/24/2022] Open
|
31
|
Elgart M, Redline S, Sofer T. Machine and Deep Learning in Molecular and Genetic Aspects of Sleep Research. Neurotherapeutics 2021; 18:228-243. [PMID: 33829409 PMCID: PMC8116376 DOI: 10.1007/s13311-021-01014-9] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Accepted: 01/18/2021] [Indexed: 12/11/2022] Open
Abstract
Epidemiological sleep research strives to identify the interactions and causal mechanisms by which sleep affects human health, and to design intervention strategies for improving sleep throughout the lifespan. These goals can be advanced by further focusing on the environmental and genetic etiology of sleep disorders, and by developing risk stratification algorithms to identify people who are at risk of, or are affected by, sleep disorders. These studies rely on comprehensive sleep-related data, which often contain complex multi-dimensional physiological and molecular measurements across multiple timepoints. Thus, sleep research is well-suited for the application of computational approaches that can handle high-dimensional data. Here, we survey recent advances in machine and deep learning together with the availability of large human cohort studies with sleep data that can jointly drive the next breakthroughs in the sleep-research field. We describe sleep-related data types and datasets, and present some of the tasks in the field that can be targets for algorithmic approaches, as well as the challenges and opportunities in pursuing them.
Affiliation(s)
- Michael Elgart
  - Division of Sleep and Circadian Disorders, Brigham and Women’s Hospital, Boston, MA, USA
  - Department of Medicine, Harvard Medical School, Boston, MA, USA
- Susan Redline
  - Division of Sleep and Circadian Disorders, Brigham and Women’s Hospital, Boston, MA, USA
  - Department of Medicine, Harvard Medical School, Boston, MA, USA
- Tamar Sofer
  - Division of Sleep and Circadian Disorders, Brigham and Women’s Hospital, Boston, MA, USA
  - Department of Medicine, Harvard Medical School, Boston, MA, USA
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA
|
32
|
Tran A, Walsh CJ, Batt J, Dos Santos CC, Hu P. A machine learning-based clinical tool for diagnosing myopathy using multi-cohort microarray expression profiles. J Transl Med 2020; 18:454. [PMID: 33256785 PMCID: PMC7708151 DOI: 10.1186/s12967-020-02630-3] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 10/21/2020] [Accepted: 11/23/2020] [Indexed: 11/29/2022] Open
Abstract
BACKGROUND: Myopathies are a heterogeneous collection of disorders characterized by dysfunction of skeletal muscle. In practice, myopathies are frequently encountered by physicians and precise diagnosis remains a challenge in primary care. Molecular expression profiles show promise for disease diagnosis in various pathologies. We propose a novel machine learning-based clinical tool for predicting muscle disease subtypes using multi-cohort microarray expression data. MATERIALS AND METHODS: Muscle tissue samples originated from 1260 patients with muscle weakness. Data were curated from 42 independent cohorts with expression profiles in public microarray gene expression repositories, which represent a broad range of patient ages and peripheral muscles. Cohorts were categorized into five muscle disease subtypes: immobility, inflammatory myopathies, intensive care unit acquired weakness (ICUAW), congenital, and chronic systemic disease. The data contain expression data on 34,099 genes. Data augmentation techniques were used to address class imbalances in the muscle disease subtypes. Support vector machine (SVM) models were trained on two-thirds of the 1260 samples based on the top selected gene signature using analysis of variance (ANOVA). The model was validated in the remaining samples using area under the receiver operating characteristic curve (AUC). Gene enrichment analysis was used to identify enriched biological functions in the gene signature. RESULTS: The AUC ranges from 0.611 to 0.649 in the observed imbalanced data. Overall, using the augmented data, chronic systemic disease was the best predicted class with AUC 0.872 (95% confidence interval (CI): 0.824-0.920). The least discriminated classes were ICUAW with AUC 0.777 (95% CI: 0.668-0.887) and immobility with AUC 0.789 (95% CI: 0.716-0.861). Disease-specific gene set enrichment results showed that the gene signature was enriched in biological processes including neural precursor cell proliferation for ICUAW and aerobic respiration for congenital (false discovery rate q-value < 0.001). CONCLUSION: Our results present a well-performing molecular classification tool with the selected gene markers for muscle disease classification. In practice, this tool addresses an important gap in the literature on myopathies and presents a potentially useful clinical tool for muscle disease subtype diagnosis.
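The AUC figures quoted in this abstract can be read as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The following is a minimal sketch of that pairwise definition, not the authors' evaluation code; the function name `auc` and the toy labels and scores are illustrative assumptions.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: iterable of 0/1 class labels (1 = positive class).
    scores: classifier scores, higher meaning "more likely positive".
    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example (hypothetical data): 3 of the 4 positive/negative pairs are ranked correctly.
print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

For a multi-class setting such as the five subtypes described above, one common convention is to compute this quantity one class against the rest; whether the study did so is not stated in the abstract.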
Affiliation(s)
- Andrew Tran
  - Division of Biostatistics, Dalla Lana School of Public Health, University of Toronto, Toronto, ON, Canada
- Chris J Walsh
  - Keenan Research Center for Biomedical Science, St. Michael's Hospital, Toronto, ON, Canada
  - Institute of Medical Sciences and Department of Medicine, University of Toronto, Toronto, ON, Canada
- Jane Batt
  - Keenan Research Center for Biomedical Science, St. Michael's Hospital, Toronto, ON, Canada
  - Interdepartmental Division of Critical Care, St. Michael's Hospital, University of Toronto, 30 Bond Street, Room 4-008, Toronto, ON, M5B 1WB, Canada
- Claudia C Dos Santos
  - Keenan Research Center for Biomedical Science, St. Michael's Hospital, Toronto, ON, Canada
  - Interdepartmental Division of Critical Care, St. Michael's Hospital, University of Toronto, 30 Bond Street, Room 4-008, Toronto, ON, M5B 1WB, Canada
- Pingzhao Hu
  - Division of Biostatistics, Dalla Lana School of Public Health, University of Toronto, Toronto, ON, Canada
  - Department of Biochemistry and Medical Genetics, University of Manitoba, 745 Bannatyne Avenue, Winnipeg, MB, R3E 0J9, Canada
  - Research Institute in Oncology and Hematology, Winnipeg, MB, Canada
|
33
|
Casiraghi E, Malchiodi D, Trucco G, Frasca M, Cappelletti L, Fontana T, Esposito AA, Avola E, Jachetti A, Reese J, Rizzi A, Robinson PN, Valentini G. Explainable Machine Learning for Early Assessment of COVID-19 Risk Prediction in Emergency Departments. IEEE ACCESS : PRACTICAL INNOVATIONS, OPEN SOLUTIONS 2020; 8:196299-196325. [PMID: 34812365 PMCID: PMC8545262 DOI: 10.1109/access.2020.3034032] [Citation(s) in RCA: 31] [Impact Index Per Article: 6.2] [Received: 10/16/2020] [Accepted: 10/19/2020] [Indexed: 05/06/2023]
Abstract
Between January and October of 2020, the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) virus infected more than 34 million persons in a worldwide pandemic leading to over one million deaths worldwide (data from the Johns Hopkins University). Since the virus began to spread, emergency departments were busy with COVID-19 patients for whom a quick decision regarding in- or outpatient care was required. The virus can cause characteristic abnormalities in chest radiographs (CXR), but, due to the low sensitivity of CXR, additional variables and criteria are needed to accurately predict risk. Here, we describe a computerized system primarily aimed at extracting the most relevant radiological, clinical, and laboratory variables for improving patient risk prediction, and secondarily at presenting an explainable machine learning system, which may provide simple decision criteria to be used by clinicians as a support for assessing patient risk. To achieve robust and reliable variable selection, Boruta and Random Forest (RF) are combined in a 10-fold cross-validation scheme to produce a variable importance estimate not biased by the presence of surrogates. The most important variables are then selected to train an RF classifier, whose rules may be extracted, simplified, and pruned to finally build an associative tree, particularly appealing for its simplicity. Results show that the radiological score automatically computed through a neural network is highly correlated with the score computed by radiologists, and that laboratory variables, together with the number of comorbidities, aid risk prediction. The prediction performance of our approach was compared to that of generalized linear models and shown to be effective and robust. The proposed machine learning-based computational system can be easily deployed and used in emergency departments for rapid and accurate risk prediction in COVID-19 patients.
Affiliation(s)
- Elena Casiraghi
  - Department of Computer Science “Giovanni degli Antoni,” Università degli Studi di Milano, 20133 Milan, Italy
  - CINI National Laboratory of Artificial Intelligence and Intelligent Systems (AIIS), Università di Roma, 00185 Roma, Italy
- Dario Malchiodi
  - Department of Computer Science “Giovanni degli Antoni,” Università degli Studi di Milano, 20133 Milan, Italy
  - CINI National Laboratory of Artificial Intelligence and Intelligent Systems (AIIS), Università di Roma, 00185 Roma, Italy
  - Data Science Research Center, Università degli Studi di Milano, 20133 Milan, Italy
- Gabriella Trucco
  - Department of Computer Science “Giovanni degli Antoni,” Università degli Studi di Milano, 20133 Milan, Italy
- Marco Frasca
  - Department of Computer Science “Giovanni degli Antoni,” Università degli Studi di Milano, 20133 Milan, Italy
- Luca Cappelletti
  - Department of Computer Science “Giovanni degli Antoni,” Università degli Studi di Milano, 20133 Milan, Italy
- Tommaso Fontana
  - Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, 20133 Milan, Italy
- Emanuele Avola
  - Postgraduate School in Radiodiagnostics, Università degli Studi di Milano, 20122 Milan, Italy
- Alessandro Jachetti
  - Accident and Emergency Department, Fondazione IRCCS Ca Granda Ospedale Maggiore Policlinico, 20122 Milan, Italy
- Justin Reese
  - Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
- Alessandro Rizzi
  - Department of Computer Science “Giovanni degli Antoni,” Università degli Studi di Milano, 20133 Milan, Italy
- Giorgio Valentini
  - Department of Computer Science “Giovanni degli Antoni,” Università degli Studi di Milano, 20133 Milan, Italy
  - CINI National Laboratory of Artificial Intelligence and Intelligent Systems (AIIS), Università di Roma, 00185 Roma, Italy
  - Data Science Research Center, Università degli Studi di Milano, 20133 Milan, Italy
|
34
|
Fujiwara K, Huang Y, Hori K, Nishioji K, Kobayashi M, Kamaguchi M, Kano M. Over- and Under-sampling Approach for Extremely Imbalanced and Small Minority Data Problem in Health Record Analysis. Front Public Health 2020; 8:178. [PMID: 32509717 PMCID: PMC7248318 DOI: 10.3389/fpubh.2020.00178] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.0] [Received: 01/01/2020] [Accepted: 04/22/2020] [Indexed: 11/23/2022] Open
Abstract
A considerable amount of health record (HR) data has been stored due to recent advances in the digitalization of medical systems. However, it is not always easy to analyze HR data, particularly when the number of persons with a target disease is too small in comparison with the population. This situation is called the imbalanced data problem. Over-sampling and under-sampling are two approaches for redressing an imbalance between minority and majority examples, which can be combined into ensemble algorithms. However, these approaches do not function when the absolute number of minority examples is small, which is called the extremely imbalanced and small minority (EISM) data problem. The present work proposes a new algorithm called boosting combined with heuristic under-sampling and distribution-based sampling (HUSDOS-Boost) to solve the EISM data problem. To make an artificially balanced dataset from the original imbalanced datasets, HUSDOS-Boost uses both under-sampling and over-sampling to eliminate redundant majority examples based on prior boosting results and to generate artificial minority examples by following the minority class distribution. The performance and characteristics of HUSDOS-Boost were evaluated through application to eight imbalanced datasets. In addition, the algorithm was applied to original clinical HR data to detect patients with stomach cancer. These results showed that HUSDOS-Boost outperformed current imbalanced data handling methods, particularly when the data are EISM. Thus, the proposed HUSDOS-Boost is a useful methodology of HR data analysis.
Affiliation(s)
- Koichi Fujiwara
  - Department of Material Process Engineering, Nagoya University, Nagoya, Japan
- Yukun Huang
  - Department of Systems Science, Kyoto University, Kyoto, Japan
- Kentaro Hori
  - Department of Systems Science, Kyoto University, Kyoto, Japan
- Kenichi Nishioji
  - Health Care Division, Japanese Red Cross Kyoto Daini Hospital, Kyoto, Japan
- Masao Kobayashi
  - Health Care Division, Japanese Red Cross Kyoto Daini Hospital, Kyoto, Japan
- Mai Kamaguchi
  - Health Care Division, Japanese Red Cross Kyoto Daini Hospital, Kyoto, Japan
- Manabu Kano
  - Department of Systems Science, Kyoto University, Kyoto, Japan
|
35
|
Bocher O, Génin E. Rare variant association testing in the non-coding genome. Hum Genet 2020; 139:1345-1362. [PMID: 32500240 DOI: 10.1007/s00439-020-02190-y] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.0] [Received: 04/22/2020] [Accepted: 05/29/2020] [Indexed: 12/25/2022]
Abstract
The development of next-generation sequencing technologies has opened up new possibilities to explore the contribution of genetic variants to human diseases, and in particular that of rare variants. Statistical methods have been developed to test for association with rare variants that require the definition of testing units and, in these testing units, the selection of qualifying variants to include in the test. In the coding regions of the genome, testing units are usually the different genes and qualifying variants are selected based on their functional effects on the encoded proteins. Extending these tests to the non-coding regions of the genome is challenging. Testing units are difficult to define as the non-coding genome organisation is still rather unknown. Qualifying variants are difficult to select as the functional impact of non-coding variants on gene expression is hard to predict. These difficulties could explain why very few investigators so far have analysed the non-coding parts of their whole genome sequencing data. Yet these non-coding parts represent the vast majority of the genome, and some studies suggest that they could play a major role in disease susceptibility. In this review, we discuss recent experimental and statistical developments to gain knowledge on the non-coding genome and how this knowledge could be used to include rare non-coding variants in association tests. We describe the few studies that have considered variants from the non-coding genome in association tests and how they managed to define testing units and select qualifying variants.
Affiliation(s)
- Ozvan Bocher
  - Génétique, Génomique Fonctionnelle Et Biotechnologies, Faculté de Médecine, Univ Brest, Inserm, Inserm UMR1078, Bâtiment E-IBRBS 2ieme étage, 22 avenue Camille Desmoulins, 29238, Brest Cedex 3, France
- Emmanuelle Génin
  - Génétique, Génomique Fonctionnelle Et Biotechnologies, Faculté de Médecine, Univ Brest, Inserm, Inserm UMR1078, Bâtiment E-IBRBS 2ieme étage, 22 avenue Camille Desmoulins, 29238, Brest Cedex 3, France
  - CHU Brest, Brest, France
|
36
|
Petrini A, Mesiti M, Schubach M, Frasca M, Danis D, Re M, Grossi G, Cappelletti L, Castrignanò T, Robinson PN, Valentini G. parSMURF, a high-performance computing tool for the genome-wide detection of pathogenic variants. Gigascience 2020; 9:giaa052. [PMID: 32444882 PMCID: PMC7244787 DOI: 10.1093/gigascience/giaa052] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 04/13/2019] [Revised: 10/31/2019] [Accepted: 04/28/2020] [Indexed: 12/11/2022] Open
Abstract
BACKGROUND: Several prediction problems in computational biology and genomic medicine are characterized by both big data as well as a high imbalance between examples to be learned, whereby positive examples can represent a tiny minority with respect to negative examples. For instance, deleterious or pathogenic variants are overwhelmed by the sea of neutral variants in the non-coding regions of the genome: thus, the prediction of deleterious variants is a challenging, highly imbalanced classification problem, and classical prediction tools fail to detect the rare pathogenic examples among the huge amount of neutral variants or undergo severe restrictions in managing big genomic data. RESULTS: To overcome these limitations we propose parSMURF, a method that adopts a hyper-ensemble approach and oversampling and undersampling techniques to deal with imbalanced data, and parallel computational techniques to both manage big genomic data and substantially speed up the computation. The synergy between Bayesian optimization techniques and the parallel nature of parSMURF enables efficient and user-friendly automatic tuning of the hyper-parameters of the algorithm, and allows specific learning problems in genomic medicine to be easily fit. Moreover, by using MPI parallel and machine learning ensemble techniques, parSMURF can manage big data by partitioning them across the nodes of a high-performance computing cluster. Results with synthetic data and with single-nucleotide variants associated with Mendelian diseases and with genome-wide association study hits in the non-coding regions of the human genome, involving millions of examples, show that parSMURF achieves state-of-the-art results and an 80-fold speed-up with respect to the sequential version. CONCLUSIONS: parSMURF is a parallel machine learning tool that can be trained to learn different genomic problems, and its multiple levels of parallelization and high scalability allow us to efficiently fit problems characterized by big and imbalanced genomic data. The C++ OpenMP multi-core version tailored to a single workstation and the C++ MPI/OpenMP hybrid multi-core and multi-node parSMURF version tailored to a High Performance Computing cluster are both available at https://github.com/AnacletoLAB/parSMURF.
Affiliation(s)
- Alessandro Petrini
  - Università degli Studi di Milano, AnacletoLab - Dipartimento di Informatica, via Giovanni Celoria 18, 20135 Milano, Italy
- Marco Mesiti
  - Università degli Studi di Milano, AnacletoLab - Dipartimento di Informatica, via Giovanni Celoria 18, 20135 Milano, Italy
- Max Schubach
  - Berlin Institute of Health (BIH), Anna-Louisa-Karsch-Straße 2, 10178 Berlin, Germany
  - Charité – Universitätsmedizin Berlin, Chariteplatz 1, 10117 Berlin, Germany
- Marco Frasca
  - Università degli Studi di Milano, AnacletoLab - Dipartimento di Informatica, via Giovanni Celoria 18, 20135 Milano, Italy
- Daniel Danis
  - The Jackson Laboratory for Genomic Medicine, 10 Discovery Drive, Farmington (CT) - 06032, United States of America
- Matteo Re
  - Università degli Studi di Milano, AnacletoLab - Dipartimento di Informatica, via Giovanni Celoria 18, 20135 Milano, Italy
- Giuliano Grossi
  - Università degli Studi di Milano, AnacletoLab - Dipartimento di Informatica, via Giovanni Celoria 18, 20135 Milano, Italy
- Luca Cappelletti
  - Università degli Studi di Milano, AnacletoLab - Dipartimento di Informatica, via Giovanni Celoria 18, 20135 Milano, Italy
- Tiziana Castrignanò
  - CINECA, SCAI SuperComputing Applications and Innovation Department, Via dei Tizii 6, 00185 Roma, Italy
  - University of Tuscia, Department of Ecological and Biological Sciences (DEB), Largo dell'Università snc, 01100 Viterbo, Italy
- Peter N Robinson
  - The Jackson Laboratory for Genomic Medicine, 10 Discovery Drive, Farmington (CT) - 06032, United States of America
- Giorgio Valentini
  - Università degli Studi di Milano, AnacletoLab - Dipartimento di Informatica, via Giovanni Celoria 18, 20135 Milano, Italy
  - CINI National Laboratory in Artificial Intelligence and Intelligent Systems - AIIS, Università di Roma, Via Ariosto 25, 00185 Roma, Italy
|
37
|
Nicholls HL, John CR, Watson DS, Munroe PB, Barnes MR, Cabrera CP. Reaching the End-Game for GWAS: Machine Learning Approaches for the Prioritization of Complex Disease Loci. Front Genet 2020; 11:350. [PMID: 32351543 PMCID: PMC7174742 DOI: 10.3389/fgene.2020.00350] [Citation(s) in RCA: 82] [Impact Index Per Article: 16.4] [Received: 12/19/2019] [Accepted: 03/23/2020] [Indexed: 12/21/2022] Open
Abstract
Genome-wide association studies (GWAS) have revealed thousands of genetic loci that underpin the complex biology of many human traits. However, the strength of GWAS - the ability to detect genetic association by linkage disequilibrium (LD) - is also its limitation. Whilst the ever-increasing study size and improved design have augmented the power of GWAS to detect effects, differentiation of causal variants or genes from other highly correlated genes associated by LD remains the real challenge. This has severely hindered the biological insights and clinical translation of GWAS findings. Although thousands of disease susceptibility loci have been reported, causal genes at these loci remain elusive. Machine learning (ML) techniques offer an opportunity to dissect the heterogeneity of variant and gene signals in the post-GWAS analysis phase. ML models for GWAS prioritization vary greatly in their complexity, ranging from relatively simple logistic regression approaches to more complex ensemble models such as random forests and gradient boosting, as well as deep learning models, i.e., neural networks. Paired with functional validation, these methods show important promise for clinical translation, providing a strong evidence-based approach to direct post-GWAS research. However, as ML approaches continue to evolve to meet the challenge of causal gene identification, a critical assessment of the underlying methodologies and their applicability to the GWAS prioritization problem is needed. This review investigates the landscape of ML applications in three parts: selected models, input features, and output model performance, with a focus on prioritizations of complex disease associated loci. Overall, we explore the contributions ML has made towards reaching the GWAS end-game with consequent wide-ranging translational impact.
Affiliation(s)
- Hannah L. Nicholls
  - Clinical Pharmacology, William Harvey Research Institute, Barts and The London School of Medicine and Dentistry, Queen Mary University of London, London, United Kingdom
  - Centre for Translational Bioinformatics, William Harvey Research Institute, Barts and the London School of Medicine and Dentistry, Queen Mary University of London, London, United Kingdom
- Christopher R. John
  - Centre for Translational Bioinformatics, William Harvey Research Institute, Barts and the London School of Medicine and Dentistry, Queen Mary University of London, London, United Kingdom
  - Centre for Experimental Medicine and Rheumatology, William Harvey Research Institute, Barts and the London School of Medicine and Dentistry, Queen Mary University of London, London, United Kingdom
- David S. Watson
  - Centre for Translational Bioinformatics, William Harvey Research Institute, Barts and the London School of Medicine and Dentistry, Queen Mary University of London, London, United Kingdom
  - Oxford Internet Institute, University of Oxford, Oxford, United Kingdom
- Patricia B. Munroe
  - Clinical Pharmacology, William Harvey Research Institute, Barts and The London School of Medicine and Dentistry, Queen Mary University of London, London, United Kingdom
  - NIHR Barts Biomedical Research Centre, Barts and The London School of Medicine and Dentistry, Queen Mary University of London, London, United Kingdom
- Michael R. Barnes
  - Clinical Pharmacology, William Harvey Research Institute, Barts and The London School of Medicine and Dentistry, Queen Mary University of London, London, United Kingdom
  - Centre for Translational Bioinformatics, William Harvey Research Institute, Barts and the London School of Medicine and Dentistry, Queen Mary University of London, London, United Kingdom
  - NIHR Barts Biomedical Research Centre, Barts and The London School of Medicine and Dentistry, Queen Mary University of London, London, United Kingdom
  - The Alan Turing Institute, British Library, London, United Kingdom
- Claudia P. Cabrera
  - Clinical Pharmacology, William Harvey Research Institute, Barts and The London School of Medicine and Dentistry, Queen Mary University of London, London, United Kingdom
  - Centre for Translational Bioinformatics, William Harvey Research Institute, Barts and the London School of Medicine and Dentistry, Queen Mary University of London, London, United Kingdom
  - NIHR Barts Biomedical Research Centre, Barts and The London School of Medicine and Dentistry, Queen Mary University of London, London, United Kingdom
|
38
|
Synthetic Minority Oversampling Technique for Optimizing Classification Tasks in Botnet and Intrusion-Detection-System Datasets. APPLIED SCIENCES-BASEL 2020. [DOI: 10.3390/app10030794] [Citation(s) in RCA: 28] [Impact Index Per Article: 5.6] [Indexed: 11/16/2022]
Abstract
Presently, security is a hot research topic due to its impact on daily information infrastructure. Machine-learning solutions have been improving classical detection practices, but detection tasks employ irregular amounts of data since the number of instances that represent one or several malicious samples can significantly vary. In highly unbalanced data, classification models regularly have high precision with respect to the majority class, while minority classes are considered noise due to the lack of information that they provide. Well-known datasets used for malware-based analyses like botnet attacks and Intrusion Detection Systems (IDS) mainly comprise logs, records, or network-traffic captures that do not provide an ideal source of evidence as a result of obtaining raw data. As an example, the numbers of abnormal and constant connections generated by either botnets or intruders within a network are considerably smaller than those from benign applications. In most cases, inadequate dataset design may degrade a learning algorithm, resulting in overfitting and poor classification rates. To address these problems, we propose a resampling method, the Synthetic Minority Oversampling Technique (SMOTE) with a grid-search algorithm optimization procedure. This work demonstrates classification-result improvements for botnet and IDS datasets by merging synthetically generated balanced data and tuning different supervised-learning algorithms.
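The core SMOTE step named in this abstract creates synthetic minority examples by interpolating between a minority point and one of its nearest minority neighbours. The sketch below illustrates only that interpolation idea in plain Python; it is not the authors' pipeline, the grid-search tuning is not shown, and the function name `smote_sample` and its parameters `k`, `n_new`, and `seed` are illustrative assumptions.

```python
import random


def smote_sample(minority, k=2, n_new=4, seed=0):
    """Return n_new synthetic points interpolated between minority points.

    minority: list of equal-length numeric tuples (the minority class).
    k: number of nearest minority neighbours considered for each base point.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority points to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority points by squared Euclidean distance to base.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation weight in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic


# Hypothetical minority class with three 2-D points.
minority_points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
print(smote_sample(minority_points, k=2, n_new=3))
```

Because every synthetic point lies on a segment between two real minority points, the new examples stay inside the region the minority class already occupies.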
|
39
|
Vervier K, Michaelson JJ. TiSAn: estimating tissue-specific effects of coding and non-coding variants. Bioinformatics 2019; 34:3061-3068. [PMID: 29912365 PMCID: PMC6137979 DOI: 10.1093/bioinformatics/bty301] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Received: 11/08/2017] [Accepted: 04/16/2018] [Indexed: 02/06/2023] Open
Abstract
Motivation: Model-based estimates of general deleteriousness, like CADD, DANN or PolyPhen, have become indispensable tools in the interpretation of genetic variants. However, these approaches say little about the tissues in which the effects of deleterious variants will be most meaningful. Tissue-specific annotations have been recently inferred for dozens of tissues/cell types from large collections of cross-tissue epigenomic data, and have demonstrated sensitivity in predicting affected tissues in complex traits. It remains unclear, however, whether including additional genome-scale data specific to the tissue of interest would appreciably improve functional annotations. Results: Herein, we introduce TiSAn, a tool that integrates multiple genome-scale data sources, defined by expert knowledge. TiSAn uses machine learning to discriminate variants relevant to a tissue from those with no bearing on the function of that tissue. Predictions are made genome-wide, and can be used to contextualize and filter variants of interest in whole genome sequencing or genome-wide association studies. We demonstrate the accuracy and flexibility of TiSAn by producing predictive models for human heart and brain, and detecting tissue-relevant variations in large cohorts for autism spectrum disorder (TiSAn-brain) and coronary artery disease (TiSAn-heart). We find the multiomics TiSAn model is better able to prioritize genetic variants according to their tissue-specific action than the current state-of-the-art method, GenoSkyLine. Availability and implementation: Software and vignettes are available at http://github.com/kevinVervier/TiSAn. Supplementary information: Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Kévin Vervier
- Department of Psychiatry, Carver College of Medicine, University of Iowa, Iowa City, IA, USA
| | - Jacob J Michaelson
- Department of Psychiatry, Carver College of Medicine, University of Iowa, Iowa City, IA, USA
|
40
|
Mossotto E, Ashton JJ, O'Gorman L, Pengelly RJ, Beattie RM, MacArthur BD, Ennis S. GenePy - a score for estimating gene pathogenicity in individuals using next-generation sequencing data. BMC Bioinformatics 2019; 20:254. [PMID: 31096927 PMCID: PMC6524327 DOI: 10.1186/s12859-019-2877-3] [Citation(s) in RCA: 25] [Impact Index Per Article: 4.2] [Received: 09/14/2018] [Accepted: 05/06/2019] [Indexed: 12/30/2022] Open
Abstract
BACKGROUND: Next-generation sequencing is revolutionising diagnosis and treatment of rare diseases; however, its application to understanding common disease aetiology is limited. Rare disease applications binarily attribute genetic change(s) at a single locus to a specific phenotype. In common diseases, where multiple genetic variants within and across genes contribute to disease, binary modelling cannot capture the burden of pathogenicity harboured by an individual across a given gene/pathway. We present GenePy, a novel gene-level scoring system for integration and analysis of next-generation sequencing data on a per-individual basis that transforms NGS data interpretation from variant-level to gene-level. This simple and flexible scoring system is intuitive and amenable to integration for machine learning, network and topological approaches, facilitating the investigation of complex phenotypes. RESULTS: Whole-exome sequencing data from 508 individuals were used to generate GenePy scores. For each variant a score is calculated incorporating: i) population allele frequency estimates; ii) individual zygosity, determined through standard variant calling pipelines; and iii) any user-defined deleteriousness metric to inform on functional impact. GenePy then combines scores generated for all variants observed into a single gene score for each individual. We generated a matrix of ~14,000 GenePy scores for all individuals for each of sixteen popular deleteriousness metrics. All per-gene scores are corrected for gene length. The majority of genes generate GenePy scores < 0.01, although individuals harbouring multiple rare highly deleterious mutations can accumulate extremely high GenePy scores. In the absence of a comparator metric, we examine GenePy performance in discriminating genes known to be associated with three common, complex diseases.
A Mann-Whitney U test conducted on GenePy scores for this positive control gene in cases versus controls demonstrates markedly more significant results (p = 1.37 × 10⁻⁴) compared to the most commonly applied association tool that combines common and rare variation (p = 0.003). CONCLUSIONS: Per-gene per-individual GenePy scores are intuitive when assessing genetic variation in individual patients or comparing scores between groups. GenePy outperforms the currently accepted best practice tools for combining common and rare variation. GenePy scores are suitable for downstream data integration with transcriptomic and proteomic data that also report at the gene level.
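The three per-variant inputs named in the abstract above (allele frequency, zygosity, deleteriousness) and the gene-length correction can be illustrated with a minimal sketch. The abstract does not state the formula, so the specific form used here is an assumption made for illustration only: each variant contributes its deleteriousness times the negative log10 of the product of the population frequencies of the two alleles the individual carries, and the summed score is divided by gene length in kilobases. The names `Variant`, `variant_score` and `gene_score` are hypothetical and are not taken from the GenePy software.

```python
import math
from dataclasses import dataclass


@dataclass
class Variant:
    alt_freq: float         # population frequency of the alternate allele, 0 < f < 1
    alt_count: int          # zygosity: 1 = heterozygous, 2 = homozygous alternate
    deleteriousness: float  # any user-chosen metric, assumed scaled to [0, 1]


def variant_score(v: Variant) -> float:
    """Assumed per-variant score: rarer and more deleterious genotypes score higher."""
    if not 0.0 < v.alt_freq < 1.0:
        raise ValueError("alt_freq must lie strictly between 0 and 1")
    if v.alt_count not in (1, 2):
        raise ValueError("alt_count must be 1 (het) or 2 (hom alt)")
    # Frequencies of the two alleles carried by this individual.
    first = v.alt_freq
    second = v.alt_freq if v.alt_count == 2 else 1.0 - v.alt_freq
    return -v.deleteriousness * math.log10(first * second)


def gene_score(variants: list[Variant], gene_length_bp: int) -> float:
    """Sum the scores of all variants observed in one gene for one individual,
    then correct for gene length (per kilobase, an assumed unit)."""
    if gene_length_bp <= 0:
        raise ValueError("gene_length_bp must be positive")
    raw = sum(variant_score(v) for v in variants)
    return raw / (gene_length_bp / 1000.0)


if __name__ == "__main__":
    observed = [
        Variant(alt_freq=0.01, alt_count=2, deleteriousness=0.5),
        Variant(alt_freq=0.20, alt_count=1, deleteriousness=0.1),
    ]
    print(gene_score(observed, gene_length_bp=2000))
```

Under this assumed form, a common or benign variant adds almost nothing, while several rare, highly deleterious variants in the same gene accumulate into a large score, which matches the qualitative behaviour the abstract reports.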
Affiliation(s)
- E Mossotto
- Department of Human Genetics and Genomic Medicine, University of Southampton, Southampton, UK.
- Institute for Life Sciences, University of Southampton, Southampton, UK.
| | - J J Ashton
- Department of Human Genetics and Genomic Medicine, University of Southampton, Southampton, UK
- Department of Paediatric Gastroenterology, Southampton Children's Hospital, Southampton, UK
| | - L O'Gorman
- Department of Human Genetics and Genomic Medicine, University of Southampton, Southampton, UK
| | - R J Pengelly
- Department of Human Genetics and Genomic Medicine, University of Southampton, Southampton, UK
- Institute for Life Sciences, University of Southampton, Southampton, UK
| | - R M Beattie
- Department of Paediatric Gastroenterology, Southampton Children's Hospital, Southampton, UK
| | - B D MacArthur
- Institute for Life Sciences, University of Southampton, Southampton, UK
| | - S Ennis
- Department of Human Genetics and Genomic Medicine, University of Southampton, Southampton, UK
|
41
|
A Brief Review of Random Forests for Water Scientists and Practitioners and Their Recent History in Water Resources. WATER 2019. [DOI: 10.3390/w11050910] [Citation(s) in RCA: 102] [Impact Index Per Article: 17.0] [Indexed: 01/25/2023]
Abstract
Random forests (RF) is a supervised machine learning algorithm, which has recently started to gain prominence in water resources applications. However, existing applications are generally restricted to the implementation of Breiman’s original algorithm for regression and classification problems, while numerous developments could also be useful in solving diverse practical problems in the water sector. Here we popularize RF and their variants for the practicing water scientist, and discuss related concepts and techniques, which have received less attention from the water science and hydrologic communities. In doing so, we review RF applications in water resources, highlight the potential of the original algorithm and its variants, and assess the degree of RF exploitation in a diverse range of applications. Relevant implementations of random forests, as well as related concepts and techniques in the R programming language, are also covered.
|
42
|
Weissenkampen JD, Jiang Y, Eckert S, Jiang B, Li B, Liu DJ. Methods for the Analysis and Interpretation for Rare Variants Associated with Complex Traits. CURRENT PROTOCOLS IN HUMAN GENETICS 2019; 101:e83. [PMID: 30849219 PMCID: PMC6455968 DOI: 10.1002/cphg.83] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Indexed: 12/19/2022]
Abstract
With the advent of Next Generation Sequencing (NGS) technologies, whole genome and whole exome DNA sequencing has become affordable for routine genetic studies. Coupled with improved genotyping arrays and genotype imputation methodologies, it is increasingly feasible to obtain rare genetic variant information in large datasets. Such datasets allow researchers to gain a more complete understanding of the genetic architecture of complex traits caused by rare variants. State-of-the-art methods for the statistical genetic analysis of sequence-based association, including efficient algorithms for association analysis in biobank-scale datasets, gene-association tests, meta-analysis, fine mapping methods that integrate functional genomic datasets, and phenome-wide association studies (PheWAS), are reviewed here. These methods are expected to be highly useful for next generation statistical genetics analysis in the era of precision medicine. © 2019 by John Wiley & Sons, Inc.
Affiliation(s)
- Yu Jiang
- Department of Public Health Sciences, Penn State College of Medicine, Hershey PA
| | - Scott Eckert
- Department of Public Health Sciences, Penn State College of Medicine, Hershey PA
| | - Bibo Jiang
- Department of Public Health Sciences, Penn State College of Medicine, Hershey PA
| | - Bingshan Li
- Department of Molecular Physiology and Biophysics, Vanderbilt Genetics Institute, Vanderbilt University, Nashville, TN
| | - Dajiang J. Liu
- Department of Public Health Sciences, Penn State College of Medicine, Hershey PA
|
43
|
PINES: phenotype-informed tissue weighting improves prediction of pathogenic noncoding variants. Genome Biol 2018; 19:173. [PMID: 30359302 PMCID: PMC6203199 DOI: 10.1186/s13059-018-1546-6] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.1] [Received: 09/26/2017] [Accepted: 09/19/2018] [Indexed: 12/17/2022] Open
Abstract
Functional characterization of the noncoding genome is essential for biological understanding of gene regulation and disease. Here, we introduce the computational framework PINES (Phenotype-Informed Noncoding Element Scoring), which predicts the functional impact of noncoding variants by integrating epigenetic annotations in a phenotype-dependent manner. PINES enables analyses to be customized towards genomic annotations from cell types of the highest relevance given the phenotype of interest. We illustrate that PINES identifies functional noncoding variation more accurately than methods that do not use phenotype-weighted knowledge, while at the same time being flexible and easy to use via a dedicated web portal.
|
44
|
Letter to the editor: Predicting central line-associated bloodstream infections and mortality using supervised machine learning. J Crit Care 2018; 46:162. [DOI: 10.1016/j.jcrc.2018.05.005] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/24/2018] [Accepted: 05/07/2018] [Indexed: 11/19/2022]
|
45
|
Brown N, Cambruzzi J, Cox PJ, Davies M, Dunbar J, Plumbley D, Sellwood MA, Sim A, Williams-Jones BI, Zwierzyna M, Sheppard DW. Big Data in Drug Discovery. PROGRESS IN MEDICINAL CHEMISTRY 2018; 57:277-356. [PMID: 29680150 DOI: 10.1016/bs.pmch.2017.12.003] [Citation(s) in RCA: 26] [Impact Index Per Article: 3.7] [Indexed: 12/02/2022]
Abstract
Interpretation of Big Data in the drug discovery community should enhance project timelines and reduce clinical attrition through improved early decision making. The issues we encounter start with the sheer volume of data and how we first ingest it before building an infrastructure to house it to make use of the data in an efficient and productive way. There are many problems associated with the data itself including general reproducibility, but often, it is the context surrounding an experiment that is critical to success. Help, in the form of artificial intelligence (AI), is required to understand and translate the context. On the back of natural language processing pipelines, AI is also used to prospectively generate new hypotheses by linking data together. We explain Big Data from the context of biology, chemistry and clinical trials, showcasing some of the impressive public domain sources and initiatives now available for interrogation.
Affiliation(s)
- Aaron Sim
- BenevolentAI, London, United Kingdom
| | - Magdalena Zwierzyna
- BenevolentAI, London, United Kingdom; Institute of Cardiovascular Science, University College London, London, United Kingdom
|