1
Trivedi SK, Roy AD, Kumar P, Jena D, Sinha A. Prediction of consumers refill frequency of LPG: A study using explainable machine learning. Heliyon 2024; 10:e23466. [PMID: 38205330 PMCID: PMC10776940 DOI: 10.1016/j.heliyon.2023.e23466] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/04/2023] [Revised: 11/21/2023] [Accepted: 12/05/2023] [Indexed: 01/12/2024] Open
Abstract
Launched in 2016, the PMUY Programme of the Government of India aimed to provide 8 crore LPG connections to women in rural households over four years. After acquiring a new connection, some households appeared uninterested in ordering subsequent subsidized LPG refills, which affected the programme's sustainability and targeting strategy. We propose a prediction model using "Explainable Machine Learning" to anticipate the beneficiaries' refill frequency with a view to improving LPG refills and social targeting. In this paper, we suggest an enhanced stacked SVM (ISS) model for classification, which is contrasted with state-of-the-art ML models: Random Forest (RF), SVM-RBF, Naive Bayes (NB), and Decision Tree (C5.0). The performance metrics used to evaluate the models include accuracy, sensitivity, specificity, Cohen's Kappa statistic, the Receiver Operating Characteristic (ROC) curve, and the area under the curve (AUC). The proposed approach, which was validated with 10-fold cross-validation, produced the best overall accuracies for data splits of 50-50, 66-34, and 80-20. The "Explainable AI (XAI)" model has also been used to describe how models and features interact, and to discuss the importance of features and their contributions to prediction. The recommended XAI will aid in efficient "beneficiary targeting" and "policy interventions".
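Several of the evaluation measures named in this abstract can be derived from a single 2x2 confusion matrix. The following is a minimal sketch, not the authors' evaluation code: the function name `binary_metrics` and the example counts are hypothetical, and it assumes both classes occur in the data so that no denominator is zero.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, sensitivity, specificity and Cohen's kappa from 2x2 counts."""
    n = tp + fp + fn + tn
    accuracy = (tp + tn) / n
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    # Agreement expected by chance, from the row and column marginals.
    p_yes = ((tp + fp) / n) * ((tp + fn) / n)
    p_no = ((fn + tn) / n) * ((fp + tn) / n)
    p_expected = p_yes + p_no
    kappa = (accuracy - p_expected) / (1 - p_expected)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "kappa": kappa,
    }


# Hypothetical counts, chosen only for illustration.
metrics = binary_metrics(tp=40, fp=10, fn=5, tn=45)
print(metrics)
```

Kappa corrects accuracy for the agreement expected by chance, which is why it is often reported alongside accuracy when class proportions are uneven.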
Affiliation(s)
- Shrawan Kumar Trivedi
  - Business Analytics and Information Systems Area, Rajiv Gandhi Institute of Petroleum Technology, Amethi, India
- Abhijit Deb Roy
  - Operations Management, Rajiv Gandhi Institute of Petroleum Technology, Amethi, India
- Praveen Kumar
  - Management Area, Rajiv Gandhi Institute of Petroleum Technology, Amethi, India
- Debashish Jena
  - Operations Management Area, Rajiv Gandhi Institute of Petroleum Technology, Amethi, India
- Avik Sinha
  - Management Development Institute Gurgaon, India
2
Gao M, Günther S. HyperCys: A Structure- and Sequence-Based Predictor of Hyper-Reactive Druggable Cysteines. Int J Mol Sci 2023; 24:ijms24065960. [PMID: 36983037 PMCID: PMC10054327 DOI: 10.3390/ijms24065960] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/13/2023] [Revised: 03/15/2023] [Accepted: 03/17/2023] [Indexed: 03/30/2023] Open
Abstract
The cysteine side chain has a free thiol group, making it the amino acid residue most often covalently modified by small molecules possessing weakly electrophilic warheads, thereby prolonging on-target residence time and reducing the risk of idiosyncratic drug toxicity. However, not all cysteines are equally reactive or accessible. Hence, to identify targetable cysteines, we propose a novel ensemble stacked machine learning (ML) model to predict hyper-reactive druggable cysteines, named HyperCys. First, the pocket, conservation, structural and energy profiles, and physicochemical properties of (non)covalently bound cysteines were collected from both protein sequences and 3D structures of protein-ligand complexes. Then, we established the HyperCys ensemble stacked model by integrating six different ML models, including K-nearest neighbors, support vector machine, light gradient boost machine, multi-layer perceptron classifier, random forest, and the meta-classifier model logistic regression. Finally, based on the hyper-reactive cysteines' classification accuracy and other metrics, the results for different feature group combinations were compared. The results show that the accuracy, F1 score, recall score, and ROC AUC values of HyperCys are 0.784, 0.754, 0.742, and 0.824, respectively, after performing 10-fold CV with the best window size. Compared to traditional ML models with only sequence-based features or only 3D structural features, HyperCys is more accurate at predicting hyper-reactive druggable cysteines. It is anticipated that HyperCys will be an effective tool for discovering new potential reactive cysteines in a wide range of nucleophilic proteins and will provide an important contribution to the design of targeted covalent inhibitors with high potency and selectivity.
Affiliation(s)
- Mingjie Gao
  - Institute of Pharmaceutical Sciences, Albert-Ludwigs-Universität Freiburg, Hermann-Herder-Straße 9, 79104 Freiburg, Germany
- Stefan Günther
  - Institute of Pharmaceutical Sciences, Albert-Ludwigs-Universität Freiburg, Hermann-Herder-Straße 9, 79104 Freiburg, Germany
3
Maroli N, Bhasuran B, Natarajan J, Kolandaivel P. The potential role of procyanidin as a therapeutic agent against SARS-CoV-2: a text mining, molecular docking and molecular dynamics simulation approach. J Biomol Struct Dyn 2022; 40:1230-1245. [PMID: 32960159 PMCID: PMC7544928 DOI: 10.1080/07391102.2020.1823887] [Citation(s) in RCA: 13] [Impact Index Per Article: 6.5] [Received: 07/19/2020] [Accepted: 09/10/2020] [Indexed: 02/07/2023]
Abstract
A novel coronavirus (SARS-CoV-2) has caused a major outbreak in humans all over the world. Several proteins interact during the entry and replication of this virus in humans. Here, we have used text mining and a named entity recognition method to identify co-occurrence of the important COVID-19 genes/proteins in the interaction network based on the frequency of the interaction. Network analysis revealed a set of genes/proteins, highly dense gene/protein clusters and sub-networks of Angiotensin-converting enzyme 2 (ACE2), helicase, spike (S) protein (trimeric), membrane (M) protein, envelope (E) protein, and the nucleocapsid (N) protein. The isolated proteins are screened against procyanidin, a flavonoid from plants, using molecular docking. Further, molecular dynamics simulations of critical proteins such as ACE2, Mpro and spike proteins are performed to elucidate the inhibition mechanism. The strong network of hydrogen bonds and hydrophobic interactions, along with van der Waals interactions, inhibits receptors that are essential to the entry and replication of SARS-CoV-2. The binding energy, which largely arises from van der Waals interactions and is calculated through molecular mechanics Poisson-Boltzmann surface area (ACE2 = -50.21 ± 6.3, Mpro = -89.50 ± 6.32 and spike = -23.06 ± 4.39), also confirms the affinity of procyanidin towards the critical receptors. Communicated by Ramaswamy H. Sarma.
Affiliation(s)
- Nikhil Maroli
  - Computational Biology Division, DRDO Center for Life Sciences, Bharathiar University Campus, Coimbatore, Tamil Nadu, India
  - Center for Condensed Matter Theory, Department of Physics, Indian Institute of Science, Bangalore, Karnataka, India
- Balu Bhasuran
  - Computational Biology Division, DRDO Center for Life Sciences, Bharathiar University Campus, Coimbatore, Tamil Nadu, India
- Jeyakumar Natarajan
  - Data Mining and Text Mining Laboratory, Department of Bioinformatics, Bharathiar University, Coimbatore, Tamil Nadu, India
4
Bhasuran B. Combining Literature Mining and Machine Learning for Predicting Biomedical Discoveries. Methods Mol Biol 2022; 2496:123-140. [PMID: 35713862 DOI: 10.1007/978-1-0716-2305-3_7] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 06/15/2023]
Abstract
The major outcomes and insights of scientific research and clinical study end up in the form of publications or clinical records in an unstructured text format. Due to advancements in biomedical research, the volume of published literature has grown tremendously in recent years. Scientists and clinical researchers face a big challenge in staying current with the knowledge and in extracting hidden information from the millions of published biomedical articles. The potential one-stop automated solution to this problem is biomedical literature mining. One of the long-standing goals in biology is to discover the disease-causing genes and their specific roles in personalized precision medicine and drug repurposing. However, the empirical approaches and clinical affirmation are expensive and time-consuming. An in silico approach using text mining to identify the disease-causing genes can contribute towards biomarker discovery. This chapter presents a protocol on combining literature mining and machine learning for predicting biomedical discoveries, with a special emphasis on gene-disease relation based discovery. The protocol is presented as a literature based discovery (LBD) pipeline for gene-disease based discovery. The protocol includes our web based tools: (1) DNER (Disease Named Entity Recognizer) for disease entity recognition, (2) BCCNER (Bidirectional, Contextual clues Named Entity Tagger) for gene/protein entity recognition, (3) DisGeReExT (Disease-Gene Relation Extractor) for statistically validated results and visualization, and (4) a newly introduced deep learning based method for association discovery. Our proposed deep learning based method can be generalized and applied to other important biomedical discoveries focusing on entities such as drug/chemical, or miRNA.
Affiliation(s)
- Balu Bhasuran
  - DRDO-BU Center for Life Sciences, Bharathiar University Campus, Coimbatore, Tamilnadu, India
  - Bakar Computational Health Sciences Institute, University of California, San Francisco, CA, USA
5
Abdulkadhar S, Natarajan J. A Text Mining Protocol for Mining Biological Pathways and Regulatory Networks from Biomedical Literature. Methods Mol Biol 2022; 2496:141-157. [PMID: 35713863 DOI: 10.1007/978-1-0716-2305-3_8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/15/2023]
Abstract
A biological pathway or regulatory network is a collection of molecular regulators that can activate changes in cellular processes, leading to an assembly of new molecules through a series of actions among the molecules. There are three important pathways in systems biology studies, namely signaling pathways, metabolic pathways, and genetic pathways (or gene regulatory networks). Recently, biological pathway construction from scientific literature has received much attention, as the scientific literature contains a rich set of linguistic features to extract biological associations between genes and proteins. These associations can be united to construct biological networks. Here, we present a brief overview of various biological pathways, biomedical text resources/corpora for network construction and state-of-the-art existing methods for network construction, followed by our hybrid text mining protocol for extracting pathways and regulatory networks from biomedical literature.
Affiliation(s)
- Sabenabanu Abdulkadhar
  - Data Mining and Text Mining Laboratory, Department of Bioinformatics, Bharathiar University, Coimbatore, Tamilnadu, India
- Jeyakumar Natarajan
  - Data Mining and Text Mining Laboratory, Department of Bioinformatics, Bharathiar University, Coimbatore, Tamilnadu, India
6
Shahhosseini M, Hu G, Pham H. Optimizing ensemble weights and hyperparameters of machine learning models for regression problems. Machine Learning with Applications 2022. [DOI: 10.1016/j.mlwa.2022.100251] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 10/19/2022] Open
7
Bhasuran B. BioBERT and Similar Approaches for Relation Extraction. Methods Mol Biol 2022; 2496:221-235. [PMID: 35713867 DOI: 10.1007/978-1-0716-2305-3_12] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/15/2023]
Abstract
In biomedicine, facts about relations between entities (disease, gene, drug, etc.) are hidden in the large trove of 30 million scientific publications. The curated information is proven to play an important role in various applications such as drug repurposing and precision medicine. Recently, due to the advancement in deep learning, a transformer architecture named BERT (Bidirectional Encoder Representations from Transformers) has been proposed. This pretrained language model, trained using the Books Corpus with 800M words and English Wikipedia with 2500M words, reported state-of-the-art results in various NLP (Natural Language Processing) tasks including relation extraction. It is a widely accepted notion that, due to the word distribution shift, general domain models exhibit poor performance in information extraction tasks of the biomedical domain. For this reason, the architecture was later adapted to the biomedical domain by training the language models using 28 million scientific articles from PubMed and PubMed Central. This chapter presents a protocol for relation extraction using BERT by discussing the state of the art for BERT versions in the biomedical domain such as BioBERT. The protocol emphasizes the general BERT architecture, pretraining and fine-tuning, leveraging biomedical information, and finally a knowledge graph infusion to the BERT model layer.
Affiliation(s)
- Balu Bhasuran
  - DRDO-BU Center for Life Sciences, Bharathiar University Campus, Coimbatore, Tamilnadu, India
  - Bakar Computational Health Sciences Institute, University of California, San Francisco, CA, USA
8
Active learning approach using a modified least confidence sampling strategy for named entity recognition. Progress in Artificial Intelligence 2021. [DOI: 10.1007/s13748-021-00230-w] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Indexed: 10/22/2022]
9
Abdulkadhar S, Bhasuran B, Natarajan J. Multiscale Laplacian graph kernel combined with lexico-syntactic patterns for biomedical event extraction from literature. Knowl Inf Syst 2020. [DOI: 10.1007/s10115-020-01514-8] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Indexed: 11/29/2022]
10
Ribeiro MHDM, Mariani VC, Coelho LDS. Multi-step ahead meningitis case forecasting based on decomposition and multi-objective optimization methods. J Biomed Inform 2020; 111:103575. [PMID: 32976990 PMCID: PMC7507988 DOI: 10.1016/j.jbi.2020.103575] [Citation(s) in RCA: 20] [Impact Index Per Article: 5.0] [Received: 12/03/2019] [Revised: 09/10/2020] [Accepted: 09/16/2020] [Indexed: 12/23/2022]
Abstract
Epidemiological time series forecasting plays an important role in public health systems, because it allows managers to develop strategic planning to avoid possible epidemics. In this paper, a hybrid learning framework is developed to forecast multi-step-ahead (one, two, and three-month-ahead) meningitis cases in four states of Brazil. First, the proposed approach applies an ensemble empirical mode decomposition (EEMD) to decompose the data into intrinsic mode functions and residual components. Then, each component is used as the input of five different forecasting models, and, from there, forecasted results are obtained. Finally, all combinations of models and components are developed, and for each case, the forecasted results are weighted integrated (WI) to formulate a heterogeneous ensemble forecaster for the monthly meningitis cases. In the final stage, a multi-objective optimization (MOO) using the Non-Dominated Sorting Genetic Algorithm – version II is employed to find a set of candidate weights, and then the Technique for Order of Preference by Similarity to Ideal Solution (TOPSIS) is applied to choose the adequate set of weights. Next, the most adequate model is the one with the best out-of-sample generalization capacity in terms of performance criteria including mean absolute error (MAE), relative root mean squared error (RRMSE), and symmetric mean absolute percentage error (sMAPE). By using MOO, the intention is to enhance the performance of the forecasting models by simultaneously improving their accuracy and stability measures. To assess the model's performance, comparisons based on metrics are conducted with: (i) EEMD, heterogeneous ensemble integrated by direct strategy, or simple sum; (ii) EEMD, homogeneous ensemble of components WI; (iii) models without signal decomposition. At this stage, MAE, RRMSE, and sMAPE criteria as well as the Diebold–Mariano statistical test are adopted.
In all twelve scenarios, the proposed framework produced more accurate and stable forecasts; in 89.17% of the cases, the errors of the proposed approach were statistically lower than those of the other approaches. These results showed that combining EEMD, heterogeneous ensemble and WI with weights obtained by optimization can develop precise and stable forecasts. The modeling developed in this paper is promising and can be used by managers to support decision making. Ensemble empirical mode decomposition is applied to the raw time series. Heterogeneous ensemble learning models are used to forecast meningitis cases. The NSGA-II algorithm and TOPSIS criterion are employed in the multi-objective procedure. The proposed model has errors statistically lower than 89.17% of the compared models. Promising results are achieved by the weighted integrated ensemble learning model.
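The weighted integration step and the three error criteria named in this abstract can be sketched as follows. This is an illustrative sketch under stated assumptions, not the paper's implementation: RRMSE is taken here as RMSE divided by the mean of the observed series, sMAPE as the mean of 2|y - f| / (|y| + |f|) expressed as a fraction, and all function names and numbers are hypothetical. Other definitions of both criteria exist.

```python
import math


def mae(actual, forecast):
    """Mean absolute error."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)


def rrmse(actual, forecast):
    """RMSE divided by the mean of the observed values (assumed definition)."""
    mse = sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual)
    return math.sqrt(mse) / (sum(actual) / len(actual))


def smape(actual, forecast):
    """Symmetric MAPE as a fraction (assumed definition, nonzero denominators)."""
    terms = (2 * abs(a - f) / (abs(a) + abs(f)) for a, f in zip(actual, forecast))
    return sum(terms) / len(actual)


def weighted_integration(component_forecasts, weights):
    """Combine per-component forecasts as a weighted sum at each time step."""
    steps = range(len(component_forecasts[0]))
    return [
        sum(w * comp[t] for w, comp in zip(weights, component_forecasts))
        for t in steps
    ]


# Hypothetical observed series and forecasts of two decomposed components.
actual = [10.0, 20.0, 30.0]
components = [[6.0, 12.0, 18.0], [5.0, 7.0, 11.0]]

# Weights of 1.0 reproduce the "direct strategy, or simple sum" baseline.
combined = weighted_integration(components, [1.0, 1.0])
print(combined, mae(actual, combined), rrmse(actual, combined), smape(actual, combined))
```

In the framework described above, the weights would come from the multi-objective search rather than being fixed by hand; the unit weights here only show the baseline case.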
Affiliation(s)
- Matheus Henrique Dal Molin Ribeiro
  - Graduate Program in Industrial & Systems Engineering (PPGEPS), Pontifical Catholic University of Parana (PUCPR), 1155, Rua Imaculada Conceicao, Curitiba, Parana, 80215-901, Brazil; Department of Mathematics, Federal Technological University of Parana (UTFPR), Via do Conhecimento, KM 01 - Fraron, Pato Branco, Parana, 85503-390, Brazil
- Viviana Cocco Mariani
  - Department of Electrical Engineering, Federal University of Parana (UFPR), 100, Avenida Cel. Francisco dos Santos, Curitiba, Parana, 81530-000, Brazil; Mechanical Engineering Graduate Program (PPGEM), Pontifical Catholic University of Parana (PUCPR), 1155, Rua Imaculada Conceicao, Curitiba, Parana, 80215-901, Brazil
- Leandro Dos Santos Coelho
  - Graduate Program in Industrial & Systems Engineering (PPGEPS), Pontifical Catholic University of Parana (PUCPR), 1155, Rua Imaculada Conceicao, Curitiba, Parana, 80215-901, Brazil; Department of Electrical Engineering, Federal University of Parana (UFPR), 100, Avenida Cel. Francisco dos Santos, Curitiba, Parana, 81530-000, Brazil
11
Perera N, Dehmer M, Emmert-Streib F. Named Entity Recognition and Relation Detection for Biomedical Information Extraction. Front Cell Dev Biol 2020; 8:673. [PMID: 32984300 PMCID: PMC7485218 DOI: 10.3389/fcell.2020.00673] [Citation(s) in RCA: 40] [Impact Index Per Article: 10.0] [Received: 12/12/2019] [Accepted: 07/02/2020] [Indexed: 12/29/2022] Open
Abstract
The number of scientific publications in the literature is steadily growing, containing our knowledge in the biomedical, health, and clinical sciences. Since there is currently no automatic archiving of the obtained results, much of this information remains buried in textual details not readily available for further usage or analysis. For this reason, natural language processing (NLP) and text mining methods are used for information extraction from such publications. In this paper, we review practices for Named Entity Recognition (NER) and Relation Detection (RD), allowing, e.g., to identify interactions between proteins and drugs or genes and diseases. This information can be integrated into networks to summarize large-scale details on a particular biomedical or clinical problem, which is then amenable for easy data management and further analysis. Furthermore, we survey novel deep learning methods that have recently been introduced for such tasks.
Affiliation(s)
- Nadeesha Perera
  - Predictive Society and Data Analytics Lab, Faculty of Information Technology and Communication Sciences, Tampere University, Tampere, Finland
- Matthias Dehmer
  - Department of Mechatronics and Biomedical Computer Science, University for Health Sciences, Medical Informatics and Technology (UMIT), Hall in Tirol, Austria
  - College of Artificial Intelligence, Nankai University, Tianjin, China
- Frank Emmert-Streib
  - Predictive Society and Data Analytics Lab, Faculty of Information Technology and Communication Sciences, Tampere University, Tampere, Finland
  - Faculty of Medicine and Health Technology, Institute of Biosciences and Medical Technology, Tampere University, Tampere, Finland
12
Shardlow M, Ju M, Li M, O'Reilly C, Iavarone E, McNaught J, Ananiadou S. A Text Mining Pipeline Using Active and Deep Learning Aimed at Curating Information in Computational Neuroscience. Neuroinformatics 2020; 17:391-406. [PMID: 30443819 PMCID: PMC6594987 DOI: 10.1007/s12021-018-9404-y] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Indexed: 12/30/2022]
Abstract
The curation of neuroscience entities is crucial to ongoing efforts in neuroinformatics and computational neuroscience, such as those being deployed in the context of continuing large-scale brain modelling projects. However, manually sifting through thousands of articles for new information about modelled entities is a painstaking and low-reward task. Text mining can be used to help a curator extract relevant information from this literature in a systematic way. We propose the application of text mining methods for the neuroscience literature. Specifically, two computational neuroscientists annotated a corpus of entities pertinent to neuroscience using active learning techniques to enable swift, targeted annotation. We then trained machine learning models to recognise the entities that have been identified. The entities covered are Neuron Types, Brain Regions, Experimental Values, Units, Ion Currents, Channels, and Conductances and Model organisms. We tested a traditional rule-based approach, a conditional random field and a model using deep learning named entity recognition, finding that the deep learning model was superior. Our final results show that we can detect a range of named entities of interest to the neuroscientist with a macro average precision, recall and F1 score of 0.866, 0.817 and 0.837 respectively. The contributions of this work are as follows: 1) We provide a set of Named Entity Recognition (NER) tools that are capable of detecting neuroscience entities with performance above or similar to prior work. 2) We propose a methodology for training NER tools for neuroscience that requires very little training data to get strong performance. This can be adapted for any sub-domain within neuroscience. 3) We provide a small corpus with annotations for multiple entity types, as well as annotation guidelines to help others reproduce our experiments.
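Macro-averaged precision, recall and F1, as reported in this abstract, are usually computed per entity type and then averaged with equal weight. The sketch below assumes that convention; the function name `macro_scores`, the entity labels and the counts are hypothetical and are not taken from the paper.

```python
def macro_scores(per_class_counts):
    """Macro-average precision, recall and F1.

    per_class_counts maps an entity type to a (tp, fp, fn) tuple.
    """
    precisions, recalls, f1s = [], [], []
    for tp, fp, fn in per_class_counts.values():
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)
    n = len(per_class_counts)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Hypothetical counts for two entity types.
counts = {"BrainRegion": (8, 2, 2), "NeuronType": (5, 5, 0)}
macro_p, macro_r, macro_f1 = macro_scores(counts)
print(macro_p, macro_r, macro_f1)
```

Because F1 is averaged per class, the macro F1 generally differs from the harmonic mean of the macro precision and macro recall. That is consistent with the figures above, where the harmonic mean of 0.866 and 0.817 is about 0.841 rather than the reported 0.837.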
Affiliation(s)
- Matthew Shardlow
  - The National Centre for Text Mining, School of Computer Science, University of Manchester, Oxford Road, Manchester, M13 9PL, UK
- Meizhi Ju
  - The National Centre for Text Mining, School of Computer Science, University of Manchester, Oxford Road, Manchester, M13 9PL, UK
- Maolin Li
  - The National Centre for Text Mining, School of Computer Science, University of Manchester, Oxford Road, Manchester, M13 9PL, UK
- Christian O'Reilly
  - Blue Brain Project, EPFL, Campus Biotech, Ch. des Mines 9, CH-1202, Geneva, Switzerland
- Elisabetta Iavarone
  - Blue Brain Project, EPFL, Campus Biotech, Ch. des Mines 9, CH-1202, Geneva, Switzerland
- John McNaught
  - The National Centre for Text Mining, School of Computer Science, University of Manchester, Oxford Road, Manchester, M13 9PL, UK
- Sophia Ananiadou
  - The National Centre for Text Mining, School of Computer Science, University of Manchester, Oxford Road, Manchester, M13 9PL, UK
13
Gajowniczek K, Grzegorczyk I, Ząbkowski T, Bajaj C. Weighted Random Forests to Improve Arrhythmia Classification. Electronics 2020; 9. [PMID: 32051761 PMCID: PMC7015067 DOI: 10.3390/electronics9010099] [Citation(s) in RCA: 14] [Impact Index Per Article: 3.5] [Indexed: 11/29/2022]
Abstract
Construction of an ensemble model is a process of combining many diverse base predictive learners. This raises the questions of how to weight each model and how to tune the parameters of the weighting process. The most straightforward approach is simply to average the base models. However, numerous studies have shown that a weighted ensemble can provide superior prediction results to a simple average of models. The main goals of this article are to propose a new weighting algorithm applicable to each tree in the Random Forest model and to comprehensively examine the optimal parameter tuning. Importantly, the approach is motivated by its flexibility, good performance, stability, and resistance to overfitting. The proposed scheme is examined and evaluated on the Physionet/Computing in Cardiology Challenge 2015 data set. It consists of signals (electrocardiograms and pulsatory waveforms) from intensive care patients which triggered an alarm for five cardiac arrhythmia types (Asystole, Bradycardia, Tachycardia, Ventricular Tachycardia, and Ventricular Flutter/Fibrillation). The classification problem concerns whether the alarm should or should not have been generated. It was shown that the proposed weighting approach improved classification accuracy for the three most challenging of the five investigated arrhythmias compared to the standard Random Forest model.
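The difference between a simple average (majority vote) and a weighted ensemble can be shown with a small voting sketch. The abstract does not describe how the per-tree weights are derived, so the weights below are hypothetical placeholders, as are the function name `weighted_vote` and the class labels.

```python
from collections import defaultdict


def weighted_vote(tree_predictions, tree_weights):
    """Return the label with the largest total weight across trees."""
    totals = defaultdict(float)
    for label, weight in zip(tree_predictions, tree_weights):
        totals[label] += weight
    return max(totals, key=totals.get)


# Hypothetical predictions of three trees for one alarm.
predictions = ["true_alarm", "false_alarm", "false_alarm"]

# Equal weights reduce to an ordinary majority vote.
uniform = weighted_vote(predictions, [1.0, 1.0, 1.0])

# Placeholder weights: a single highly weighted tree can overturn the majority.
weighted = weighted_vote(predictions, [2.5, 1.0, 1.0])

print(uniform, weighted)
```

With equal weights the two "false_alarm" trees win; once the first tree carries a weight of 2.5 against a combined 2.0, the decision flips.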
Affiliation(s)
- Krzysztof Gajowniczek
  - Department of Artificial Intelligence, Institute of Information Technology, Warsaw University of Life Sciences - SGGW, 02-776 Warsaw, Poland
  - Correspondence: Tel.: +48-506-746-850
- Iga Grzegorczyk
  - Department of Physics of Complex Systems, Faculty of Physics, Warsaw University of Technology, 00-662 Warsaw, Poland
- Tomasz Ząbkowski
  - Department of Artificial Intelligence, Institute of Information Technology, Warsaw University of Life Sciences - SGGW, 02-776 Warsaw, Poland
- Chandrajit Bajaj
  - Department of Computer Science, Institute for Computational Engineering and Sciences, University of Texas at Austin, Austin, TX 78712
14
Brasil S, Pascoal C, Francisco R, dos Reis Ferreira V, Videira PA, Valadão G. Artificial Intelligence (AI) in Rare Diseases: Is the Future Brighter? Genes (Basel) 2019; 10:genes10120978. [PMID: 31783696 PMCID: PMC6947640 DOI: 10.3390/genes10120978] [Citation(s) in RCA: 44] [Impact Index Per Article: 8.8] [Received: 09/30/2019] [Revised: 11/19/2019] [Accepted: 11/20/2019] [Indexed: 02/06/2023] Open
Abstract
The amount of data collected and managed in (bio)medicine is ever-increasing. Thus, there is a need to rapidly and efficiently collect, analyze, and characterize all this information. Artificial intelligence (AI), with an emphasis on deep learning, holds great promise in this area and is already being successfully applied to basic research, diagnosis, drug discovery, and clinical trials. Rare diseases (RDs), which are severely underrepresented in basic and clinical research, can particularly benefit from AI technologies. Of the more than 7000 RDs described worldwide, only 5% have a treatment. The ability of AI technologies to integrate and analyze data from different sources (e.g., multi-omics, patient registries, and so on) can be used to overcome RDs’ challenges (e.g., low diagnostic rates, reduced number of patients, geographical dispersion, and so on). Ultimately, RDs’ AI-mediated knowledge could significantly boost therapy development. Presently, there are AI approaches being used in RDs and this review aims to collect and summarize these advances. A section dedicated to congenital disorders of glycosylation (CDG), a particular group of orphan RDs that can serve as a potential study model for other common diseases and RDs, has also been included.
Collapse
Affiliation(s)
- Sandra Brasil
- Portuguese Association for CDG, 2820-381 Lisboa, Portugal; (S.B.); (C.P.); (R.F.); (P.A.V.)
- CDG & Allies—Professionals and Patient Associations International Network (CDG & Allies—PPAIN), Faculdade de Ciências e Tecnologia, Universidade NOVA de Lisboa, 2829-516 Lisboa, Portugal
| | - Carlota Pascoal
- Portuguese Association for CDG, 2820-381 Lisboa, Portugal; (S.B.); (C.P.); (R.F.); (P.A.V.)
- CDG & Allies—Professionals and Patient Associations International Network (CDG & Allies—PPAIN), Faculdade de Ciências e Tecnologia, Universidade NOVA de Lisboa, 2829-516 Lisboa, Portugal
- UCIBIO, Departamento Ciências da Vida, Faculdade de Ciências e Tecnologia, Universidade NOVA de Lisboa, 2829-516 Lisboa, Portugal
| | - Rita Francisco
- Portuguese Association for CDG, 2820-381 Lisboa, Portugal; (S.B.); (C.P.); (R.F.); (P.A.V.)
- CDG & Allies—Professionals and Patient Associations International Network (CDG & Allies—PPAIN), Faculdade de Ciências e Tecnologia, Universidade NOVA de Lisboa, 2829-516 Lisboa, Portugal
- UCIBIO, Departamento Ciências da Vida, Faculdade de Ciências e Tecnologia, Universidade NOVA de Lisboa, 2829-516 Lisboa, Portugal
| | - Vanessa dos Reis Ferreira
- Portuguese Association for CDG, 2820-381 Lisboa, Portugal; (S.B.); (C.P.); (R.F.); (P.A.V.)
- CDG & Allies—Professionals and Patient Associations International Network (CDG & Allies—PPAIN), Faculdade de Ciências e Tecnologia, Universidade NOVA de Lisboa, 2829-516 Lisboa, Portugal
- Correspondence:
| | - Paula A. Videira
- Portuguese Association for CDG, 2820-381 Lisboa, Portugal; (S.B.); (C.P.); (R.F.); (P.A.V.)
- CDG & Allies—Professionals and Patient Associations International Network (CDG & Allies—PPAIN), Faculdade de Ciências e Tecnologia, Universidade NOVA de Lisboa, 2829-516 Lisboa, Portugal
- UCIBIO, Departamento Ciências da Vida, Faculdade de Ciências e Tecnologia, Universidade NOVA de Lisboa, 2829-516 Lisboa, Portugal
| | - Gonçalo Valadão
- Instituto de Telecomunicações, 1049-001 Lisboa, Portugal;
- Departamento de Ciências e Tecnologias, Autónoma Techlab–Universidade Autónoma de Lisboa, 1169-023 Lisboa, Portugal
- Electronics, Telecommunications and Computers Engineering Department, Instituto Superior de Engenharia de Lisboa, 1959-007 Lisboa, Portugal
| |
|
15
|
Feng W, Dauphin G, Huang W, Quan Y, Liao W. New margin-based subsampling iterative technique in modified random forests for classification. Knowl Based Syst 2019. [DOI: 10.1016/j.knosys.2019.07.016] [Citation(s) in RCA: 33] [Impact Index Per Article: 6.6] [Indexed: 11/29/2022]
|
16
|
Nweke HF, Teh YW, Mujtaba G, Alo UR, Al-garadi MA. Multi-sensor fusion based on multiple classifier systems for human activity identification. Human-centric Computing and Information Sciences 2019. [DOI: 10.1186/s13673-019-0194-5] [Citation(s) in RCA: 33] [Impact Index Per Article: 6.6] [Indexed: 12/11/2022]
Abstract
Multimodal sensors in healthcare applications have been increasingly researched because they facilitate automatic and comprehensive monitoring of human behaviors, high-intensity sports management, energy expenditure estimation, and postural detection. Recent studies have shown the importance of multi-sensor fusion for achieving robustness and high-performance generalization, providing diversity, and tackling challenging issues that may be difficult to address with single sensor values. The aim of this study is to propose an innovative multi-sensor fusion framework to improve human activity detection performance and reduce the misrecognition rate. The study proposes a multi-view ensemble algorithm to integrate the predicted values of different motion sensors. To this end, computationally efficient classification algorithms such as decision tree, logistic regression and k-Nearest Neighbors were used to implement diverse, flexible and dynamic human activity detection systems. To provide a compact feature vector representation, we studied a hybrid bio-inspired evolutionary search algorithm and a correlation-based feature selection method and evaluated their impact on the feature vectors extracted from each individual sensor modality. Furthermore, we utilized the Synthetic Minority Over-sampling Technique (SMOTE) algorithm to reduce the impact of class imbalance and improve performance results. With the above methods, this paper provides a unified framework to resolve major challenges in human activity identification. The performance results obtained using two publicly available datasets showed significant improvement over baseline methods in the detection of specific activity details and a reduced error rate. The performance results of our evaluation showed 3% to 24% improvement in accuracy, recall, precision, F-measure and detection ability (AUC) compared to single sensors and feature-level fusion. The benefit of the proposed multi-sensor fusion is the ability to utilize distinct feature characteristics of individual sensors and multiple classifier systems to improve recognition accuracy. In addition, the study suggests a promising potential of the hybrid feature selection approach and diversity-based multiple classifier systems to improve mobile and wearable sensor-based human activity detection and health monitoring systems.
|
17
|
Lamurias A, Sousa D, Clarke LA, Couto FM. BO-LSTM: classifying relations via long short-term memory networks along biomedical ontologies. BMC Bioinformatics 2019; 20:10. [PMID: 30616557 PMCID: PMC6323831 DOI: 10.1186/s12859-018-2584-5] [Citation(s) in RCA: 23] [Impact Index Per Article: 4.6] [Received: 07/25/2018] [Accepted: 12/12/2018] [Indexed: 01/23/2023] Open
Abstract
BACKGROUND: Recent studies have proposed deep learning techniques, namely recurrent neural networks, to improve biomedical text mining tasks. However, these techniques rarely take advantage of existing domain-specific resources, such as ontologies. In Life and Health Sciences there is a vast and valuable set of such resources publicly available, which are continuously being updated. Biomedical ontologies are nowadays a mainstream approach to formalize existing knowledge about entities, such as genes, chemicals, phenotypes, and disorders. These resources contain supplementary information that may not yet be encoded in training data, particularly in domains with limited labeled data. RESULTS: We propose a new model to detect and classify relations in text, BO-LSTM, that takes advantage of domain-specific ontologies by representing each entity as the sequence of its ancestors in the ontology. We implemented BO-LSTM as a recurrent neural network with long short-term memory units, using open biomedical ontologies, specifically Chemical Entities of Biological Interest (ChEBI), Human Phenotype, and Gene Ontology. We assessed the performance of BO-LSTM with drug-drug interactions mentioned in a publicly available corpus from an international challenge, composed of 792 drug descriptions and 233 scientific abstracts. By using the domain-specific ontology in addition to word embeddings and WordNet, BO-LSTM improved the F1-score of both the detection and classification of drug-drug interactions, particularly in a document set with a limited number of annotations. We adapted an existing DDI extraction model with our ontology-based method, obtaining a higher F1-score than the original model. Furthermore, we developed and made available a corpus of 228 abstracts annotated with relations between genes and phenotypes, and demonstrated how BO-LSTM can be applied to other types of relations.
CONCLUSIONS: Our findings demonstrate that, besides the high performance of current deep learning techniques, domain-specific ontologies can still be useful to mitigate the lack of labeled data.
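Editorial illustration (not part of the cited abstract): the idea of representing an entity by the sequence of its ancestors can be sketched in a few lines of Python. The hierarchy `TOY_PARENT` and the function name `ancestor_sequence` are hypothetical, and the sketch assumes a single parent per term, whereas real ontologies such as ChEBI allow several parents. It is not the authors' implementation.

```python
# Toy is-a hierarchy (hypothetical; not taken from ChEBI or any real ontology).
# Each term maps to exactly one parent, which is a simplifying assumption.
TOY_PARENT = {
    "warfarin": "anticoagulant",
    "anticoagulant": "drug",
    "drug": "chemical entity",
}


def ancestor_sequence(entity, parent_of):
    """Return the path from the root of the hierarchy down to `entity`."""
    path = [entity]
    seen = {entity}
    # Walk upward until a term with no recorded parent (the root) is reached.
    while path[-1] in parent_of:
        parent = parent_of[path[-1]]
        if parent in seen:
            raise ValueError("cycle detected in hierarchy")
        seen.add(parent)
        path.append(parent)
    path.reverse()  # root first, entity last
    return path


print(ancestor_sequence("warfarin", TOY_PARENT))
```

A sequence like this could then be mapped to embeddings and fed to a recurrent layer alongside the word sequence; that step is omitted here.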
Affiliation(s)
- Andre Lamurias
- LASIGE, Faculdade de Ciências, Universidade de Lisboa, Lisboa, 1749 016 Portugal
- University of Lisboa, Faculty of Sciences, BioISI - Biosystems & Integrative Sciences Institute, Campo Grande, C8 bdg, Lisboa, 1749 016 Portugal
| | - Diana Sousa
- LASIGE, Faculdade de Ciências, Universidade de Lisboa, Lisboa, 1749 016 Portugal
| | - Luka A. Clarke
- University of Lisboa, Faculty of Sciences, BioISI - Biosystems & Integrative Sciences Institute, Campo Grande, C8 bdg, Lisboa, 1749 016 Portugal
| | - Francisco M. Couto
- LASIGE, Faculdade de Ciências, Universidade de Lisboa, Lisboa, 1749 016 Portugal
| |
|
18
|
Improving the Classification of Q&A Content for Android Fragmentation Using Named Entity Recognition. Progress in Artificial Intelligence 2019. [DOI: 10.1007/978-3-030-30244-3_60] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/26/2022]
|
19
|
Bhasuran B, Natarajan J. Automatic extraction of gene-disease associations from literature using joint ensemble learning. PLoS One 2018; 13:e0200699. [PMID: 30048465 PMCID: PMC6061985 DOI: 10.1371/journal.pone.0200699] [Citation(s) in RCA: 32] [Impact Index Per Article: 5.3] [Received: 01/30/2018] [Accepted: 07/02/2018] [Indexed: 12/26/2022] Open
Abstract
A wealth of knowledge concerning relations between genes and their associated diseases is present in biomedical literature. Mining these biological associations from literature can provide immense support to research ranging from drug-targetable pathways to biomarker discovery. However, the time and cost of manual curation slow this work down considerably. In this scenario, biomedical text mining is one of the crucial technologies, and relation extraction shows promising results for exploring research on genes associated with diseases. We addressed this problem from a text mining perspective by developing automatic extraction of gene-disease associations from the literature using joint ensemble learning. In the proposed work, we employ a supervised machine learning approach in which a rich feature set covering conceptual, syntactic and semantic properties, jointly learned with word embeddings, is used to train an ensemble support vector machine for extracting gene-disease relations from four gold standard corpora. On evaluation, the machine learning approach shows promising results of 85.34%, 83.93%, 87.39% and 85.57% F-measure on the EUADR, GAD, CoMAGC and PolySearch corpora, respectively. We strongly believe that the presented novel approach, combining a rich syntactic and semantic feature set with domain-specific word embeddings through ensemble support vector machines and evaluated on four gold standard corpora, can act as a new baseline for future work in gene-disease relation extraction from literature.
Affiliation(s)
- Balu Bhasuran
- DRDO-BU Center for Life Sciences, Bharathiar University Campus, Coimbatore, Tamilnadu, India
| | - Jeyakumar Natarajan
- DRDO-BU Center for Life Sciences, Bharathiar University Campus, Coimbatore, Tamilnadu, India
- Data mining and Text mining Laboratory, Department of Bioinformatics, Bharathiar University, Coimbatore, Tamilnadu, India
- * E-mail:
| |
|
20
|
Abd MT, Mohd M, Abd MT. Investigation of Data Representation Methods with Machine Learning Algorithms for Biomedical Named Entity Recognition. 2018 Fourth International Conference on Information Retrieval and Knowledge Management (CAMP) 2018. [DOI: 10.1109/infrkm.2018.8464816] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/01/2023]
|
21
|
Esteban S, Rodríguez Tablado M, Peper FE, Mahumud YS, Ricci RI, Kopitowski KS, Terrasa SA. Development and validation of various phenotyping algorithms for Diabetes Mellitus using data from electronic health records. Computer Methods and Programs in Biomedicine 2017; 152:53-70. [PMID: 29054261 DOI: 10.1016/j.cmpb.2017.09.009] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.1] [Received: 01/30/2017] [Revised: 08/19/2017] [Accepted: 09/13/2017] [Indexed: 06/07/2023]
Abstract
BACKGROUND AND OBJECTIVE: Recent progress towards precision medicine has encouraged the use of electronic health records (EHRs) as a source of large amounts of data, which are required for studying the effect of treatments or risk factors in more specific subpopulations. Phenotyping algorithms make it possible to automatically classify patients according to their particular electronic phenotype, thus facilitating the setup of retrospective cohorts. Our objective is to compare the performance of different classification strategies (only using standardized problems, rule-based algorithms, statistical learning algorithms (six learners) and stacked generalization (five versions)) for the categorization of patients according to their diabetic status (diabetics, not diabetics and inconclusive; Diabetes of any type) using information extracted from EHRs. METHODS: Patient information was extracted from the EHR at Hospital Italiano de Buenos Aires, Buenos Aires, Argentina. For the derivation and validation datasets, two probabilistic samples of patients from different years (2005: n = 1663; 2015: n = 800) were extracted. The only inclusion criterion was age (≥40 and <80 years). Four researchers manually reviewed all records and classified patients according to their diabetic status (diabetic: diabetes registered as a health problem or fulfilling the ADA criteria; non-diabetic: not fulfilling the ADA criteria and having at least one fasting glycemia below 126 mg/dL; inconclusive: no data regarding their diabetic status or only one abnormal value). The best performing algorithms within each strategy were tested on the validation set. RESULTS: The standardized codes algorithm achieved a Kappa coefficient value of 0.59 (95% CI 0.49, 0.59) in the validation set. The Boolean logic algorithm reached 0.82 (95% CI 0.76, 0.88). A slightly higher value was achieved by the Feedforward Neural Network (0.9, 95% CI 0.85, 0.94). The best performing learner was the stacked generalization meta-learner, which reached a Kappa coefficient value of 0.95 (95% CI 0.91, 0.98). CONCLUSIONS: The stacked generalization strategy and the feedforward neural network showed the best classification metrics in the validation set. Implementing these algorithms enables the data of thousands of patients to be exploited accurately.
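Editorial illustration (not part of the cited abstract): the Kappa coefficient reported above measures agreement between two labelings after correcting for agreement expected by chance, kappa = (p_o - p_e) / (1 - p_e). A minimal Python sketch follows. The function name `cohen_kappa` and the six example labels are made up for illustration and have nothing to do with the study's data.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items on which both labelings agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the marginal frequencies, summed over classes.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both labelings use one identical class throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical reviewer labels vs. algorithm labels
# (d = diabetic, n = non-diabetic, i = inconclusive).
reviewer = ["d", "d", "n", "n", "i", "d"]
algorithm = ["d", "n", "n", "n", "i", "d"]
print(round(cohen_kappa(reviewer, algorithm), 3))
```

With these toy labels the observed agreement is 5/6 and the chance agreement is 13/36, so kappa is 17/23.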
Affiliation(s)
- Santiago Esteban
- Family and Community Division, Hospital Italiano de Buenos Aires, Buenos Aires, Argentina; Research Department, Instituto Universitario Hospital Italiano de Buenos Aires, Buenos Aires, Argentina
| | | | - Francisco E Peper
- Family and Community Division, Hospital Italiano de Buenos Aires, Buenos Aires, Argentina
| | - Yamila S Mahumud
- Family and Community Division, Hospital Italiano de Buenos Aires, Buenos Aires, Argentina
| | - Ricardo I Ricci
- Family and Community Division, Hospital Italiano de Buenos Aires, Buenos Aires, Argentina
| | - Karin S Kopitowski
- Family and Community Division, Hospital Italiano de Buenos Aires, Buenos Aires, Argentina; Research Department, Instituto Universitario Hospital Italiano de Buenos Aires, Buenos Aires, Argentina
| | - Sergio A Terrasa
- Family and Community Division, Hospital Italiano de Buenos Aires, Buenos Aires, Argentina; Public Health Department, Instituto Universitario Hospital Italiano de Buenos Aires, Buenos Aires, Argentina
| |
|
22
|
Hosseini M, Jones J, Faiola A, Vreeman DJ, Wu H, Dixon BE. Reconciling disparate information in continuity of care documents: Piloting a system to consolidate structured clinical documents. J Biomed Inform 2017; 74:123-129. [PMID: 28903073 DOI: 10.1016/j.jbi.2017.09.001] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Received: 01/29/2017] [Revised: 07/07/2017] [Accepted: 09/02/2017] [Indexed: 11/19/2022]
Abstract
BACKGROUND: Due to the nature of information generation in health care, clinical documents contain duplicate and sometimes conflicting information. Recent implementation of Health Information Exchange (HIE) mechanisms, in which clinical summary documents are exchanged among disparate health care organizations, can proliferate duplicate and conflicting information. MATERIALS AND METHODS: To reduce information overload, a system to automatically consolidate information across multiple clinical summary documents was developed for an HIE network. The system receives any number of Continuity of Care Documents (CCDs) and outputs a single, consolidated record. To test the system, a randomly sampled corpus of 522 CCDs representing 50 unique patients was extracted from a large HIE network. The automated methods were compared to manual consolidation of information for three key sections of the CCD: problems, allergies, and medications. RESULTS: Manual consolidation of 11,631 entries was completed in approximately 150 h. The same data were automatically consolidated in 3.3 min. The system successfully consolidated 99.1% of problems, 87.0% of allergies, and 91.7% of medications. Almost all of the inaccuracies were caused by issues involving the use of standardized terminologies within the documents to represent individual information entries. CONCLUSION: This study presents a novel, tested tool for de-duplication and consolidation of CDA documents, which is a major step toward improving information access and interoperability among information systems. While more work is necessary, automated systems like the one evaluated in this study will be needed to meet the informatics needs of providers and health systems in the future.
Affiliation(s)
- Masoud Hosseini
- Department of BioHealth Informatics, School of Informatics and Computing at Indiana University-Purdue University Indianapolis, Walker Plaza (WK), 719 Indiana Avenue, WK 117, Indianapolis, IN 46202, United States
| | - Josette Jones
- Department of BioHealth Informatics, School of Informatics and Computing at Indiana University-Purdue University Indianapolis, Walker Plaza (WK), 719 Indiana Avenue, WK 117, Indianapolis, IN 46202, United States
| | - Anthony Faiola
- Biomedical and Health Information Sciences, College of Applied Health Sciences, The University of Illinois at Chicago, United States
| | | | - Huanmei Wu
- Department of BioHealth Informatics, School of Informatics and Computing at Indiana University-Purdue University Indianapolis, Walker Plaza (WK), 719 Indiana Avenue, WK 117, Indianapolis, IN 46202, United States
| | - Brian E Dixon
- Department of Epidemiology, Richard M. Fairbanks School of Public health, Indiana University-Purdue University Indianapolis, United States
| |
|