1
Blum J, Wood L, Turner R. Artificial intelligence in the detection of choledocholithiasis: a systematic review. HPB (Oxford) 2025; 27:1-9. [PMID: 39406631 DOI: 10.1016/j.hpb.2024.09.009] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/30/2024] [Revised: 09/10/2024] [Accepted: 09/19/2024] [Indexed: 01/06/2025]
Abstract
IMPORTANCE Choledocholithiasis is a potentially life-threatening manifestation of acute biliary dysfunction (ABD) often requiring magnetic resonance cholangiopancreatography (MRCP) for diagnosis when standard investigation findings are inconclusive. Machine learning models (MLMs) may offer alternatives to diagnose choledocholithiasis.
OBJECTIVE This systematic review seeks to evaluate the performance of MLMs in predicting choledocholithiasis and to compare this performance with the American Society for Gastrointestinal Endoscopy (ASGE) guidelines.
REVIEW This review adhered to PRISMA guidelines. Four databases were searched for relevant records published between January 2000 and April 2024. Two researchers appraised records. MLM performance and ASGE guideline efficacy were compared, and the clinical utility of MLMs was assessed.
FINDINGS 408 records were screened; eight were eligible. Model accuracy ranged from 19 % to 97 %. Several records demonstrated a moderate-to-high risk of bias; of those featuring low risk of bias, peak accuracies ranged from 70 % to 85 %. Most MLMs outperformed ASGE guidelines. Important predictor variables included age, total bilirubin, and common bile duct diameter.
CONCLUSIONS MLMs outperform ASGE guidelines in predicting choledocholithiasis. Nonetheless, biases in study design and reporting limit their prospective applicability. Current MLMs do not yet rival MRCP in detecting choledocholithiasis. Future guideline development should consider MLM-driven insights for better risk prediction.
Affiliation(s)
- Joshua Blum
- Department of General Surgery, Royal Hobart Hospital, Hobart, Tasmania, Australia; Tasmanian School of Medicine, University of Tasmania, Hobart, Tasmania, Australia.
- Lewis Wood
- Department of Orthopaedic Surgery, Royal Hobart Hospital, Hobart, Tasmania, Australia
- Richard Turner
- Department of General Surgery, Royal Hobart Hospital, Hobart, Tasmania, Australia; Tasmanian School of Medicine, University of Tasmania, Hobart, Tasmania, Australia
2
Singh J, Khanna NN, Rout RK, Singh N, Laird JR, Singh IM, Kalra MK, Mantella LE, Johri AM, Isenovic ER, Fouda MM, Saba L, Fatemi M, Suri JS. GeneAI 3.0: powerful, novel, generalized hybrid and ensemble deep learning frameworks for miRNA species classification of stationary patterns from nucleotides. Sci Rep 2024; 14:7154. [PMID: 38531923 PMCID: PMC11344070 DOI: 10.1038/s41598-024-56786-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/11/2023] [Accepted: 03/11/2024] [Indexed: 03/28/2024]
Abstract
Due to the intricate relationship between the small non-coding ribonucleic acid (miRNA) sequences, the classification of miRNA species, namely Human, Gorilla, Rat, and Mouse, is challenging. Previous methods are not robust and accurate. In this study, we present AtheroPoint's GeneAI 3.0, a powerful, novel, and generalized method for extracting features from the fixed patterns of purines and pyrimidines in each miRNA sequence in ensemble paradigms in machine learning (EML) and convolutional neural network (CNN)-based deep learning (EDL) frameworks. GeneAI 3.0 utilized five conventional (Entropy, Dissimilarity, Energy, Homogeneity, and Contrast) and three contemporary (Shannon entropy, Hurst exponent, Fractal dimension) features to generate a composite feature set from given miRNA sequences, which was then passed into our ML and DL classification framework. A set of 11 new classifiers was designed, consisting of 5 EML and 6 EDL, for binary/multiclass classification. It was benchmarked against 9 solo ML (SML), 6 solo DL (SDL), and 12 hybrid DL (HDL) models, for a total of 11 + 27 = 38 models. Four hypotheses were formulated and validated using explainable AI (XAI) as well as reliability/statistical tests. The order of the mean performance using accuracy (ACC)/area-under-the-curve (AUC) of the 24 DL classifiers was: EDL > HDL > SDL. The mean performance of EDL models with CNN layers was superior to that without CNN layers by 0.73%/0.92%. Mean performance of EML models was superior to SML models with improvements of ACC/AUC by 6.24%/6.46%. EDL models performed significantly better than EML models, with a mean increase in ACC/AUC of 7.09%/6.96%. The GeneAI 3.0 tool produced expected XAI feature plots, and the statistical tests showed significant p-values. Ensemble models with composite features are highly effective and generalized models for effectively classifying miRNA sequences.
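As a rough illustration of one feature named in this abstract, the sketch below maps a nucleotide sequence to a purine/pyrimidine pattern and computes the Shannon entropy of its overlapping k-mers. The function names, the k-mer formulation, and the example sequence are assumptions made for illustration; the abstract does not specify how GeneAI 3.0 computes this feature.

```python
import math
from collections import Counter

PURINES = {"A", "G"}


def purine_pyrimidine_pattern(seq):
    """Map each base to 'R' (purine: A/G) or 'Y' (anything else, e.g. C/U/T).

    Assumption: ambiguous symbols such as N are treated as 'Y' for simplicity.
    """
    return "".join("R" if base in PURINES else "Y" for base in seq.upper())


def shannon_entropy(symbols, k=2):
    """Shannon entropy in bits of the overlapping k-mers found in `symbols`."""
    kmers = [symbols[i:i + k] for i in range(len(symbols) - k + 1)]
    total = len(kmers)
    counts = Counter(kmers)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Illustrative RNA sequence (hypothetical input, not taken from the paper).
pattern = purine_pyrimidine_pattern("UGAGGUAGUAGGUUGUAUAGUU")
entropy = shannon_entropy(pattern, k=2)
```

A value like `entropy` would be one entry in a composite feature vector; the other features listed in the abstract (Hurst exponent, fractal dimension, and the texture-style measures) would be computed separately.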
Affiliation(s)
- Jaskaran Singh
- Department of Computer Science, Graphic Era Deemed to be University, Dehradun, Uttarakhand, India
- Narendra N Khanna
- Department of Cardiology, Indraprastha APOLLO Hospitals, New Delhi, India
- Ranjeet K Rout
- Department of Computer Science and Engineering, NIT Srinagar, Hazratbal, Srinagar, India
- Narpinder Singh
- Department of Food Science, Graphic Era Deemed to be University, Dehradun, Uttarakhand, India
- John R Laird
- Heart and Vascular Institute, Adventist Health St. Helena, St Helena, CA, USA
- Inder M Singh
- Advanced Cardiac and Vascular Institute, Sacramento, CA, USA
- Mannudeep K Kalra
- Department of Radiology, Massachusetts General Hospital, Boston, MA, 02115, USA
- Laura E Mantella
- Department of Biomedical and Molecular Sciences, Queen's University, Kingston, ON, Canada
- Amer M Johri
- Department of Biomedical and Molecular Sciences, Queen's University, Kingston, ON, Canada
- Esma R Isenovic
- Laboratory for Molecular Genetics and Radiobiology, University of Belgrade, Belgrade, Serbia
- Mostafa M Fouda
- Department of Electrical and Computer Engineering, Idaho State University, Pocatello, ID, 83209, USA
- Luca Saba
- Department of Neurology, University of Cagliari, Cagliari, Italy
- Mostafa Fatemi
- Department of Physiology and Biomedical Engineering, Mayo Clinic, Rochester, MN, 55905, USA
- Jasjit S Suri
- Stroke Monitoring and Diagnostic Division, AtheroPoint LLC, Roseville, CA, 95661, USA.
3
Li Y, Wu X, Fang D, Luo Y. Informing immunotherapy with multi-omics driven machine learning. NPJ Digit Med 2024; 7:67. [PMID: 38486092 PMCID: PMC10940614 DOI: 10.1038/s41746-024-01043-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/14/2023] [Accepted: 02/14/2024] [Indexed: 03/18/2024]
Abstract
Progress in sequencing technologies and clinical experiments has revolutionized immunotherapy for solid and hematologic malignancies. However, the benefits of immunotherapy are limited to specific patient subsets, posing challenges for broader application. To improve its effectiveness, identifying biomarkers that can predict patient response is crucial. Machine learning (ML) plays a pivotal role in harnessing multi-omic cancer datasets and unlocking new insights into immunotherapy. This review provides an overview of cutting-edge ML models applied to omics data for immunotherapy analysis, including immunotherapy response prediction and immunotherapy-relevant tumor microenvironment identification. We elucidate how ML leverages diverse data types to identify significant biomarkers, enhance our understanding of immunotherapy mechanisms, and optimize the decision-making process. Additionally, we discuss current limitations and challenges of ML in this rapidly evolving field. Finally, we outline future directions aimed at overcoming these barriers and improving the efficiency of ML in immunotherapy research.
Affiliation(s)
- Yawei Li
- Department of Preventive Medicine, Northwestern University, Feinberg School of Medicine, Chicago, IL, 60611, USA
- Center for Collaborative AI in Healthcare, Northwestern University, Feinberg School of Medicine, Chicago, IL, 60611, USA
- Xin Wu
- Department of Medicine, University of Illinois at Chicago, Chicago, IL, 60612, USA
- Deyu Fang
- Department of Pathology, Northwestern University Feinberg School of Medicine, Chicago, IL, 60611, USA
- Yuan Luo
- Department of Preventive Medicine, Northwestern University, Feinberg School of Medicine, Chicago, IL, 60611, USA.
- Center for Collaborative AI in Healthcare, Northwestern University, Feinberg School of Medicine, Chicago, IL, 60611, USA.
4
Paganini JA, Kerkvliet JJ, Vader L, Plantinga NL, Meneses R, Corander J, Willems RJL, Arredondo-Alonso S, Schürch AC. PlasmidEC and gplas2: an optimized short-read approach to predict and reconstruct antibiotic resistance plasmids in Escherichia coli. Microb Genom 2024; 10:001193. [PMID: 38376388 PMCID: PMC10926690 DOI: 10.1099/mgen.0.001193] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/19/2023] [Accepted: 01/22/2024] [Indexed: 02/21/2024]
Abstract
Accurate reconstruction of Escherichia coli antibiotic resistance gene (ARG) plasmids from Illumina sequencing data has proven to be a challenge with current bioinformatic tools. In this work, we present an improved method to reconstruct E. coli plasmids using short reads. We developed plasmidEC, an ensemble classifier that identifies plasmid-derived contigs by combining the output of three different binary classification tools. We showed that plasmidEC is especially suited to classify contigs derived from ARG plasmids with a high recall of 0.941. Additionally, we optimized gplas, a graph-based tool that bins plasmid-predicted contigs into distinct plasmid predictions. Gplas2 is more effective at recovering plasmids with large sequencing coverage variations and can be combined with the output of any binary classifier. The combination of plasmidEC with gplas2 showed a high completeness (median=0.818) and F1-Score (median=0.812) when reconstructing ARG plasmids and exceeded the binning capacity of the reference-based method MOB-suite. In the absence of long-read data, our method offers an excellent alternative to reconstruct ARG plasmids in E. coli.
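The ensemble step described in this abstract, combining three binary classifiers per contig, can be sketched as a simple majority vote. Majority voting is an assumption here, since the abstract only says the outputs are combined; the tool names, contig IDs, and labels below are hypothetical placeholders.

```python
def ensemble_label(votes):
    """Return 'plasmid' when more than half of the votes say 'plasmid'."""
    plasmid_votes = sum(1 for vote in votes if vote == "plasmid")
    return "plasmid" if plasmid_votes * 2 > len(votes) else "chromosome"


# Hypothetical per-contig outputs of three binary classification tools.
tool_outputs = {
    "tool_a": {"contig_1": "plasmid", "contig_2": "chromosome", "contig_3": "plasmid"},
    "tool_b": {"contig_1": "plasmid", "contig_2": "chromosome", "contig_3": "chromosome"},
    "tool_c": {"contig_1": "chromosome", "contig_2": "chromosome", "contig_3": "plasmid"},
}

# Combine the three predictions for every contig.
contigs = sorted(tool_outputs["tool_a"])
ensemble = {
    contig: ensemble_label([output[contig] for output in tool_outputs.values()])
    for contig in contigs
}
```

With an odd number of binary voters a tie cannot occur, which is one practical reason to combine three tools rather than two.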
Affiliation(s)
- Julian A. Paganini
- Department of Medical Microbiology, University Medical Center Utrecht, Utrecht, The Netherlands
- Jesse J. Kerkvliet
- Department of Medical Microbiology, University Medical Center Utrecht, Utrecht, The Netherlands
- Lisa Vader
- Department of Medical Microbiology, University Medical Center Utrecht, Utrecht, The Netherlands
- Nienke L. Plantinga
- Department of Medical Microbiology, University Medical Center Utrecht, Utrecht, The Netherlands
- Rodrigo Meneses
- Department of Medical Microbiology, University Medical Center Utrecht, Utrecht, The Netherlands
- Jukka Corander
- Department of Biostatistics, Faculty of Medicine, University of Oslo, Oslo, Norway
- Parasites and Microbes, Wellcome Sanger Institute, Cambridge, UK
- Helsinki Institute of Information Technology, Department of Mathematics and Statistics, University of Helsinki, Helsinki, Finland
- Rob J. L. Willems
- Department of Medical Microbiology, University Medical Center Utrecht, Utrecht, The Netherlands
- Sergio Arredondo-Alonso
- Department of Biostatistics, Faculty of Medicine, University of Oslo, Oslo, Norway
- Parasites and Microbes, Wellcome Sanger Institute, Cambridge, UK
- Anita C. Schürch
- Department of Medical Microbiology, University Medical Center Utrecht, Utrecht, The Netherlands
5
Yao N, Pan J, Chen X, Li P, Li Y, Wang Z, Yao T, Qian L, Yi D, Wu Y. Discovery of potential biomarkers for lung cancer classification based on human proteome microarrays using Stochastic Gradient Boosting approach. J Cancer Res Clin Oncol 2023; 149:6803-6812. [PMID: 36807761 DOI: 10.1007/s00432-023-04643-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/12/2022] [Accepted: 02/08/2023] [Indexed: 02/21/2023]
Abstract
PURPOSE Early identification of lung cancer (LC) will considerably facilitate the intervention and prevention of LC. The human proteome microarray approach can be used as a "liquid biopsy" to diagnose LC and complement conventional diagnosis, but it needs advanced bioinformatics methods such as feature selection (FS) and refined machine learning models.
METHODS A two-stage FS methodology combining Pearson's Correlation (PC) with a univariate filter (SBF) or recursive feature elimination (RFE) was used to reduce the redundancy of the original dataset. The Stochastic Gradient Boosting (SGB), Random Forest (RF), and Support Vector Machine (SVM) techniques were applied to build ensemble classifiers based on four subsets. The synthetic minority oversampling technique (SMOTE) was used in the preprocessing of imbalanced data.
RESULTS The FS approach with SBF and RFE extracted 25 and 55 features, respectively, with 14 overlapping ones. All three ensemble models demonstrated superior accuracy (ranging from 0.867 to 0.967) and sensitivity (0.917 to 1.00) in the test datasets, with SGB on the SBF subset outperforming the others. The SMOTE technique improved the model performance in the training process. Three of the top selected candidate biomarkers (LGR4, CDC34, and GHRHR) were highly suggested to play a role in lung tumorigenesis.
CONCLUSION A novel hybrid FS method with classical ensemble machine learning algorithms was first used in the classification of protein microarray data. The parsimonious model constructed by the SGB algorithm with the appropriate FS and SMOTE approach performs well in the classification task, with higher sensitivity and specificity. Standardization and innovation of bioinformatics approaches for protein microarray analysis need further exploration and validation.
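For readers unfamiliar with SMOTE, the sketch below shows its core idea: each synthetic minority sample is an interpolation between a minority point and one of its nearest minority neighbours. This is a minimal standard-library sketch, not the implementation used in the study; the function name, parameters, and toy data are assumptions.

```python
import math
import random


def smote(minority, n_synthetic, k=3, seed=0):
    """Generate n_synthetic points by interpolating between minority samples.

    `minority` is a list of equal-length numeric feature lists (at least two).
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest minority neighbours of the chosen base point (excluding itself).
        neighbours = sorted(
            (point for point in minority if point is not base),
            key=lambda point: math.dist(base, point),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy minority class with two features per sample (hypothetical values).
minority_samples = [[1.0, 2.0], [1.5, 1.8], [1.2, 2.4], [0.9, 2.1]]
new_samples = smote(minority_samples, n_synthetic=6, k=2)
```

Because every synthetic point lies on a segment between two real minority samples, oversampling should be applied to the training split only, otherwise interpolated copies of test samples can leak into training.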
Affiliation(s)
- Ning Yao
- Department of Health Statistics, College of Preventive Medicine, Army Medical University, No.30 Gaotanyan Street, Shapingba District, Chongqing, 400038, China
- Chongqing Center for Disease Control and Prevention, No.8 Changjiang 2nd Street, Yuzhong District, Chongqing, 400042, China
- Jianbo Pan
- Center for Novel Target and Therapeutic Intervention, Institute of Life Sciences, Chongqing Medical University, Chongqing, 400016, China
- Xicheng Chen
- Department of Health Statistics, College of Preventive Medicine, Army Medical University, No.30 Gaotanyan Street, Shapingba District, Chongqing, 400038, China
- Pengpeng Li
- Department of Health Statistics, College of Preventive Medicine, Army Medical University, No.30 Gaotanyan Street, Shapingba District, Chongqing, 400038, China
- Yang Li
- Department of Health Statistics, College of Preventive Medicine, Army Medical University, No.30 Gaotanyan Street, Shapingba District, Chongqing, 400038, China
- Zhenyan Wang
- Department of Health Statistics, College of Preventive Medicine, Army Medical University, No.30 Gaotanyan Street, Shapingba District, Chongqing, 400038, China
- Tianhua Yao
- Department of Health Statistics, College of Preventive Medicine, Army Medical University, No.30 Gaotanyan Street, Shapingba District, Chongqing, 400038, China
- Li Qian
- Department of Health Statistics, College of Preventive Medicine, Army Medical University, No.30 Gaotanyan Street, Shapingba District, Chongqing, 400038, China
- Dong Yi
- Department of Health Statistics, College of Preventive Medicine, Army Medical University, No.30 Gaotanyan Street, Shapingba District, Chongqing, 400038, China.
- Yazhou Wu
- Department of Health Statistics, College of Preventive Medicine, Army Medical University, No.30 Gaotanyan Street, Shapingba District, Chongqing, 400038, China.
6
Roman-Naranjo P, Parra-Perez AM, Lopez-Escamez JA. A systematic review on machine learning approaches in the diagnosis and prognosis of rare genetic diseases. J Biomed Inform 2023:104429. [PMID: 37352901 DOI: 10.1016/j.jbi.2023.104429] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 01/28/2023] [Revised: 06/05/2023] [Accepted: 06/17/2023] [Indexed: 06/25/2023]
Abstract
BACKGROUND The diagnosis of rare genetic diseases is often challenging due to the complexity of the genetic underpinnings of these conditions and the limited availability of diagnostic tools. Machine learning (ML) algorithms have the potential to improve the accuracy and speed of diagnosis by analyzing large amounts of genomic data and identifying complex multiallelic patterns that may be associated with specific diseases. In this systematic review, we aimed to identify the methodological trends and the ML application areas in rare genetic diseases.
METHODS We performed a systematic review of the literature following the PRISMA guidelines to search studies that used ML approaches to enhance the diagnosis of rare genetic diseases. Studies that used DNA-based sequencing data and a variety of ML algorithms were included, summarized, and analyzed using bibliometric methods, visualization tools, and a feature co-occurrence analysis.
FINDINGS Our search identified 22 studies that met the inclusion criteria. We found that exome sequencing was the most frequently used sequencing technology (59%), and rare neoplastic diseases were the most prevalent disease scenario (59%). In rare neoplasms, the most frequent applications of ML models were the differential diagnosis or stratification of patients (38.5%) and the identification of somatic mutations (30.8%). In other rare diseases, the most frequent goals were the prioritization of rare variants or genes (55.5%) and the identification of biallelic or digenic inheritance (33.3%). The most employed method was the random forest algorithm (54.5%). In addition, the features of the datasets needed for training these algorithms were distinctive depending on the goal pursued, including the mutational load in each gene for the differential diagnosis of patients, or the combination of genotype features and sequence-derived features (such as GC-content) for the identification of somatic mutations.
CONCLUSIONS ML algorithms based on sequencing data are mainly used for the diagnosis of rare neoplastic diseases, with random forest being the most common approach. We identified key features in the datasets used for training these ML models according to the objective pursued. These features can support the development of future ML models in the diagnosis of rare genetic diseases.
Affiliation(s)
- P Roman-Naranjo
- Division of Otolaryngology, Department of Surgery, Instituto de Investigación Biosanitaria, ibs.GRANADA, Universidad de Granada, Granada, Spain; Otology and Neurotology Group CTS495, Department of Genomic Medicine, GENYO - Centre for Genomics and Oncological Research - Pfizer, University of Granada, Junta de Andalucía, PTS, Granada, Spain; Sensorineural Pathology Programme, Centro de Investigación Biomédica en Red en Enfermedades Raras, CIBERER, Madrid, Spain.
- A M Parra-Perez
- Division of Otolaryngology, Department of Surgery, Instituto de Investigación Biosanitaria, ibs.GRANADA, Universidad de Granada, Granada, Spain; Otology and Neurotology Group CTS495, Department of Genomic Medicine, GENYO - Centre for Genomics and Oncological Research - Pfizer, University of Granada, Junta de Andalucía, PTS, Granada, Spain; Sensorineural Pathology Programme, Centro de Investigación Biomédica en Red en Enfermedades Raras, CIBERER, Madrid, Spain
- J A Lopez-Escamez
- Division of Otolaryngology, Department of Surgery, Instituto de Investigación Biosanitaria, ibs.GRANADA, Universidad de Granada, Granada, Spain; Otology and Neurotology Group CTS495, Department of Genomic Medicine, GENYO - Centre for Genomics and Oncological Research - Pfizer, University of Granada, Junta de Andalucía, PTS, Granada, Spain; Sensorineural Pathology Programme, Centro de Investigación Biomédica en Red en Enfermedades Raras, CIBERER, Madrid, Spain; Meniere's Disease Neuroscience Research Program, Faculty of Medicine & Health, School of Medical Sciences, The Kolling Institute, University of Sydney, Sydney, New South Wales, Australia
7
Pasha Syed AR, Anbalagan R, Setlur AS, Karunakaran C, Shetty J, Kumar J, Niranjan V. Implementation of ensemble machine learning algorithms on exome datasets for predicting early diagnosis of cancers. BMC Bioinformatics 2022; 23:496. [DOI: 10.1186/s12859-022-05050-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/25/2022] [Accepted: 11/10/2022] [Indexed: 11/19/2022]
Abstract
Classification of different cancer types is an essential step in designing a decision support model for early cancer predictions. Using various machine learning (ML) techniques with ensemble learning is one such method used for classifications. In the present study, various ML algorithms were explored on twenty exome datasets, belonging to 5 cancer types. Initially, a data clean-up was carried out on 4181 variants of cancer with 88 features, and a derivative dataset was obtained using natural language processing and probabilistic distribution. An exploratory dataset analysis using principal component analysis was then performed in 1D and 2D axes to reduce the high dimensionality of the data. To significantly reduce the imbalance in the derivative dataset, oversampling was carried out using SMOTE. Further, classification algorithms such as K-nearest neighbour (KNN) and support vector machine (SVM) were used initially on the oversampled dataset. A 4-layer artificial neural network model with 1D batch normalization was also designed to improve the model accuracy. Ensemble ML techniques such as bagging were then applied, using KNN, SVM and MLPs as base classifiers, to improve the weighted average performance metrics of the model. However, due to the small sample size, model improvement was challenging. Therefore, a novel method to augment the sample size using generative adversarial network (GAN) and triplet based variational auto encoder (TVAE) was employed that reconstructed the features and labels generating the data. The results showed that from initial scrutiny, KNN showed a weighted average of 0.74 and SVM 0.76. Oversampling ensured that the accuracy of the derivative dataset improved significantly and the ensemble classifier augmented the accuracy to 82.91%, when the data was divided into 70:15:15 ratio (training, test and holdout datasets). The overall evaluation metric value when GAN and TVAE increased the sample size was found to be 0.92 with an overall comparison model of 0.66. Therefore, the present study designed an effective model for classifying cancers which, when applied to real-world samples, will play a major role in early cancer diagnosis.
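The 70:15:15 division into training, test and holdout sets mentioned in this abstract can be sketched as a shuffled three-way split. The function name, the fixed seed, and the 100-item toy dataset are assumptions for illustration; the abstract does not say whether the authors stratified by cancer type or how they shuffled.

```python
import random


def split_70_15_15(items, seed=0):
    """Shuffle `items` and split them into training, test and holdout lists."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = n * 70 // 100  # integer arithmetic avoids float rounding surprises
    n_test = n * 15 // 100
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    holdout = shuffled[n_train + n_test:]  # remainder, roughly 15%
    return train, test, holdout


# Hypothetical dataset of 100 sample identifiers.
train, test, holdout = split_70_15_15(range(100))
```

Any oversampling or generative augmentation would normally be fitted on the training portion only, so that the test and holdout sets contain real samples exclusively.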
8
Li Y, Wu X, Yang P, Jiang G, Luo Y. Machine Learning for Lung Cancer Diagnosis, Treatment, and Prognosis. GENOMICS, PROTEOMICS & BIOINFORMATICS 2022; 20:850-866. [PMID: 36462630 PMCID: PMC10025752 DOI: 10.1016/j.gpb.2022.11.003] [Citation(s) in RCA: 72] [Impact Index Per Article: 24.0] [Received: 03/04/2022] [Revised: 10/03/2022] [Accepted: 11/17/2022] [Indexed: 12/03/2022]
Abstract
The recent development of imaging and sequencing technologies enables systematic advances in the clinical study of lung cancer. Meanwhile, the human mind is limited in effectively handling and fully utilizing the accumulation of such enormous amounts of data. Machine learning-based approaches play a critical role in integrating and analyzing these large and complex datasets, which have extensively characterized lung cancer through the use of different perspectives from these accrued data. In this review, we provide an overview of machine learning-based approaches that strengthen the varying aspects of lung cancer diagnosis and therapy, including early detection, auxiliary diagnosis, prognosis prediction, and immunotherapy practice. Moreover, we highlight the challenges and opportunities for future applications of machine learning in lung cancer.
Affiliation(s)
- Yawei Li
- Department of Preventive Medicine, Feinberg School of Medicine, Northwestern University, Chicago, IL 60611, USA
- Xin Wu
- Department of Medicine, University of Illinois at Chicago, Chicago, IL 60612, USA
- Ping Yang
- Department of Quantitative Health Sciences, Mayo Clinic, Rochester, MN 55905 / Scottsdale, AZ 85259, USA
- Guoqian Jiang
- Department of Artificial Intelligence and Informatics, Mayo Clinic, Rochester, MN 55905, USA
- Yuan Luo
- Department of Preventive Medicine, Feinberg School of Medicine, Northwestern University, Chicago, IL 60611, USA.
9
Chen X, Chen S, Song S, Gao Z, Hou L, Zhang X, Lv H, Jiang R. Cell type annotation of single-cell chromatin accessibility data via supervised Bayesian embedding. NAT MACH INTELL 2022. [DOI: 10.1038/s42256-021-00432-w] [Citation(s) in RCA: 45] [Impact Index Per Article: 15.0] [Indexed: 12/13/2022]
10
Krishnakumar R, Ruffing AM. OperonSEQer: A set of machine-learning algorithms with threshold voting for detection of operon pairs using short-read RNA-sequencing data. PLoS Comput Biol 2022; 18:e1009731. [PMID: 34986143 PMCID: PMC8765615 DOI: 10.1371/journal.pcbi.1009731] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 07/29/2021] [Revised: 01/18/2022] [Accepted: 12/07/2021] [Indexed: 11/19/2022]
Abstract
Operon prediction in prokaryotes is critical not only for understanding the regulation of endogenous gene expression, but also for exogenous targeting of genes using newly developed tools such as CRISPR-based gene modulation. A number of methods have used transcriptomics data to predict operons, based on the premise that contiguous genes in an operon will be expressed at similar levels. While promising results have been observed using these methods, most of them do not address uncertainty caused by technical variability between experiments, which is especially relevant when the amount of data available is small. In addition, many existing methods do not provide the flexibility to determine the stringency with which genes should be evaluated for being in an operon pair. We present OperonSEQer, a set of machine learning algorithms that uses the statistic and p-value from a non-parametric analysis of variance test (Kruskal-Wallis) to determine the likelihood that two adjacent genes are expressed from the same RNA molecule. We implement a voting system to allow users to choose the stringency of operon calls depending on whether their priority is high recall or high specificity. In addition, we provide the code so that users can retrain the algorithm and re-establish hyperparameters based on any data they choose, allowing for this method to be expanded as additional data is generated. We show that our approach detects operon pairs that are missed by current methods by comparing our predictions to publicly available long-read sequencing data. OperonSEQer therefore improves on existing methods in terms of accuracy, flexibility, and adaptability.
Bacteria and archaea, single-cell organisms collectively known as prokaryotes, live in all imaginable environments and comprise the majority of living organisms on this planet. Prokaryotes play a critical role in the homeostasis of multicellular organisms (such as animals and plants) and ecosystems. In addition, bacteria can be pathogenic and cause a variety of diseases in these same hosts and ecosystems. In short, understanding the biology and molecular functions of bacteria and archaea and devising mechanisms to engineer and optimize their properties are critical scientific endeavors with significant implications in healthcare, agriculture, manufacturing, and climate science among others. One major molecular difference between unicellular and multicellular organisms is the way they express genes: multicellular organisms make individual RNA molecules for each gene, while prokaryotes express operons (i.e., a group of genes coding functionally related proteins) in contiguous polycistronic RNA molecules. Understanding which genes exist within operons is critical for elucidating basic biology and for engineering organisms. In this work, we use a combination of statistical and machine learning-based methods to use next-generation sequencing data to predict operon structure across a range of prokaryotes. Our method provides an easily implemented, robust, accurate, and flexible way to determine operon structure in an organism-agnostic manner using readily available data.
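The voting system described in this abstract can be illustrated with a small threshold-voting sketch: each model casts a binary vote per adjacent gene pair, and a pair is called an operon pair only if enough models agree. The function name, the six-model setup, the gene names, and the votes are hypothetical; the sketch does not reproduce the OperonSEQer code or its Kruskal-Wallis features.

```python
def call_operon_pairs(model_votes, threshold):
    """Call a gene pair an operon pair when at least `threshold` models vote 1."""
    return {pair: sum(votes) >= threshold for pair, votes in model_votes.items()}


# Hypothetical votes (1 = same operon) from six models for three adjacent gene pairs.
model_votes = {
    "geneA-geneB": [1, 1, 1, 1, 1, 0],
    "geneB-geneC": [1, 1, 0, 0, 0, 0],
    "geneC-geneD": [1, 1, 1, 0, 0, 0],
}

lenient_calls = call_operon_pairs(model_votes, threshold=2)  # favours recall
strict_calls = call_operon_pairs(model_votes, threshold=5)   # favours specificity
```

Lowering the threshold accepts pairs supported by only a few models, which raises recall; raising it keeps only pairs that most models agree on, which trades recall for specificity.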
Affiliation(s)
- Raga Krishnakumar
- Systems Biology Department, Sandia National Laboratories, Livermore, California, United States of America
- Anne M. Ruffing
- Molecular and Microbiology Department, Sandia National Laboratories, Albuquerque, New Mexico, United States of America
11
Xiong Y, Ye M, Wu C. Cancer Classification with a Cost-Sensitive Naive Bayes Stacking Ensemble. COMPUTATIONAL AND MATHEMATICAL METHODS IN MEDICINE 2021; 2021:5556992. [PMID: 33986823 PMCID: PMC8093037 DOI: 10.1155/2021/5556992] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Received: 01/06/2021] [Revised: 03/17/2021] [Accepted: 04/15/2021] [Indexed: 02/07/2023]
Abstract
Ensemble learning combines multiple learners to perform combinatorial learning, which has advantages of good flexibility and higher generalization performance. To achieve higher quality cancer classification, in this study, the fast correlation-based feature selection (FCBF) method was used to preprocess the data to eliminate irrelevant and redundant features. Then, the classification was carried out in the stacking ensemble learner. A library for support vector machine (LIBSVM), K-nearest neighbor (KNN), decision tree C4.5 (C4.5), and random forest (RF) were used as the primary learners of the stacking ensemble. Given the imbalanced characteristics of cancer gene expression data, the embedding cost-sensitive naive Bayes was used as the metalearner of the stacking ensemble, which was represented as CSNB stacking. The proposed CSNB stacking method was applied to nine cancer datasets to further verify the classification performance of the model. Compared with other classification methods, such as single classifier algorithms and ensemble algorithms, the experimental results showed the effectiveness and robustness of the proposed method in processing different types of cancer data. This method may therefore help guide cancer diagnosis and research.
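One common way to make a probabilistic classifier such as naive Bayes cost-sensitive is to predict the class with the lowest expected misclassification cost instead of the highest posterior. The sketch below shows that decision rule only; whether the paper's embedded cost-sensitive naive Bayes uses exactly this rule is not stated in the abstract, and the class names, posteriors, and cost values are hypothetical.

```python
def min_expected_cost_class(posteriors, cost):
    """Pick the prediction that minimises expected misclassification cost.

    posteriors: dict mapping class -> P(class | x)
    cost: nested dict where cost[true_class][predicted_class] is the penalty
    """
    classes = list(posteriors)

    def expected_cost(predicted):
        return sum(posteriors[true] * cost[true][predicted] for true in classes)

    return min(classes, key=expected_cost)


# Hypothetical posteriors from a meta-learner for one sample.
posteriors = {"tumor": 0.3, "normal": 0.7}

# Missing a tumor is penalised five times more than a false alarm (assumed values).
asymmetric_cost = {
    "tumor": {"tumor": 0, "normal": 5},
    "normal": {"tumor": 1, "normal": 0},
}
uniform_cost = {
    "tumor": {"tumor": 0, "normal": 1},
    "normal": {"tumor": 1, "normal": 0},
}

cost_sensitive_prediction = min_expected_cost_class(posteriors, asymmetric_cost)
plain_prediction = min_expected_cost_class(posteriors, uniform_cost)
```

With uniform costs the rule reduces to choosing the most probable class, while the asymmetric costs shift the decision toward the minority class, which is the behaviour wanted for imbalanced cancer gene expression data.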
Affiliation(s)
- Yueling Xiong
- School of Medical Information, Wannan Medical College, Wuhu 241002, China
- Mingquan Ye
- School of Medical Information, Wannan Medical College, Wuhu 241002, China
- Changrong Wu
- School of Computer and Information, Anhui Normal University, Wuhu 241002, China