1
Batisti Biffignandi G, Chindelevitch L, Corbella M, Feil EJ, Sassera D, Lees JA. Optimising machine learning prediction of minimum inhibitory concentrations in Klebsiella pneumoniae. Microb Genom 2024; 10:001222. [PMID: 38529944 PMCID: PMC10995625 DOI: 10.1099/mgen.0.001222] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/23/2023] [Accepted: 03/07/2024] [Indexed: 03/27/2024]
Abstract
Minimum inhibitory concentrations (MICs) are the gold standard for quantitatively measuring antibiotic resistance. However, lab-based MIC determination can be time-consuming and suffers from low reproducibility, and interpretation as sensitive or resistant relies on guidelines that change over time. Genome sequencing and machine learning promise to allow in silico MIC prediction as an alternative approach that overcomes some of these difficulties, although the MIC must still be interpreted. Nevertheless, precisely how we should handle MIC data when dealing with predictive models remains unclear, since MICs are measured semi-quantitatively, with varying resolution, and are typically also left- and right-censored within varying ranges. We therefore investigated genome-based prediction of MICs in the pathogen Klebsiella pneumoniae using 4367 genomes with both simulated semi-quantitative traits and real MICs. As we were focused on clinical interpretation, we used interpretable rather than black-box machine learning models, namely Elastic Net, Random Forests, and linear mixed models. Simulated traits were generated accounting for oligogenic, polygenic, and homoplastic genetic effects with different levels of heritability. We then assessed how model prediction accuracy was affected when MIC prediction was framed as regression versus classification. Our results showed that treating the MICs differently depending on the number of antibiotic concentration levels available was the most promising learning strategy. Specifically, to optimise both prediction accuracy and inference of the correct causal variants, we recommend considering the MICs as continuous and framing the learning problem as a regression when the number of observed antibiotic concentration levels is large, whereas with a smaller number of concentration levels they should be treated as a categorical variable and the learning problem should be framed as a classification.
Our findings also underline how predictive models can be improved when prior biological knowledge is taken into account, due to the varying genetic architecture of each antibiotic resistance trait. Finally, we emphasise that expanding the population database is pivotal for the future clinical implementation of these models to support routine machine-learning-based diagnostics.
Affiliation(s)
- Gherard Batisti Biffignandi
  - Department of Biology and Biotechnology, University of Pavia, Pavia, Italy
  - MRC Centre for Global Infectious Disease Analysis, Imperial College, London, England, UK
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, UK
- Leonid Chindelevitch
  - MRC Centre for Global Infectious Disease Analysis, Imperial College, London, England, UK
- Marta Corbella
  - Microbiology and Virology Unit, Fondazione IRCCS Policlinico San Matteo, Pavia, Italy
- Edward J. Feil
  - The Milner Centre for Evolution, Department of Life Sciences, University of Bath, Bath, UK
- Davide Sassera
  - Department of Biology and Biotechnology, University of Pavia, Pavia, Italy
  - Fondazione IRCCS Policlinico San Matteo, Pavia, Italy
- John A. Lees
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, UK
2
Mallawaarachchi S, Tonkin-Hill G, Croucher NJ, Turner P, Speed D, Corander J, Balding D. Genome-wide association, prediction and heritability in bacteria with application to Streptococcus pneumoniae. NAR Genom Bioinform 2022; 4:lqac011. [PMID: 35211669 PMCID: PMC8862724 DOI: 10.1093/nargab/lqac011] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 11/16/2021] [Revised: 01/06/2022] [Accepted: 02/01/2022] [Indexed: 11/14/2022]
Abstract
Whole-genome sequencing has facilitated genome-wide analyses of association, prediction and heritability in many organisms. However, such analyses in bacteria are still in their infancy, being limited by difficulties including genome plasticity and strong population structure. Here we propose a suite of methods including linear mixed models, elastic net and LD-score regression, adapted to bacterial traits using innovations such as frequency-based allele coding, both insertion/deletion and nucleotide testing and heritability partitioning. We compare and validate our methods against the current state of the art using simulations, and analyse three phenotypes of the major human pathogen Streptococcus pneumoniae, including the first analyses of minimum inhibitory concentrations (MIC) for penicillin and ceftriaxone. We show that the MIC traits are highly heritable with high prediction accuracy, explained by many genetic associations under good population structure control. For ceftriaxone MIC, this is surprising because none of the isolates are resistant as per the inhibition zone criteria. We estimate that half of the heritability of penicillin MIC is explained by a known drug-resistance region, which also contributes a quarter of the ceftriaxone MIC heritability. For the within-host carriage duration phenotype, no associations were observed, but the moderate heritability and prediction accuracy indicate a moderately polygenic trait.
Affiliation(s)
- Gerry Tonkin-Hill
  - Parasites and Microbes, Wellcome Sanger Institute, Cambridge CB10 1SA, UK
- Nicholas J Croucher
  - Faculty of Medicine, School of Public Health, Imperial College, London SW7 2AZ, UK
- Paul Turner
  - Cambodia-Oxford Medical Research Unit, Angkor Hospital for Children, Siem Reap 1710, Cambodia
  - Centre for Tropical Medicine and Global Health, Nuffield Department of Medicine, University of Oxford, Oxford OX3 7LG, UK
- Doug Speed
  - Aarhus Institute of Advanced Studies (AIAS), Aarhus University, 8000 Aarhus, Denmark
  - Bioinformatics Research Centre, Aarhus University, 8000 Aarhus, Denmark
  - UCL Genetics Institute, University College London, London WC1E 6BT, United Kingdom
- Jukka Corander
  - Parasites and Microbes, Wellcome Sanger Institute, Cambridge CB10 1SA, UK
  - Department of Biostatistics, Faculty of Medicine, University of Oslo, 0372 Oslo, Norway
  - Helsinki Institute of Information Technology, Department of Mathematics and Statistics, University of Helsinki, Helsinki 00014, Finland
- David Balding
  - Correspondence may also be addressed to David Balding.
3
Predicting Heritability of Oil Palm Breeding Using Phenotypic Traits and Machine Learning. SUSTAINABILITY 2021. [DOI: 10.3390/su132212613] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 01/01/2023]
Abstract
Oil palm is one of the main crops grown to help achieve sustainability in Malaysia. The selection of the best breeds will produce quality crops and increase crop yields. This study aimed to examine machine learning (ML) in oil palm breeding (OPB) using factors other than genetic data. A new conceptual framework for adopting ML in OPB is presented at the end of this paper. First, data types, phenotypic traits, current ML models, and evaluation techniques are identified through a literature survey. This study found that phenotype and genotype data are widely used in oil palm breeding programs. The average bunch weight, bunch number, and fresh fruit bunch are the most important characteristics that can influence the genetic improvement of progenies. Although machine learning approaches have been applied to increase the productivity of the crop, most studies focus on molecular markers or genotypes for plant breeding, rather than on phenotype. In theory, ML applied to phenotypic data on offspring should be able to predict high breeding values. Therefore, a new ML conceptual framework to study the phenotype and progeny data of oil palm breeds is discussed in relation to achieving the Sustainable Development Goals (SDGs).