1
Miller C, Portlock T, Nyaga DM, O'Sullivan JM. A review of model evaluation metrics for machine learning in genetics and genomics. Frontiers in Bioinformatics 2024; 4:1457619. [PMID: 39318760 PMCID: PMC11420621 DOI: 10.3389/fbinf.2024.1457619] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/01/2024] [Accepted: 08/27/2024] [Indexed: 09/26/2024] Open
Abstract
Machine learning (ML) has shown great promise in genetics and genomics where large and complex datasets have the potential to provide insight into many aspects of disease risk, pathogenesis of genetic disorders, and prediction of health and wellbeing. However, with this possibility there is a responsibility to exercise caution against biases and inflation of results that can have harmful unintended impacts. Therefore, researchers must understand the metrics used to evaluate ML models which can influence the critical interpretation of results. In this review we provide an overview of ML metrics for clustering, classification, and regression and highlight the advantages and disadvantages of each. We also detail common pitfalls that occur during model evaluation. Finally, we provide examples of how researchers can assess and utilise the results of ML models, specifically from a genomics perspective.
Affiliation(s)
- Catriona Miller
  - The Liggins Institute, The University of Auckland, Auckland, New Zealand
- Theo Portlock
  - The Liggins Institute, The University of Auckland, Auckland, New Zealand
- Denis M Nyaga
  - The Liggins Institute, The University of Auckland, Auckland, New Zealand
- Justin M O'Sullivan
  - The Liggins Institute, The University of Auckland, Auckland, New Zealand
  - The Maurice Wilkins Centre, The University of Auckland, Auckland, New Zealand
  - MRC Lifecourse Epidemiology Unit, University of Southampton, Southampton, United Kingdom
  - Singapore Institute for Clinical Sciences, Agency for Science Technology and Research, Singapore, Singapore
2
Ayres LB, Gomez FJV, Silva MF, Linton JR, Garcia CD. Predicting the formation of NADES using a transformer-based model. Sci Rep 2024; 14:2715. [PMID: 38388549 PMCID: PMC10883925 DOI: 10.1038/s41598-022-27106-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/25/2022] [Accepted: 12/26/2022] [Indexed: 02/24/2024] Open
Abstract
The application of natural deep eutectic solvents (NADES) in the pharmaceutical, agricultural, and food industries represents one of the fastest growing fields of green chemistry, as these mixtures can potentially replace traditional organic solvents. These advances are, however, limited by the development of new NADES which is today, almost exclusively empirically driven and often derivative from known mixtures. To overcome this limitation, we propose the use of a transformer-based machine learning approach. Here, the transformer-based neural network model was first pre-trained to recognize chemical patterns from SMILES representations (unlabeled general chemical data) and then fine-tuned to recognize the patterns in strings that lead to the formation of either stable NADES or simple mixtures of compounds not leading to the formation of stable NADES (binary classification). Because this strategy was adapted from language learning, it allows the use of relatively small datasets and relatively low computational resources. The resulting algorithm is capable of predicting the formation of multiple new stable eutectic mixtures (n = 337) from a general database of natural compounds. More importantly, the system is also able to predict the components and molar ratios needed to render NADES with new molecules (not present in the training database), an aspect that was validated using previously reported NADES as well as by developing multiple novel solvents containing ibuprofen. We believe this strategy has the potential to transform the screening process for NADES as well as the pharmaceutical industry, streamlining the use of bioactive compounds as functional components of liquid formulations, rather than simple solutes.
Affiliation(s)
- Lucas B Ayres
  - Department of Chemistry, Clemson University, 211 S. Palmetto Blvd, Clemson, SC, 29634, USA
- Federico J V Gomez
  - Facultad de Ciencias Agrarias, Instituto de Biología Agrícola de Mendoza (IBAM-CONICET), Universidad Nacional de Cuyo, Mendoza, Argentina
- Maria Fernanda Silva
  - Facultad de Ciencias Agrarias, Instituto de Biología Agrícola de Mendoza (IBAM-CONICET), Universidad Nacional de Cuyo, Mendoza, Argentina
- Jeb R Linton
  - Department of Chemistry, Clemson University, 211 S. Palmetto Blvd, Clemson, SC, 29634, USA
  - IBM Cloud, Armonk, NY, 10504, USA
- Carlos D Garcia
  - Department of Chemistry, Clemson University, 211 S. Palmetto Blvd, Clemson, SC, 29634, USA
3
Ogunleye A, Piyawajanusorn C, Ghislat G, Ballester PJ. Large-Scale Machine Learning Analysis Reveals DNA Methylation and Gene Expression Response Signatures for Gemcitabine-Treated Pancreatic Cancer. Health Data Science 2024; 4:0108. [PMID: 38486621 PMCID: PMC10904073 DOI: 10.34133/hds.0108] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/15/2023] [Accepted: 12/08/2023] [Indexed: 03/17/2024]
Abstract
Background: Gemcitabine is a first-line chemotherapy for pancreatic adenocarcinoma (PAAD), but many PAAD patients do not respond to gemcitabine-containing treatments. Being able to predict such nonresponders would hence permit the undelayed administration of more promising treatments while sparing gemcitabine life-threatening side effects for those patients. Unfortunately, the few predictors of PAAD patient response to this drug are weak, none of them exploiting yet the power of machine learning (ML). Methods: Here, we applied ML to predict the response of PAAD patients to gemcitabine from the molecular profiles of their tumors. More concretely, we collected diverse molecular profiles of PAAD patient tumors along with the corresponding clinical data (gemcitabine responses and clinical features) from the Genomic Data Commons resource. From systematically combining 8 tumor profiles with 16 classification algorithms, each of the resulting 128 ML models was evaluated by multiple 10-fold cross-validations. Results: Only 7 of these 128 models were predictive, which underlines the importance of carrying out such a large-scale analysis to avoid missing the most predictive models. These were here random forest using 4 selected mRNAs [0.44 Matthews correlation coefficient (MCC), 0.785 receiver operating characteristic-area under the curve (ROC-AUC)] and XGBoost combining 12 DNA methylation probes (0.32 MCC, 0.697 ROC-AUC). By contrast, the hENT1 marker obtained much worse random-level performance (practically 0 MCC, 0.5 ROC-AUC). Despite not being trained to predict prognosis (overall and progression-free survival), these ML models were also able to anticipate this patient outcome. Conclusions: We release these promising ML models so that they can be evaluated prospectively on other gemcitabine-treated PAAD patients.
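The evaluation design described in this abstract (8 profiles crossed with 16 algorithms, each model assessed by repeated 10-fold cross-validation) can be sketched as follows. This is a minimal illustration only: the profile and algorithm names are hypothetical placeholders, the number of repeats is an assumption, and the splitting helper is a plain-Python stand-in rather than the authors' pipeline.

```python
from itertools import product
import random

# Hypothetical placeholder names; the abstract does not list the actual
# profiles or algorithms.
profiles = [f"profile_{i}" for i in range(1, 9)]        # 8 tumor profiles
algorithms = [f"algorithm_{j}" for j in range(1, 17)]   # 16 classifiers

# Every (profile, algorithm) pair is one candidate model.
model_grid = list(product(profiles, algorithms))
print(len(model_grid))  # 128


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for one shuffled k-fold split."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f_no, fold in enumerate(folds) if f_no != i for j in fold]
        yield train, test


def repeated_k_fold(n_samples, k=10, repeats=3):
    """Repeat k-fold with a different shuffle per repeat (repeats is assumed)."""
    for r in range(repeats):
        yield from k_fold_indices(n_samples, k=k, seed=r)
```

Each of the 128 grid entries would be trained and scored on every split, so that a model's reported metric is a summary over all folds and repeats rather than a single partition.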
Affiliation(s)
- Adeolu Ogunleye
  - Department of Organismal Biology, Uppsala University, Uppsala, Sweden
- Ghita Ghislat
  - Department of Life Sciences, Imperial College London, London, UK
4
Shah OS, Chen F, Wedn A, Kashiparekh A, Knapick B, Chen J, Savariau L, Clifford B, Hooda J, Christgen M, Xavier J, Oesterreich S, Lee AV. Multi-omic characterization of ILC and ILC-like cell lines as part of ILC cell line encyclopedia (ICLE) defines new models to study potential biomarkers and explore therapeutic opportunities. bioRxiv: the preprint server for biology 2023:2023.09.26.559548. [PMID: 37808708 PMCID: PMC10557671 DOI: 10.1101/2023.09.26.559548] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/10/2023]
Abstract
Invasive lobular carcinoma (ILC), the most common histological "special type", accounts for ∼10-15% of all breast cancer (BC) diagnoses and is characterized by unique features such as E-cadherin loss/deficiency, lower grade, hormone receptor positivity, larger diffuse tumors, and specific metastatic patterns. Despite ILC being acknowledged as a disease with distinct biology that necessitates specialized and precision medicine treatments, the further exploration of its molecular alterations with the goal of discovering new treatments has been hindered by the scarcity of well-characterized cell line models for studying this disease. To address this, we generated the ILC Cell Line Encyclopedia (ICLE), providing a comprehensive multi-omic characterization of ILC and ILC-like cell lines. Using consensus multi-omic subtyping, we confirmed luminal status of previously established ILC cell lines and uncovered additional ILC/ILC-like cell lines with luminal features for modeling ILC disease. Furthermore, most of these luminal ILC/ILC-like cell lines also showed RNA and copy number similarity to ILC patient tumors. Similarly, ILC/ILC-like cell lines also retained molecular alterations in key ILC genes at similar frequency to both primary and metastatic ILC tumors. Importantly, ILC/ILC-like cell lines recapitulated the CDH1 alteration landscape of ILC patient tumors, including enrichment of truncating mutations in, and biallelic inactivation of, the CDH1 gene. Using whole-genome optical mapping, we uncovered novel genomic rearrangements, including novel structural variations in CDH1 and functional gene fusions, and characterized breast cancer specific patterns of chromothripsis in chromosomes 8, 11 and 17. In addition, we systematically analyzed aberrant DNA methylation (DNAm) events, and integrative analysis with RNA expression revealed epigenetic activation of TFAP2B, an emerging biomarker that is preferentially expressed in lobular disease.
Finally, towards the goal of identifying novel druggable vulnerabilities in ILC, we analyzed publicly available RNAi loss of function breast cancer cell line datasets and revealed numerous putative vulnerabilities in cytoskeletal components, focal adhesion and the PI3K/AKT pathway in ILC/ILC-like vs NST cell lines. In summary, we addressed the lack of suitable models to study E-cadherin deficient breast cancers by first collecting both established and putative ILC models, then characterizing them comprehensively to show their molecular similarity to patient tumors, uncovering their novel multi-omic features, and highlighting putative novel druggable vulnerabilities. We not only expand the array of suitable E-cadherin deficient cell lines available for modelling human ILC disease but also employ them to study epigenetic activation of a putative lobular biomarker and to identify potential druggable vulnerabilities for this disease, towards enabling precision medicine research for human ILC.
5
Ogunleye AZ, Piyawajanusorn C, Gonçalves A, Ghislat G, Ballester PJ. Interpretable Machine Learning Models to Predict the Resistance of Breast Cancer Patients to Doxorubicin from Their microRNA Profiles. Advanced Science (Weinheim, Baden-Wurttemberg, Germany) 2022; 9:e2201501. [PMID: 35785523 PMCID: PMC9403644 DOI: 10.1002/advs.202201501] [Citation(s) in RCA: 16] [Impact Index Per Article: 5.3] [Received: 03/15/2022] [Revised: 06/02/2022] [Indexed: 05/05/2023]
Abstract
Doxorubicin is a common treatment for breast cancer. However, not all patients respond to this drug, which sometimes causes life-threatening side effects. Accurately anticipating doxorubicin-resistant patients would therefore make it possible to spare them this risk while considering alternative treatments without delay. Stratifying patients based on molecular markers in their pretreatment tumors is a promising approach to advance toward this ambitious goal, but single-gene markers such as HER2 expression have not been shown to be sufficiently predictive. The recent availability of matched doxorubicin-response and diverse molecular profiles across breast cancer patients now permits analysis at a much larger scale. Sixteen machine learning algorithms and 8 molecular profiles are systematically evaluated on the same cohort of patients. Only 2 of the 128 resulting models are substantially predictive, showing that they can be easily missed by a standard-scale analysis. The best model is a classification and regression tree (CART) nonlinearly combining 4 selected miRNA isoforms to predict doxorubicin response (median Matthews correlation coefficient (MCC) and area under the curve (AUC) of 0.56 and 0.80, respectively). By contrast, HER2 expression is significantly less predictive (median MCC and AUC of 0.14 and 0.57, respectively). As the predictive accuracy of this CART model increases with larger training sets, its update with future data should result in even better accuracy.
Affiliation(s)
- Adeolu Z. Ogunleye
  - Cancer Research Center of Marseille (CRCM), INSERM U1068, Marseille F-13009, France
  - Cancer Research Center of Marseille (CRCM), Institut Paoli-Calmettes, Marseille F-13009, France
  - Cancer Research Center of Marseille (CRCM), Aix-Marseille Université, Marseille F-13284, France
  - Cancer Research Center of Marseille (CRCM), CNRS UMR7258, Marseille F-13009, France
- Chayanit Piyawajanusorn
  - Cancer Research Center of Marseille (CRCM), INSERM U1068, Marseille F-13009, France
  - Cancer Research Center of Marseille (CRCM), Institut Paoli-Calmettes, Marseille F-13009, France
  - Cancer Research Center of Marseille (CRCM), Aix-Marseille Université, Marseille F-13284, France
  - Cancer Research Center of Marseille (CRCM), CNRS UMR7258, Marseille F-13009, France
- Anthony Gonçalves
  - Cancer Research Center of Marseille (CRCM), INSERM U1068, Marseille F-13009, France
  - Cancer Research Center of Marseille (CRCM), Institut Paoli-Calmettes, Marseille F-13009, France
  - Cancer Research Center of Marseille (CRCM), Aix-Marseille Université, Marseille F-13284, France
  - Cancer Research Center of Marseille (CRCM), CNRS UMR7258, Marseille F-13009, France
- Ghita Ghislat
  - Cancer Research Center of Marseille (CRCM), INSERM U1068, Marseille F-13009, France
  - Cancer Research Center of Marseille (CRCM), Institut Paoli-Calmettes, Marseille F-13009, France
  - Cancer Research Center of Marseille (CRCM), Aix-Marseille Université, Marseille F-13284, France
  - Cancer Research Center of Marseille (CRCM), CNRS UMR7258, Marseille F-13009, France
- Pedro J. Ballester
  - Cancer Research Center of Marseille (CRCM), INSERM U1068, Marseille F-13009, France
  - Cancer Research Center of Marseille (CRCM), Institut Paoli-Calmettes, Marseille F-13009, France
  - Cancer Research Center of Marseille (CRCM), Aix-Marseille Université, Marseille F-13284, France
  - Cancer Research Center of Marseille (CRCM), CNRS UMR7258, Marseille F-13009, France
  - Department of Bioengineering, Imperial College London, London SW7 2AZ, UK
6
Ba-Alawi W, Kadambat Nair S, Li B, Mammoliti A, Smirnov P, Mer AS, Penn LZ, Haibe-Kains B. Bimodal gene expression in cancer patients provides interpretable biomarkers for drug sensitivity. Cancer Res 2022; 82:2378-2387. [PMID: 35536872 DOI: 10.1158/0008-5472.can-21-2395] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 07/27/2021] [Revised: 02/24/2022] [Accepted: 05/06/2022] [Indexed: 11/16/2022]
Abstract
Identifying biomarkers predictive of cancer cell response to drug treatment constitutes one of the main challenges in precision oncology. Recent large-scale cancer pharmacogenomic studies have opened new avenues of research to develop predictive biomarkers by profiling thousands of human cancer cell lines at the molecular level and screening them with hundreds of approved drugs and experimental chemical compounds. Many studies have leveraged these data to build predictive models of response using various statistical and machine learning methods. However, a common pitfall to these methods is the lack of interpretability as to how they make predictions, hindering the clinical translation of these models. To alleviate this issue, we used the recent logic modeling approach to develop a new machine learning pipeline that explores the space of bimodally expressed genes in multiple large in vitro pharmacogenomic studies and builds multivariate, nonlinear, yet interpretable logic-based models predictive of drug response. The performance of this approach was showcased in a compendium of the three largest in vitro pharmacogenomic data sets to build robust and interpretable models for 101 drugs that span 17 drug classes with high validation rates in independent datasets. These results along with in vivo and clinical validation, support a better translation of gene expression biomarkers between model systems using bimodal gene expression.
Affiliation(s)
- Bo Li
  - University of Toronto, Toronto, Canada
- Linda Z Penn
  - Princess Margaret Cancer Centre, Toronto, Ontario, Canada
7
Lin YF, Liu JJ, Chang YJ, Yu CS, Yi W, Lane HY, Lu CH. Predicting Anticancer Drug Resistance Mediated by Mutations. Pharmaceuticals (Basel) 2022; 15:136. [PMID: 35215249 PMCID: PMC8878306 DOI: 10.3390/ph15020136] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 12/30/2021] [Revised: 01/16/2022] [Accepted: 01/21/2022] [Indexed: 02/01/2023] Open
Abstract
Cancer drug resistance presents a challenge for precision medicine. Drug-resistant mutations are always emerging. In this study, we explored the relationship between drug-resistant mutations and drug resistance from the perspective of protein structure. By combining data from previously identified drug-resistant mutations and information of protein structure and function, we used machine learning-based methods to build models to predict cancer drug resistance mutations. The performance of our combined model achieved an accuracy of 86%, a Matthews correlation coefficient score of 0.57, and an F1 score of 0.66. We have constructed a fast, reliable method that predicts and investigates cancer drug resistance in a protein structure. Nonetheless, more information is needed concerning drug resistance and, in particular, clarification is needed about the relationships between the drug and the drug resistance mutations in proteins. Highly accurate predictions regarding drug resistance mutations can be helpful for developing new strategies with personalized cancer treatments. Our novel concept, which combines protein structure information, has the potential to elucidate physiological mechanisms of cancer drug resistance.
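The three metrics reported in this abstract (accuracy, Matthews correlation coefficient, and F1 score) are all functions of the same binary confusion matrix. The sketch below uses their standard definitions; the counts in the example are hypothetical and are not taken from the paper. It illustrates why a high accuracy can coexist with a more modest MCC and F1 when the positive class is the minority.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Accuracy, F1 and MCC from binary confusion-matrix counts.

    Undefined ratios (zero denominators) are returned as 0.0 by convention.
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "f1": f1, "mcc": mcc}


# Hypothetical imbalanced test set: 30 resistance mutations, 170 neutral ones.
example = binary_metrics(tp=20, fp=10, tn=160, fn=10)
print(example)
```

A classifier that labels every sample negative on such a set would still reach high accuracy while scoring 0 on both F1 and MCC, which is why the latter two are usually reported alongside accuracy for imbalanced problems.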
Affiliation(s)
- Yu-Feng Lin
  - Department of Medical Laboratory Science and Biotechnology, Asia University, Taichung 41354, Taiwan
- Jia-Jun Liu
  - The Ph.D. Program of Biotechnology and Biomedical Industry, China Medical University, Taichung 40402, Taiwan
- Yu-Jen Chang
  - The Ph.D. Program of Biotechnology and Biomedical Industry, China Medical University, Taichung 40402, Taiwan
- Chin-Sheng Yu
  - Department of Information Engineering and Computer Science, Feng Chia University, Taichung 40201, Taiwan
- Wei Yi
  - Department of Medical Laboratory Science and Biotechnology, Asia University, Taichung 41354, Taiwan
- Hsien-Yuan Lane
  - Graduate Institute of Biomedical Sciences, China Medical University, Taichung 40402, Taiwan
  - Department of Psychiatry, China Medical University Hospital, Taichung 40402, Taiwan
  - Brain Disease Research Center, China Medical University Hospital, Taichung 40402, Taiwan
- Chih-Hao Lu
  - The Ph.D. Program of Biotechnology and Biomedical Industry, China Medical University, Taichung 40402, Taiwan
  - Graduate Institute of Biomedical Sciences, China Medical University, Taichung 40402, Taiwan
  - Department of Medical Laboratory Science and Biotechnology, China Medical University, Taichung 40402, Taiwan
8
Yuan X, Li Z, Xiong L, Song S, Zheng X, Tang Z, Yuan Z, Li L. Effective identification of varieties by nucleotide polymorphisms and its application for essentially derived variety identification in rice. BMC Bioinformatics 2022; 23:30. [PMID: 35012448 PMCID: PMC8751067 DOI: 10.1186/s12859-022-04562-9] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 11/02/2021] [Accepted: 01/04/2022] [Indexed: 11/13/2022] Open
Abstract
BACKGROUND Plant variety identification is one of the most important aspects of agricultural systems. Development of DNA marker profiles of released varieties for comparison with candidate or future varieties is required. However, strictly speaking, most existing variety identification techniques have been used not for "identification" but for "distinction of a limited number of cultivars," and their generalization ability has not always been well estimated. Because many varieties have similar genetic backgrounds, and some essentially derived varieties (EDVs) are even involved, identification and breeding progress are made difficult. A fast, accurate variety identification method that also performs well on EDV determination needs to be developed. RESULTS In this study, using a "Divide and Conquer" strategy, a variety identification method, Conditional Random Selection (CRS), was developed based on whole-genome SNPs of 3024 rice varieties and applied to essentially derived variety (EDV) identification in rice. CRS is a fast, efficient, and automated variety identification method. In practice, with the optimal identity-score threshold found in this study, the SNP set (390 SNPs) showed optimal performance on EDV and non-EDV identification in two independent testing datasets. CONCLUSION This approach first selected a minimal set of SNPs to discriminate non-EDVs in the 3000 Rice Genome Project, then united several simplified SNP sets to improve its generalization ability for EDV and non-EDV identification in testing datasets. The results suggested that the CRS method outperformed traditional feature selection methods. Furthermore, it provides a new way to screen out core SNP loci from the whole genome for DNA fingerprinting of crop varieties and should be useful for crop breeding.
Affiliation(s)
- Xiong Yuan
  - Hunan Engineering and Technology Research Center for Agricultural Big Data Analysis and Decision-Making, Hunan Agricultural University, Changsha, 410128, China
- Zirong Li
  - Hunan Engineering and Technology Research Center for Agricultural Big Data Analysis and Decision-Making, Hunan Agricultural University, Changsha, 410128, China
- Liwen Xiong
  - Hunan Engineering and Technology Research Center for Agricultural Big Data Analysis and Decision-Making, Hunan Agricultural University, Changsha, 410128, China
- Sufeng Song
  - State Key Laboratory of Hybrid Rice, Hunan Hybrid Rice Research Center, Changsha, 410125, China
- Xingfei Zheng
  - Hubei Key Laboratory of Food Crop Germplasm and Genetic Improvement, Food Crop Institute, Hubei Academy of Agricultural Sciences, Wuhan, 430064, China
- Zhonghai Tang
  - College of Food Science and Technology, Hunan Agricultural University, Changsha, 410128, China
- Zheming Yuan
  - Hunan Engineering and Technology Research Center for Agricultural Big Data Analysis and Decision-Making, Hunan Agricultural University, Changsha, 410128, China
- Lanzhi Li
  - Hunan Engineering and Technology Research Center for Agricultural Big Data Analysis and Decision-Making, Hunan Agricultural University, Changsha, 410128, China
9
Network Biology and Artificial Intelligence Drive the Understanding of the Multidrug Resistance Phenotype in Cancer. Drug Resist Updat 2022; 60:100811. [DOI: 10.1016/j.drup.2022.100811] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 12/29/2021] [Revised: 01/22/2022] [Accepted: 01/24/2022] [Indexed: 02/07/2023]
10
Nguyen LC, Naulaerts S, Bruna A, Ghislat G, Ballester PJ. Predicting Cancer Drug Response In Vivo by Learning an Optimal Feature Selection of Tumour Molecular Profiles. Biomedicines 2021; 9:1319. [PMID: 34680436 PMCID: PMC8533095 DOI: 10.3390/biomedicines9101319] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.8] [Received: 09/01/2021] [Revised: 09/22/2021] [Accepted: 09/23/2021] [Indexed: 12/17/2022] Open
Abstract
(1) Background: Inter-tumour heterogeneity is one of cancer’s most fundamental features. Patient stratification based on drug response prediction is hence needed for effective anti-cancer therapy. However, single-gene markers of response are rare and/or may fail to achieve a significant impact in the clinic. Machine Learning (ML) is emerging as a particularly promising complementary approach to precision oncology. (2) Methods: Here we leverage comprehensive Patient-Derived Xenograft (PDX) pharmacogenomic data sets with dimensionality-reducing ML algorithms with this purpose. (3) Results: Combining multiple gene alterations via ML leads to better discrimination between sensitive and resistant PDXs in 19 of the 26 analysed cases. Highly predictive ML models employing concise gene lists were found for three cases: paclitaxel (breast cancer), binimetinib (breast cancer) and cetuximab (colorectal cancer). Interestingly, each of these multi-gene ML models identifies some treatment-responsive PDXs not harbouring the best actionable mutation for that case. Thus, ML multi-gene predictors generally have much fewer false negatives than the corresponding single-gene marker. (4) Conclusions: As PDXs often recapitulate clinical outcomes, these results suggest that many more patients could benefit from precision oncology if ML algorithms were also applied to existing clinical pharmacogenomics data, especially those algorithms generating classifiers combining data-selected gene alterations.
Affiliation(s)
- Linh C. Nguyen
  - Cancer Research Center of Marseille, INSERM U1068, F-13009 Marseille, France
  - Institut Paoli-Calmettes, F-13009 Marseille, France
  - Aix-Marseille Université UM105, F-13009 Marseille, France
  - CNRS UMR7258, F-13009 Marseille, France
  - Department of Life Sciences, University of Science and Technology of Hanoi, Vietnam Academy of Science and Technology, Hanoi 100803, Vietnam
- Stefan Naulaerts
  - Ludwig Institute for Cancer Research, 1200 Brussels, Belgium
  - Duve Institute, UCLouvain, 1200 Brussels, Belgium
- Ghita Ghislat
  - Centre d’Immunologie de Marseille-Luminy, INSERM U1104, CNRS UMR7280, F-13009 Marseille, France
- Pedro J. Ballester
  - Cancer Research Center of Marseille, INSERM U1068, F-13009 Marseille, France
  - Institut Paoli-Calmettes, F-13009 Marseille, France
  - Aix-Marseille Université UM105, F-13009 Marseille, France
  - CNRS UMR7258, F-13009 Marseille, France
  - Correspondence: Tel. +33-(0)4-8697-7201
11
Piyawajanusorn C, Nguyen LC, Ghislat G, Ballester PJ. A gentle introduction to understanding preclinical data for cancer pharmaco-omic modeling. Brief Bioinform 2021; 22:6343527. [PMID: 34368843 DOI: 10.1093/bib/bbab312] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 04/23/2021] [Revised: 06/25/2021] [Accepted: 07/20/2021] [Indexed: 12/16/2022] Open
Abstract
A central goal of precision oncology is to administer an optimal drug treatment to each cancer patient. A common preclinical approach to tackle this problem has been to characterize the tumors of patients at the molecular and drug response levels, and employ the resulting datasets for predictive in silico modeling (mostly using machine learning). Understanding how and why the different variants of these datasets are generated is an important component of this process. This review focuses on providing such an introduction, aimed at scientists with little previous exposure to this research area.
Affiliation(s)
- Chayanit Piyawajanusorn
  - Cancer Research Center of Marseille, INSERM U1068, F-13009 Marseille, France
  - Institut Paoli-Calmettes, F-13009 Marseille, France
  - Aix-Marseille Université, F-13284 Marseille, France
  - CNRS UMR7258, F-13009 Marseille, France
  - Faculty of Medicine and Public Health, HRH Princess Chulabhorn College of Medical Science, Chulabhorn Royal Academy, Bangkok, Thailand
- Linh C Nguyen
  - Cancer Research Center of Marseille, INSERM U1068, F-13009 Marseille, France
  - Institut Paoli-Calmettes, F-13009 Marseille, France
  - Aix-Marseille Université, F-13284 Marseille, France
  - CNRS UMR7258, F-13009 Marseille, France
  - Department of Life Sciences, University of Science and Technology of Hanoi, Vietnam Academy of Science and Technology, Hanoi, Vietnam
- Ghita Ghislat
  - U1104, CNRS UMR7280, Centre d'Immunologie de Marseille-Luminy, Inserm, Marseille, France
- Pedro J Ballester
  - Cancer Research Center of Marseille, INSERM U1068, F-13009 Marseille, France
  - Institut Paoli-Calmettes, F-13009 Marseille, France
  - Aix-Marseille Université, F-13284 Marseille, France
  - CNRS UMR7258, F-13009 Marseille, France
12
Park S, Soh J, Lee H. Super.FELT: supervised feature extraction learning using triplet loss for drug response prediction with multi-omics data. BMC Bioinformatics 2021; 22:269. [PMID: 34034645 PMCID: PMC8152321 DOI: 10.1186/s12859-021-04146-z] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 02/01/2021] [Accepted: 04/22/2021] [Indexed: 12/13/2022] Open
Abstract
BACKGROUND Predicting the drug response of a patient is important for precision oncology. In recent studies, multi-omics data have been used to improve the prediction accuracy of drug response. Although multi-omics data are good resources for drug response prediction, the large dimension of data tends to hinder performance improvement. In this study, we aimed to develop a new method, which can effectively reduce the large dimension of data, based on the supervised deep learning model for predicting drug response. RESULTS We proposed a novel method called Supervised Feature Extraction Learning using Triplet loss (Super.FELT) for drug response prediction. Super.FELT consists of three stages, namely, feature selection, feature encoding using a supervised method, and binary classification of drug response (sensitive or resistant). We used multi-omics data including mutation, copy number aberration, and gene expression, and these were obtained from cell lines [Genomics of Drug Sensitivity in Cancer (GDSC), Cancer Cell Line Encyclopedia (CCLE), and Cancer Therapeutics Response Portal (CTRP)], patient-derived tumor xenografts (PDX), and The Cancer Genome Atlas (TCGA). GDSC was used for training and cross-validation tests, and CCLE, CTRP, PDX, and TCGA were used for external validation. We performed ablation studies for the three stages and verified that the use of multi-omics data guarantees better performance of drug response prediction. Our results verified that Super.FELT outperformed the other methods at external validation on PDX and TCGA and was good at cross-validation on GDSC and external validation on CCLE and CTRP. In addition, through our experiments, we confirmed that using multi-omics data is useful for external non-cell line data. CONCLUSION By separating the three stages, Super.FELT achieved better performance than the other methods. 
Through our results, we found that it is important to train encoders and a classifier independently, especially for external test on PDX and TCGA. Moreover, although gene expression is the most powerful data on cell line data, multi-omics promises better performance for external validation on non-cell line data than gene expression data. Source codes of Super.FELT are available at https://github.com/DMCB-GIST/Super.FELT .
Affiliation(s)
- Sejin Park
- School of Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology, Gwangju, South Korea
- Jihee Soh
- School of Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology, Gwangju, South Korea
- Hyunju Lee
- School of Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology, Gwangju, South Korea
- Graduate School of Artificial Intelligence, Gwangju Institute of Science and Technology, Gwangju, South Korea
13
Naulaerts S, Menden MP, Ballester PJ. Concise Polygenic Models for Cancer-Specific Identification of Drug-Sensitive Tumors from Their Multi-Omics Profiles. Biomolecules 2020; 10:E963. [PMID: 32604779 PMCID: PMC7356608 DOI: 10.3390/biom10060963] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Received: 05/31/2020] [Revised: 06/20/2020] [Accepted: 06/22/2020] [Indexed: 12/15/2022] Open
Abstract
In silico models to predict which tumors will respond to a given drug are necessary for Precision Oncology. However, predictive models are only available for a handful of cases (each case being a given drug acting on tumors of a specific cancer type). A way to generate predictive models for the remaining cases is with suitable machine learning algorithms that are yet to be applied to existing in vitro pharmacogenomics datasets. Here, we apply XGBoost integrated with a stringent feature selection approach, which is an algorithm that is advantageous for these high-dimensional problems. Thus, we identified and validated 118 predictive models for 62 drugs across five cancer types by exploiting four molecular profiles (sequence mutations, copy-number alterations, gene expression, and DNA methylation). Predictive models were found in each cancer type and with every molecular profile. On average, no omics profile or cancer type obtained models with higher predictive accuracy than the rest. However, within a given cancer type, some molecular profiles were overrepresented among predictive models. For instance, CNA profiles were predictive in breast invasive carcinoma (BRCA) cell lines, but not in small cell lung cancer (SCLC) cell lines where gene expression (GEX) and DNA methylation profiles were the most predictive. Lastly, we identified the best XGBoost model per cancer type and analyzed their selected features. For each model, some of the genes in the selected list had already been found to be individually linked to the response to that drug, providing additional evidence of the usefulness of these models and the merits of the feature selection scheme.
Affiliation(s)
- Stefan Naulaerts
- Cancer Research Center of Marseille, INSERM U1068, F-13009 Marseille, France
- Institut Paoli-Calmettes, F-13009 Marseille, France
- Aix-Marseille Université, F-13284 Marseille, France
- CNRS UMR7258, F-13009 Marseille, France
- Ludwig Institute for Cancer Research, de Duve Institute, Université catholique de Louvain, 1200 Brussels, Belgium
- Michael P. Menden
- Institute of Computational Biology, Helmholtz Zentrum München—German Research Center for Environmental Health, 85764 Neuherberg, Germany
- Department of Biology, Ludwig-Maximilians University Munich, 82152 Planegg-Martinsried, Germany
- German Centre for Diabetes Research (DZD e.V.), 85764 Neuherberg, Germany
- Pedro J. Ballester
- Cancer Research Center of Marseille, INSERM U1068, F-13009 Marseille, France
- Institut Paoli-Calmettes, F-13009 Marseille, France
- Aix-Marseille Université, F-13284 Marseille, France
- CNRS UMR7258, F-13009 Marseille, France
14
Onecha E, Ruiz-Heredia Y, Martínez-Cuadrón D, Barragán E, Martinez-Sanchez P, Linares M, Rapado I, Perez-Oteyza J, Magro E, Herrera P, Rojas JL, Gorrochategui J, Villoria J, Boluda B, Sargas C, Ballesteros J, Montesinos P, Martínez-López J, Ayala R. Improving the prediction of acute myeloid leukaemia outcomes by complementing mutational profiling with ex vivo chemosensitivity. Br J Haematol 2020; 189:672-683. [PMID: 32068246 DOI: 10.1111/bjh.16432] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Received: 09/02/2019] [Revised: 10/31/2019] [Accepted: 11/01/2019] [Indexed: 02/06/2023]
Abstract
Refractoriness to induction therapy and relapse after complete remission are the leading causes of death in patients with acute myeloid leukaemia (AML). This study focussed on the prediction of response to standard induction therapy and outcome of patients with AML using a combined strategy of mutational profiling by next-generation sequencing (NGS, n = 190) and ex vivo PharmaFlow testing (n = 74) for the 10 most widely used drugs for AML induction therapy, in a cohort of adult patients uniformly treated according to Spanish PETHEMA guidelines. We identified an adverse mutational profile (EZH2, KMT2A, U2AF1 and/or TP53 mutations) that carries a greater risk of death [hazard ratio (HR): 3·29, P < 0·0001]. A high correlation was found between the ex vivo PharmaFlow results and clinical induction response (69%). Clinical correlation analysis showed that the pattern of multiresistance revealed by ex vivo PharmaFlow identified patients with a high risk of death (HR: 2·58). Patients with mutation status also ran a high risk (HR 4·19), and the risk was increased further in patients with both adverse profiles (HR 4·82). We have developed a new score based on NGS and ex vivo drug testing for AML patients that improves upon current prognostic risk stratification and allows clinicians to tailor treatments to minimise drug resistance.
Affiliation(s)
- Esther Onecha
- Hematology Department, Hospital Universitario 12 de Octubre, Madrid, Spain; Hematological Malignancies Clinical Research Unit, CNIO, Madrid, Spain
- Yanira Ruiz-Heredia
- Hematological Malignancies Clinical Research Unit, CNIO, Madrid, Spain; Vivia Biotech, Tres Cantos, Madrid, Spain
- David Martínez-Cuadrón
- Department of Hematology, Hospital Universitari i Politècnic La Fe, Valencia, Madrid, Spain
- Eva Barragán
- Department of Hematology, Hospital Universitari i Politècnic La Fe, Valencia, Madrid, Spain
- Pilar Martinez-Sanchez
- Hematology Department, Hospital Universitario 12 de Octubre, Madrid, Spain; Hematological Malignancies Clinical Research Unit, CNIO, Madrid, Spain; Complutense University, Madrid, Spain
- María Linares
- Hematology Department, Hospital Universitario 12 de Octubre, Madrid, Spain; Hematological Malignancies Clinical Research Unit, CNIO, Madrid, Spain; Complutense University, Madrid, Spain
- Inmaculada Rapado
- Hematology Department, Hospital Universitario 12 de Octubre, Madrid, Spain; Hematological Malignancies Clinical Research Unit, CNIO, Madrid, Spain; CIBERONC, Instituto Carlos III, Madrid, Spain
- Jaime Perez-Oteyza
- Hematology Department, Hospital Universitario Sanchinarro, Madrid, Spain
- Elena Magro
- Hematology Department, Hospital Universitario Principe de Asturias, Madrid, Spain
- Pilar Herrera
- Hematology Department, Hospital Universitario Ramon y Cajal, Madrid, Spain
- Blanca Boluda
- Department of Hematology, Hospital Universitari i Politècnic La Fe, Valencia, Madrid, Spain
- Claudia Sargas
- Department of Hematology, Hospital Universitari i Politècnic La Fe, Valencia, Madrid, Spain
- Pau Montesinos
- Department of Hematology, Hospital Universitari i Politècnic La Fe, Valencia, Madrid, Spain
- Joaquín Martínez-López
- Hematology Department, Hospital Universitario 12 de Octubre, Madrid, Spain; Hematological Malignancies Clinical Research Unit, CNIO, Madrid, Spain; Complutense University, Madrid, Spain; CIBERONC, Instituto Carlos III, Madrid, Spain
- Rosa Ayala
- Hematology Department, Hospital Universitario 12 de Octubre, Madrid, Spain; Hematological Malignancies Clinical Research Unit, CNIO, Madrid, Spain; Complutense University, Madrid, Spain; CIBERONC, Instituto Carlos III, Madrid, Spain
15
Baptista D, Ferreira PG, Rocha M. Deep learning for drug response prediction in cancer. Brief Bioinform 2020; 22:360-379. [PMID: 31950132 DOI: 10.1093/bib/bbz171] [Citation(s) in RCA: 112] [Impact Index Per Article: 22.4] [Received: 09/17/2019] [Revised: 11/04/2019] [Indexed: 01/15/2023] Open
Abstract
Predicting the sensitivity of tumors to specific anti-cancer treatments is a challenge of paramount importance for precision medicine. Machine learning (ML) algorithms can be trained on high-throughput screening data to develop models that are able to predict the response of cancer cell lines and patients to novel drugs or drug combinations. Deep learning (DL) refers to a distinct class of ML algorithms that have achieved top-level performance in a variety of fields, including drug discovery. These types of models have unique characteristics that may make them more suitable for the complex task of modeling drug response based on both biological and chemical data, but the application of DL to drug response prediction has been unexplored until very recently. The few studies that have been published have shown promising results, and the use of DL for drug response prediction is beginning to attract greater interest from researchers in the field. In this article, we critically review recently published studies that have employed DL methods to predict drug response in cancer cell lines. We also provide a brief description of DL and the main types of architectures that have been used in these studies. Additionally, we present a selection of publicly available drug screening data resources that can be used to develop drug response prediction models. Finally, we also address the limitations of these approaches and provide a discussion on possible paths for further improvement. Contact: mrocha@di.uminho.pt.
Affiliation(s)
- Miguel Rocha
- Department of Informatics and a Senior Researcher of the Centre of Biological Engineering at the University of Minho
16
Chicco D, Jurman G. The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genomics 2020; 21:6. [PMID: 31898477 PMCID: PMC6941312 DOI: 10.1186/s12864-019-6413-7] [Citation(s) in RCA: 1451] [Impact Index Per Article: 290.2] [Received: 05/24/2019] [Accepted: 12/18/2019] [Indexed: 02/06/2023] Open
Abstract
BACKGROUND To evaluate binary classifications and their confusion matrices, scientific researchers can employ several statistical rates, according to the goal of the experiment they are investigating. Despite being a crucial issue in machine learning, no widespread consensus has been reached on a unified elective chosen measure yet. Accuracy and F1 score computed on confusion matrices have been (and still are) among the most popular adopted metrics in binary classification tasks. However, these statistical measures can dangerously show overoptimistic inflated results, especially on imbalanced datasets. RESULTS The Matthews correlation coefficient (MCC), instead, is a more reliable statistical rate which produces a high score only if the prediction obtained good results in all of the four confusion matrix categories (true positives, false negatives, true negatives, and false positives), proportionally both to the size of positive elements and the size of negative elements in the dataset. CONCLUSIONS In this article, we show how MCC produces a more informative and truthful score in evaluating binary classifications than accuracy and F1 score, by first explaining the mathematical properties, and then the asset of MCC in six synthetic use cases and in a real genomics scenario. We believe that the Matthews correlation coefficient should be preferred to accuracy and F1 score in evaluating binary classification tasks by all scientific communities.
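The contrast described in this abstract can be illustrated with a small sketch. The confusion-matrix counts below are hypothetical and are not taken from the article; the function name `confusion_metrics` and the convention of returning 0.0 when a denominator is zero are assumptions made for illustration only. The example uses the standard definitions of accuracy, F1 score, and MCC on an imbalanced set of 95 positives and 5 negatives.

```python
import math


def confusion_metrics(tp, fn, tn, fp):
    """Return (accuracy, F1, MCC) computed from the four confusion-matrix counts."""
    total = tp + fn + tn + fp
    accuracy = (tp + tn) / total

    # F1 is the harmonic mean of precision and recall; it ignores true negatives.
    f1_den = 2 * tp + fp + fn
    f1 = 2 * tp / f1_den if f1_den else 0.0  # assumed convention for an empty denominator

    # MCC uses all four cells of the confusion matrix.
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_den if mcc_den else 0.0  # assumed convention
    return accuracy, f1, mcc


# Hypothetical imbalanced case: 95 positives, 5 negatives, and a classifier
# that labels almost every sample as positive.
acc, f1, mcc = confusion_metrics(tp=90, fn=5, tn=1, fp=4)
print(f"accuracy={acc:.3f}  F1={f1:.3f}  MCC={mcc:.3f}")
```

With these counts, accuracy and F1 both exceed 0.9 while MCC stays below 0.2, because only one of the five negatives is classified correctly.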
Affiliation(s)
- Davide Chicco
- Krembil Research Institute, Toronto, Ontario, Canada
- Peter Munk Cardiac Centre, Toronto, Ontario, Canada
17
Bomane A, Gonçalves A, Ballester PJ. Paclitaxel Response Can Be Predicted With Interpretable Multi-Variate Classifiers Exploiting DNA-Methylation and miRNA Data. Front Genet 2019; 10:1041. [PMID: 31708973 PMCID: PMC6823251 DOI: 10.3389/fgene.2019.01041] [Citation(s) in RCA: 34] [Impact Index Per Article: 5.7] [Received: 06/18/2019] [Accepted: 09/30/2019] [Indexed: 12/27/2022] Open
Abstract
To address the problem of resistance to paclitaxel treatment, we have investigated to what extent it is possible to predict Breast Cancer (BC) patient response to this drug. We carried out a large-scale tumor-based prediction analysis using data from the US National Cancer Institute’s Genomic Data Commons. These data sets comprise the responses of BC patients to paclitaxel along with six molecular profiles of their tumors. We assessed 10 Machine Learning (ML) algorithms on each of these profiles and evaluated the resulting 60 classifiers on the same BC patients. DNA methylation and miRNA profiles were the most informative overall. In combination with these two profiles, ML algorithms selecting the smallest subset of molecular features generated the most predictive classifiers: a complexity-optimized XGBoost classifier based on CpG island methylation extracted a subset of molecular factors relevant to predict paclitaxel response (AUC = 0.74). A CpG site methylation-based Decision Tree (DT) combining only 2 of the 22,941 considered CpG sites (AUC = 0.89) and a miRNA expression-based DT employing just 4 of the 337 analyzed mature miRNAs (AUC = 0.72) reveal the molecular types associated with paclitaxel-sensitive and resistant BC tumors. A literature review shows that features selected by these three classifiers have been individually linked to the cytotoxic-drug sensitivities and prognosis of BC patients. Our work leads to several molecular signatures, unearthed from methylome and miRNome, able to anticipate to some extent which BC tumors respond or not to paclitaxel. These results may provide insights to optimize paclitaxel-therapies in clinical practice.
Affiliation(s)
- Alexandra Bomane
- Cancer Research Center of Marseille, CRCM, INSERM, Institut Paoli-Calmettes, Aix-Marseille Univ, CNRS, Paris, France
- Anthony Gonçalves
- Cancer Research Center of Marseille, CRCM, INSERM, Institut Paoli-Calmettes, Aix-Marseille Univ, CNRS, Paris, France
- Pedro J Ballester
- Cancer Research Center of Marseille, CRCM, INSERM, Institut Paoli-Calmettes, Aix-Marseille Univ, CNRS, Paris, France
18
Tolios A, De Las Rivas J, Hovig E, Trouillas P, Scorilas A, Mohr T. Computational approaches in cancer multidrug resistance research: Identification of potential biomarkers, drug targets and drug-target interactions. Drug Resist Updat 2019; 48:100662. [PMID: 31927437 DOI: 10.1016/j.drup.2019.100662] [Citation(s) in RCA: 39] [Impact Index Per Article: 6.5] [Received: 09/07/2019] [Revised: 10/15/2019] [Accepted: 10/17/2019] [Indexed: 02/07/2023]
Abstract
Like physics in the 19th century, biology and molecular biology in particular, has been fertilized and enhanced like few other scientific fields, by the incorporation of mathematical methods. In the last decades, a whole new scientific field, bioinformatics, has developed with an output of over 30,000 papers a year (Pubmed search using the keyword "bioinformatics"). Huge databases of mass throughput data have been established, with ArrayExpress alone containing more than 2.7 million assays (October 2019). Computational methods have become indispensable tools in molecular biology, particularly in one of the most challenging areas of cancer research, multidrug resistance (MDR). However, confronted with a plethora of different algorithms, approaches, and methods, the average researcher faces key questions: Which methods do exist? Which methods can be used to tackle the aims of a given study? Or, more generally, how do I use computational biology/bioinformatics to bolster my research? The current review is aimed at providing guidance to existing methods with relevance to MDR research. In particular, we provide an overview on: a) the identification of potential biomarkers using expression data; b) the prediction of treatment response by machine learning methods; c) the employment of network approaches to identify gene/protein regulatory networks and potential key players; d) the identification of drug-target interactions; e) the use of bipartite networks to identify multidrug targets; f) the identification of cellular subpopulations with the MDR phenotype; and, finally, g) the use of molecular modeling methods to guide and enhance drug discovery. This review shall serve as a guide through some of the basic concepts useful in MDR research. It shall give the reader some ideas about the possibilities in MDR research by using computational tools, and, finally, it shall provide a short overview of relevant literature.
Affiliation(s)
- A Tolios
- Department of Blood Group Serology and Transfusion Medicine, Medical University of Vienna, Vienna, Austria; Department of Laboratory Medicine, Medical University of Vienna, Vienna, Austria; Institute of Clinical Chemistry and Laboratory Medicine, Heinrich Heine University, Duesseldorf, Germany
- J De Las Rivas
- Bioinformatics and Functional Genomics Group, Cancer Research Center (CiC-IMBCC, CSIC/USAL/IBSAL), Consejo Superior de Investigaciones Científicas (CSIC) and University of Salamanca (USAL), Campus Miguel de Unamuno s/n, Salamanca, Spain
- E Hovig
- Department of Tumor Biology, Institute for Cancer Research, Oslo University Hospital and Center for Bioinformatics, Department of Informatics, University of Oslo, Oslo, Norway
- P Trouillas
- UMR 1248 INSERM, Univ. Limoges, 2 rue du Dr Marland, 87052, Limoges, France; RCPTM, University Palacký of Olomouc, tr. 17. listopadu 12, 771 46, Olomouc, Czech Republic
- A Scorilas
- Department of Biochemistry & Molecular Biology, Faculty of Biology, National and Kapodistrian University of Athens, Athens, Greece
- T Mohr
- Institute of Cancer Research, Department of Medicine I, Medical University of Vienna, Vienna, Austria; ScienceConsult - DI Thomas Mohr KG, Guntramsdorf, Austria
19
Sidorov P, Naulaerts S, Ariey-Bonnet J, Pasquier E, Ballester PJ. Predicting Synergism of Cancer Drug Combinations Using NCI-ALMANAC Data. Front Chem 2019; 7:509. [PMID: 31380352 PMCID: PMC6646421 DOI: 10.3389/fchem.2019.00509] [Citation(s) in RCA: 79] [Impact Index Per Article: 13.2] [Received: 05/17/2019] [Accepted: 07/02/2019] [Indexed: 12/15/2022] Open
Abstract
Drug combinations are of great interest for cancer treatment. Unfortunately, the discovery of synergistic combinations by purely experimental means is only feasible on small sets of drugs. In silico modeling methods can substantially widen this search by providing tools able to predict which of all possible combinations in a large compound library are synergistic. Here we investigate to which extent drug combination synergy can be predicted by exploiting the largest available dataset to date (NCI-ALMANAC, with over 290,000 synergy determinations). Each cell line is modeled using primarily two machine learning techniques, Random Forest (RF) and Extreme Gradient Boosting (XGBoost), on the datasets provided by NCI-ALMANAC. This large-scale predictive modeling study comprises more than 5,000 pair-wise drug combinations, 60 cell lines, 4 types of models, and 5 types of chemical features. The application of a powerful, yet uncommonly used, RF-specific technique for reliability prediction is also investigated. The evaluation of these models shows that it is possible to predict the synergy of unseen drug combinations with high accuracy (Pearson correlations between 0.43 and 0.86 depending on the considered cell line, with XGBoost providing slightly better predictions than RF). We have also found that restricting to the most reliable synergy predictions results in at least 2-fold error decrease with respect to employing the best learning algorithm without any reliability estimation. Alkylating agents, tyrosine kinase inhibitors and topoisomerase inhibitors are the drugs whose synergy with other partner drugs are better predicted by the models. Despite its leading size, NCI-ALMANAC comprises an extremely small part of all conceivable combinations. Given their accuracy and reliability estimation, the developed models should drastically reduce the number of required in vitro tests by predicting in silico which of the considered combinations are likely to be synergistic.
Affiliation(s)
- Pavel Sidorov
- CRCM, INSERM, Cancer Research Center of Marseille, Institut Paoli-Calmettes, Aix-Marseille Univ, CNRS, Marseille, France
- Stefan Naulaerts
- CRCM, INSERM, Cancer Research Center of Marseille, Institut Paoli-Calmettes, Aix-Marseille Univ, CNRS, Marseille, France
- Department of Tumor Immunology, Institut de Duve, Bruxelles, Belgium
- Jérémy Ariey-Bonnet
- CRCM, INSERM, Cancer Research Center of Marseille, Institut Paoli-Calmettes, Aix-Marseille Univ, CNRS, Marseille, France
- Eddy Pasquier
- CRCM, INSERM, Cancer Research Center of Marseille, Institut Paoli-Calmettes, Aix-Marseille Univ, CNRS, Marseille, France
- Pedro J. Ballester
- CRCM, INSERM, Cancer Research Center of Marseille, Institut Paoli-Calmettes, Aix-Marseille Univ, CNRS, Marseille, France
20
Lind AP, Anderson PC. Predicting drug activity against cancer cells by random forest models based on minimal genomic information and chemical properties. PLoS One 2019; 14:e0219774. [PMID: 31295321 PMCID: PMC6622537 DOI: 10.1371/journal.pone.0219774] [Citation(s) in RCA: 55] [Impact Index Per Article: 9.2] [Received: 11/01/2018] [Accepted: 07/01/2019] [Indexed: 12/27/2022] Open
Abstract
A key goal of precision medicine is predicting the best drug therapy for a specific patient from genomic information. In oncology, cancers that appear similar pathologically can vary greatly in how they respond to the same drug. Fortunately, data from high-throughput screening programs often reveal important relationships between genomic variability of cancer cells and their response to drugs. Nevertheless, many current computational methods to predict compound activity against cancer cells require large quantities of genomic, epigenomic, and additional cellular data to develop and to apply. Here we integrate recent screening data and machine learning to train classification models that predict the activity/inactivity of compounds against cancer cells based on the mutational status of only 145 oncogenes and a set of compound structural descriptors. Using IC50 values of 1 μM as activity cutoffs, our predictive models have sensitivities of 87%, specificities of 87%, and yield an area under the receiver operating characteristic curve equal to 0.94. We also develop regression models to predict log(IC50) values of compounds for cancer cells; the models achieve a Pearson correlation coefficient of 0.86 for cross-validation and up to 0.65-0.73 against blind test sets. Predictive performance remains strong when as few as 50 oncogenes are included. Finally, even when 40% of experimental IC50 values are missing from screening data, they can be imputed with sufficient reliability that classification accuracy is not diminished. The presented models are fast to generate and may serve as easily implemented screening tools for personalized oncology medicine, drug repurposing, and drug discovery.
Affiliation(s)
- Alex P. Lind
- Physical Sciences Division, University of Washington Bothell, Bothell, Washington, United States of America
- Peter C. Anderson
- Physical Sciences Division, University of Washington Bothell, Bothell, Washington, United States of America
21
Cortés-Ciriano I, Bender A. KekuleScope: prediction of cancer cell line sensitivity and compound potency using convolutional neural networks trained on compound images. J Cheminform 2019; 11:41. [PMID: 31218493 PMCID: PMC6582521 DOI: 10.1186/s13321-019-0364-5] [Citation(s) in RCA: 33] [Impact Index Per Article: 5.5] [Received: 02/19/2019] [Accepted: 06/09/2019] [Indexed: 02/08/2023] Open
Abstract
The application of convolutional neural networks (ConvNets) to harness high-content screening images or 2D compound representations is gaining increasing attention in drug discovery. However, existing applications often require large data sets for training, or sophisticated pretraining schemes. Here, we show using 33 IC50 data sets from ChEMBL 23 that the in vitro activity of compounds on cancer cell lines and protein targets can be accurately predicted on a continuous scale from their Kekulé structure representations alone by extending existing architectures (AlexNet, DenseNet-201, ResNet152 and VGG-19), which were pretrained on unrelated image data sets. We show that the predictive power of the generated models, which just require standard 2D compound representations as input, is comparable to that of Random Forest (RF) models and fully-connected Deep Neural Networks trained on circular (Morgan) fingerprints. Notably, including additional fully-connected layers further increases the predictive power of the ConvNets by up to 10%. Analysis of the predictions generated by RF models and ConvNets shows that by simply averaging the output of the RF models and ConvNets we obtain significantly lower errors in prediction for multiple data sets, although the effect size is small, than those obtained with either model alone, indicating that the features extracted by the convolutional layers of the ConvNets provide complementary predictive signal to Morgan fingerprints. Lastly, we show that multi-task ConvNets trained on compound images permit to model COX isoform selectivity on a continuous scale with errors in prediction comparable to the uncertainty of the data. Overall, in this work we present a set of ConvNet architectures for the prediction of compound activity from their Kekulé structure representations with state-of-the-art performance, that require no generation of compound descriptors or use of sophisticated image processing techniques. 
The code needed to reproduce the results presented in this study and all the data sets are provided at https://github.com/isidroc/kekulescope .
Affiliation(s)
- Isidro Cortés-Ciriano
- Centre for Molecular Informatics, Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge, CB2 1EW UK
- Andreas Bender
- Centre for Molecular Informatics, Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge, CB2 1EW UK
22
Piñeiro-Yáñez E, Reboiro-Jato M, Gómez-López G, Perales-Patón J, Troulé K, Rodríguez JM, Tejero H, Shimamura T, López-Casas PP, Carretero J, Valencia A, Hidalgo M, Glez-Peña D, Al-Shahrour F. PanDrugs: a novel method to prioritize anticancer drug treatments according to individual genomic data. Genome Med 2018; 10:41. [PMID: 29848362 PMCID: PMC5977747 DOI: 10.1186/s13073-018-0546-1] [Citation(s) in RCA: 49] [Impact Index Per Article: 7.0] [Received: 12/28/2017] [Accepted: 05/04/2018] [Indexed: 02/07/2023] Open
Abstract
BACKGROUND Large-sequencing cancer genome projects have shown that tumors have thousands of molecular alterations and their frequency is highly heterogeneous. In such scenarios, physicians and oncologists routinely face lists of cancer genomic alterations where only a minority of them are relevant biomarkers to drive clinical decision-making. For this reason, the medical community agrees on the urgent need of methodologies to establish the relevance of tumor alterations, assisting in genomic profile interpretation, and, more importantly, to prioritize those that could be clinically actionable for cancer therapy. RESULTS We present PanDrugs, a new computational methodology to guide the selection of personalized treatments in cancer patients using the variant lists provided by genome-wide sequencing analyses. PanDrugs offers the largest database of drug-target associations available from well-known targeted therapies to preclinical drugs. Scoring data-driven gene cancer relevance and drug feasibility PanDrugs interprets genomic alterations and provides a prioritized evidence-based list of anticancer therapies. Our tool represents the first drug prescription strategy applying a rational based on pathway context, multi-gene markers impact and information provided by functional experiments. Our approach has been systematically applied to TCGA patients and successfully validated in a cancer case study with a xenograft mouse model demonstrating its utility. CONCLUSIONS PanDrugs is a feasible method to identify potentially druggable molecular alterations and prioritize drugs to facilitate the interpretation of genomic landscape and clinical decision-making in cancer patients. Our approach expands the search of druggable genomic alterations from the concept of cancer driver genes to the druggable pathway context extending anticancer therapeutic options beyond already known cancer genes. 
The methodology is public and easily integratable with custom pipelines through its programmatic API or its docker image. The PanDrugs webtool is freely accessible at http://www.pandrugs.org .
Affiliation(s)
- Elena Piñeiro-Yáñez
- Spanish National Cancer Research Centre (CNIO), 3rd Melchor Fernandez Almagro st., E-28029, Madrid, Spain
- Miguel Reboiro-Jato
- Computer Science Department - University of Vigo, Vigo, Spain
- Biomedical Research Centre (CINBIO), Vigo, Spain
- Gonzalo Gómez-López
- Spanish National Cancer Research Centre (CNIO), 3rd Melchor Fernandez Almagro st., E-28029, Madrid, Spain
- Javier Perales-Patón
- Spanish National Cancer Research Centre (CNIO), 3rd Melchor Fernandez Almagro st., E-28029, Madrid, Spain
- Kevin Troulé
- Spanish National Cancer Research Centre (CNIO), 3rd Melchor Fernandez Almagro st., E-28029, Madrid, Spain
- Héctor Tejero
- Spanish National Cancer Research Centre (CNIO), 3rd Melchor Fernandez Almagro st., E-28029, Madrid, Spain
- Takeshi Shimamura
- Loyola University Chicago Stritch School of Medicine, Maywood, IL, USA
- Pedro Pablo López-Casas
- Spanish National Cancer Research Centre (CNIO), 3rd Melchor Fernandez Almagro st., E-28029, Madrid, Spain
- Julián Carretero
- Department of Physiology - University of Valencia, Valencia, Spain
- Alfonso Valencia
- Spanish National Cancer Research Centre (CNIO), 3rd Melchor Fernandez Almagro st., E-28029, Madrid, Spain
- Manuel Hidalgo
- Spanish National Cancer Research Centre (CNIO), 3rd Melchor Fernandez Almagro st., E-28029, Madrid, Spain
- Beth Israel Deaconess Medical Center, Boston, USA
- Daniel Glez-Peña
- Computer Science Department - University of Vigo, Vigo, Spain
- Biomedical Research Centre (CINBIO), Vigo, Spain
- Fátima Al-Shahrour
- Spanish National Cancer Research Centre (CNIO), 3rd Melchor Fernandez Almagro st., E-28029, Madrid, Spain