1
Kilgore HR, Chinn I, Mikhael PG, Mitnikov I, Van Dongen C, Zylberberg G, Afeyan L, Banani S, Wilson-Hawken S, Lee TI, Barzilay R, Young RA. Chemical codes promote selective compartmentalization of proteins. bioRxiv 2024:2024.04.15.589616. [PMID: 38659952 PMCID: PMC11042338 DOI: 10.1101/2024.04.15.589616] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/26/2024]
Abstract
Cells have evolved mechanisms to distribute ~10 billion protein molecules to subcellular compartments where diverse proteins involved in shared functions must efficiently assemble. Such assembly is presumed to unfold as a result of specific interactions between biomolecules; however, recent evidence suggests that distinctive chemical environments within subcellular compartments may also play an important role. Here, we test the hypothesis that protein groups with shared functions also share codes that guide them to compartment destinations. To test our hypothesis, we developed a transformer large language model, called ProtGPS, that predicts with high performance the compartment localization of human proteins excluded from the training set. We then demonstrate ProtGPS can be used for guided generation of novel protein sequences that selectively assemble into specific compartments in cells. Furthermore, ProtGPS predictions were sensitive to disease-associated mutations that produce changes in protein compartmentalization, suggesting that this type of pathogenic dysfunction can be discovered in silico. Our results indicate that protein sequences contain not only a folding code, but also a previously unrecognized chemical code governing their distribution in specific cellular compartments.
Affiliation(s)
- Henry R. Kilgore
  - Whitehead Institute for Biomedical Research, Cambridge, MA 02142, USA
- Itamar Chinn
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
- Peter G. Mikhael
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
- Ilan Mitnikov
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
- Guy Zylberberg
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
- Lena Afeyan
  - Whitehead Institute for Biomedical Research, Cambridge, MA 02142, USA
  - Department of Biology, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
- Salman Banani
  - Whitehead Institute for Biomedical Research, Cambridge, MA 02142, USA
  - Department of Pathology, Brigham and Women’s Hospital, Harvard Medical School, Boston, MA 02115, USA
- Susana Wilson-Hawken
  - Whitehead Institute for Biomedical Research, Cambridge, MA 02142, USA
  - Program of Computational & Systems Biology, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
- Tong Ihn Lee
  - Whitehead Institute for Biomedical Research, Cambridge, MA 02142, USA
- Regina Barzilay
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
- Richard A. Young
  - Whitehead Institute for Biomedical Research, Cambridge, MA 02142, USA
  - Department of Biology, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
2
Kilgore HR, Mikhael PG, Overholt KJ, Boija A, Hannett NM, Van Dongen C, Lee TI, Chang YT, Barzilay R, Young RA. Distinct chemical environments in biomolecular condensates. Nat Chem Biol 2024; 20:291-301. [PMID: 37770698 DOI: 10.1038/s41589-023-01432-0] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 10/01/2022] [Accepted: 08/31/2023] [Indexed: 09/30/2023]
Abstract
Diverse mechanisms have been described for selective enrichment of biomolecules in membrane-bound organelles, but less is known about mechanisms by which molecules are selectively incorporated into biomolecular assemblies such as condensates that lack surrounding membranes. The chemical environments within condensates may differ from those outside these bodies, and if these differed among various types of condensate, then the different solvation environments would provide a mechanism for selective distribution among these intracellular bodies. Here we use small molecule probes to show that different condensates have distinct chemical solvating properties and that selective partitioning of probes in condensates can be predicted with deep learning approaches. Our results demonstrate that different condensates harbor distinct chemical environments that influence the distribution of molecules, show that clues to condensate chemical grammar can be ascertained by machine learning and suggest approaches to facilitate development of small molecule therapeutics with optimal subcellular distribution and therapeutic benefit.
Affiliation(s)
- Henry R Kilgore
  - Whitehead Institute for Biomedical Research, Cambridge, MA, USA.
- Peter G Mikhael
  - Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA, USA
  - Jameel Clinic, Massachusetts Institute of Technology, Cambridge, MA, USA
- Kalon J Overholt
  - Whitehead Institute for Biomedical Research, Cambridge, MA, USA
  - Department of Biological Engineering, Massachusetts Institute of Technology, Cambridge, MA, USA
- Ann Boija
  - Whitehead Institute for Biomedical Research, Cambridge, MA, USA
- Nancy M Hannett
  - Whitehead Institute for Biomedical Research, Cambridge, MA, USA
- Tong Ihn Lee
  - Whitehead Institute for Biomedical Research, Cambridge, MA, USA
- Young-Tae Chang
  - Department of Chemistry, Pohang University of Science and Technology, Pohang, Republic of Korea
- Regina Barzilay
  - Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA, USA
  - Jameel Clinic, Massachusetts Institute of Technology, Cambridge, MA, USA
- Richard A Young
  - Whitehead Institute for Biomedical Research, Cambridge, MA, USA.
  - Department of Biology, Massachusetts Institute of Technology, Cambridge, MA, USA.
3
Corso G, Deng A, Fry B, Polizzi N, Barzilay R, Jaakkola T. Deep Confident Steps to New Pockets: Strategies for Docking Generalization. ArXiv 2024:arXiv:2402.18396v1. [PMID: 38463508 PMCID: PMC10925391] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/12/2024]
Abstract
Accurate blind docking has the potential to lead to new biological breakthroughs, but for this promise to be realized, docking methods must generalize well across the proteome. Existing benchmarks, however, fail to rigorously assess generalizability. Therefore, we develop DockGen, a new benchmark based on the ligand-binding domains of proteins, and we show that existing machine learning-based docking models have very weak generalization abilities. We carefully analyze the scaling laws of ML-based docking and show that, by scaling data and model size, as well as integrating synthetic data strategies, we are able to significantly increase the generalization capacity and set new state-of-the-art performance across benchmarks. Further, we propose Confidence Bootstrapping, a new training paradigm that solely relies on the interaction between diffusion and confidence models and exploits the multi-resolution generation process of diffusion models. We demonstrate that Confidence Bootstrapping significantly improves the ability of ML-based docking methods to dock to unseen protein classes, edging closer to accurate and generalizable blind docking methods.
Affiliation(s)
- Benjamin Fry
  - Dana-Farber Cancer Institute and Harvard Medical School
4
Yim J, Campbell A, Mathieu E, Foong AYK, Gastegger M, Jiménez-Luna J, Lewis S, Satorras VG, Veeling BS, Noé F, Barzilay R, Jaakkola TS. Improved motif-scaffolding with SE(3) flow matching. ArXiv 2024:arXiv:2401.04082v1. [PMID: 38259348 PMCID: PMC10802670] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/24/2024]
Abstract
Protein design often begins with knowledge of a desired function from a motif which motif-scaffolding aims to construct a functional protein around. Recently, generative models have achieved breakthrough success in designing scaffolds for a diverse range of motifs. However, the generated scaffolds tend to lack structural diversity, which can hinder success in wet-lab validation. In this work, we extend FrameFlow, an SE(3) flow matching model for protein backbone generation, to perform motif-scaffolding with two complementary approaches. The first is motif amortization, in which FrameFlow is trained with the motif as input using a data augmentation strategy. The second is motif guidance, which performs scaffolding using an estimate of the conditional score from FrameFlow, and requires no additional training. Both approaches achieve an equivalent or higher success rate than previous state-of-the-art methods, with 2.5 times more structurally diverse scaffolds. Code: https://github.com/microsoft/frame-flow.
Affiliation(s)
- Jason Yim
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology
- Regina Barzilay
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology
- Tommi S Jaakkola
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology
5
Koscher BA, Canty RB, McDonald MA, Greenman KP, McGill CJ, Bilodeau CL, Jin W, Wu H, Vermeire FH, Jin B, Hart T, Kulesza T, Li SC, Jaakkola TS, Barzilay R, Gómez-Bombarelli R, Green WH, Jensen KF. Autonomous, multiproperty-driven molecular discovery: From predictions to measurements and back. Science 2023; 382:eadi1407. [PMID: 38127734 DOI: 10.1126/science.adi1407] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 04/05/2023] [Accepted: 11/09/2023] [Indexed: 12/23/2023]
Abstract
A closed-loop, autonomous molecular discovery platform driven by integrated machine learning tools was developed to accelerate the design of molecules with desired properties. We demonstrated two case studies on dye-like molecules, targeting absorption wavelength, lipophilicity, and photooxidative stability. In the first study, the platform experimentally realized 294 unreported molecules across three automatic iterations of molecular design-make-test-analyze cycles while exploring the structure-function space of four rarely reported scaffolds. In each iteration, the property prediction models that guided exploration learned the structure-property space of diverse scaffold derivatives, which were realized with multistep syntheses and a variety of reactions. The second study exploited property models trained on the explored chemical space and previously reported molecules to discover nine top-performing molecules within a lightly explored structure-property space.
Affiliation(s)
- Brent A Koscher
  - Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA, USA
- Richard B Canty
  - Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA, USA
- Matthew A McDonald
  - Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA, USA
- Kevin P Greenman
  - Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA, USA
- Charles J McGill
  - Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA, USA
- Camille L Bilodeau
  - Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA, USA
- Wengong Jin
  - Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Haoyang Wu
  - Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA, USA
- Florence H Vermeire
  - Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA, USA
- Brooke Jin
  - Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA, USA
- Travis Hart
  - Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA, USA
- Timothy Kulesza
  - Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA, USA
- Shih-Cheng Li
  - Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA, USA
- Tommi S Jaakkola
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, USA
- Regina Barzilay
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, USA
- Rafael Gómez-Bombarelli
  - Department of Materials Science and Engineering, Massachusetts Institute of Technology, Cambridge, MA, USA
- William H Green
  - Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA, USA
- Klavs F Jensen
  - Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA, USA
6
Kilgore HR, Mikhael PG, Overholt KJ, Boija A, Hannett NM, Van Dongen C, Lee TI, Chang YT, Barzilay R, Young RA. Author Correction: Distinct chemical environments in biomolecular condensates. Nat Chem Biol 2023; 19:1561. [PMID: 37880420 DOI: 10.1038/s41589-023-01491-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/27/2023]
Affiliation(s)
- Henry R Kilgore
  - Whitehead Institute for Biomedical Research, Cambridge, MA, USA.
- Peter G Mikhael
  - Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA, USA
  - Jameel Clinic, Massachusetts Institute of Technology, Cambridge, MA, USA
- Kalon J Overholt
  - Whitehead Institute for Biomedical Research, Cambridge, MA, USA
  - Department of Biological Engineering, Massachusetts Institute of Technology, Cambridge, MA, USA
- Ann Boija
  - Whitehead Institute for Biomedical Research, Cambridge, MA, USA
- Nancy M Hannett
  - Whitehead Institute for Biomedical Research, Cambridge, MA, USA
- Tong Ihn Lee
  - Whitehead Institute for Biomedical Research, Cambridge, MA, USA
- Young-Tae Chang
  - Department of Chemistry, Pohang University of Science and Technology, Pohang, Republic of Korea
- Regina Barzilay
  - Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA, USA
  - Jameel Clinic, Massachusetts Institute of Technology, Cambridge, MA, USA
- Richard A Young
  - Whitehead Institute for Biomedical Research, Cambridge, MA, USA.
  - Department of Biology, Massachusetts Institute of Technology, Cambridge, MA, USA.
7
Abstract
Recent advances in artificial intelligence and machine learning (AI/ML) hold substantial promise to address some of the current challenges in lung cancer screening and improve health equity. This article reviews the status and future directions of AI/ML tools in the lung cancer screening workflow, focusing on determining screening eligibility, radiation dose reduction and image denoising for low-dose chest computed tomography (CT), lung nodule detection, lung nodule classification, and determining optimal screening intervals. AI/ML tools can assess for chronic diseases on CT, which creates opportunities to improve population health through opportunistic screening.
Affiliation(s)
- Scott J Adams
  - Department of Radiology, Stanford University School of Medicine, Stanford, CA, USA
- Peter Mikhael
  - Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA, USA; Jameel Clinic, Massachusetts Institute of Technology, Cambridge, MA, USA
- Jeremy Wohlwend
  - Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA, USA; Jameel Clinic, Massachusetts Institute of Technology, Cambridge, MA, USA
- Regina Barzilay
  - Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA, USA; Jameel Clinic, Massachusetts Institute of Technology, Cambridge, MA, USA
- Lecia V Sequist
  - Department of Medicine, Massachusetts General Hospital, Harvard Medical School, 55 Fruit Street, Boston, MA 02114, USA; Harvard Medical School, Boston, MA, USA.
- Florian J Fintelmann
  - Harvard Medical School, Boston, MA, USA; Department of Radiology, Massachusetts General Hospital, 55 Fruit Street, Boston, MA 02114, USA.
8
Liu G, Catacutan DB, Rathod K, Swanson K, Jin W, Mohammed JC, Chiappino-Pepe A, Syed SA, Fragis M, Rachwalski K, Magolan J, Surette MG, Coombes BK, Jaakkola T, Barzilay R, Collins JJ, Stokes JM. Deep learning-guided discovery of an antibiotic targeting Acinetobacter baumannii. Nat Chem Biol 2023; 19:1342-1350. [PMID: 37231267 DOI: 10.1038/s41589-023-01349-8] [Citation(s) in RCA: 37] [Impact Index Per Article: 37.0] [Received: 03/25/2022] [Accepted: 04/25/2023] [Indexed: 05/27/2023]
Abstract
Acinetobacter baumannii is a nosocomial Gram-negative pathogen that often displays multidrug resistance. Discovering new antibiotics against A. baumannii has proven challenging through conventional screening approaches. Fortunately, machine learning methods allow for the rapid exploration of chemical space, increasing the probability of discovering new antibacterial molecules. Here we screened ~7,500 molecules for those that inhibited the growth of A. baumannii in vitro. We trained a neural network with this growth inhibition dataset and performed in silico predictions for structurally new molecules with activity against A. baumannii. Through this approach, we discovered abaucin, an antibacterial compound with narrow-spectrum activity against A. baumannii. Further investigations revealed that abaucin perturbs lipoprotein trafficking through a mechanism involving LolE. Moreover, abaucin could control an A. baumannii infection in a mouse wound model. This work highlights the utility of machine learning in antibiotic discovery and describes a promising lead with targeted activity against a challenging Gram-negative pathogen.
Affiliation(s)
- Gary Liu
  - Department of Biochemistry and Biomedical Sciences, Michael G. DeGroote Institute for Infectious Disease Research, David Braley Centre for Antibiotic Discovery, McMaster University, Hamilton, Ontario, Canada
- Denise B Catacutan
  - Department of Biochemistry and Biomedical Sciences, Michael G. DeGroote Institute for Infectious Disease Research, David Braley Centre for Antibiotic Discovery, McMaster University, Hamilton, Ontario, Canada
- Khushi Rathod
  - Department of Biochemistry and Biomedical Sciences, Michael G. DeGroote Institute for Infectious Disease Research, David Braley Centre for Antibiotic Discovery, McMaster University, Hamilton, Ontario, Canada
- Kyle Swanson
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, USA
- Wengong Jin
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, USA
- Jody C Mohammed
  - Department of Biochemistry and Biomedical Sciences, Michael G. DeGroote Institute for Infectious Disease Research, David Braley Centre for Antibiotic Discovery, McMaster University, Hamilton, Ontario, Canada
- Anush Chiappino-Pepe
  - Wyss Institute for Biologically Inspired Engineering, Harvard University, Boston, MA, USA
  - Department of Genetics, Harvard Medical School, Boston, MA, USA
- Saad A Syed
  - Department of Medicine, Department of Biochemistry and Biomedical Sciences, Farncombe Family Digestive Health Research Institute, McMaster University, Hamilton, Ontario, Canada
- Meghan Fragis
  - Department of Biochemistry and Biomedical Sciences, Michael G. DeGroote Institute for Infectious Disease Research, David Braley Centre for Antibiotic Discovery, McMaster University, Hamilton, Ontario, Canada
  - Department of Chemistry and Chemical Biology, McMaster University, Hamilton, Ontario, Canada
- Kenneth Rachwalski
  - Department of Biochemistry and Biomedical Sciences, Michael G. DeGroote Institute for Infectious Disease Research, David Braley Centre for Antibiotic Discovery, McMaster University, Hamilton, Ontario, Canada
- Jakob Magolan
  - Department of Biochemistry and Biomedical Sciences, Michael G. DeGroote Institute for Infectious Disease Research, David Braley Centre for Antibiotic Discovery, McMaster University, Hamilton, Ontario, Canada
  - Department of Chemistry and Chemical Biology, McMaster University, Hamilton, Ontario, Canada
- Michael G Surette
  - Department of Medicine, Department of Biochemistry and Biomedical Sciences, Farncombe Family Digestive Health Research Institute, McMaster University, Hamilton, Ontario, Canada
- Brian K Coombes
  - Department of Biochemistry and Biomedical Sciences, Michael G. DeGroote Institute for Infectious Disease Research, David Braley Centre for Antibiotic Discovery, McMaster University, Hamilton, Ontario, Canada
- Tommi Jaakkola
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, USA
- Regina Barzilay
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, USA
  - Abdul Latif Jameel Clinic for Machine Learning in Health, Massachusetts Institute of Technology, Cambridge, MA, USA
- James J Collins
  - Wyss Institute for Biologically Inspired Engineering, Harvard University, Boston, MA, USA.
  - Abdul Latif Jameel Clinic for Machine Learning in Health, Massachusetts Institute of Technology, Cambridge, MA, USA.
  - Department of Biological Engineering, Institute for Medical Engineering and Science, Massachusetts Institute of Technology, Cambridge, MA, USA.
  - Broad Institute of MIT and Harvard, Cambridge, MA, USA.
- Jonathan M Stokes
  - Department of Biochemistry and Biomedical Sciences, Michael G. DeGroote Institute for Infectious Disease Research, David Braley Centre for Antibiotic Discovery, McMaster University, Hamilton, Ontario, Canada.
9
Simon J, Mikhael P, Tahir I, Graur A, Ringer S, Fata A, Jeffrey YCF, Shepard JA, Jacobson F, Barzilay R, Sequist LV, Pace LE, Fintelmann FJ. Role of sex in lung cancer risk prediction based on single low-dose chest computed tomography. Sci Rep 2023; 13:18611. [PMID: 37903855 PMCID: PMC10616081 DOI: 10.1038/s41598-023-45671-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/31/2023] [Accepted: 10/22/2023] [Indexed: 11/01/2023] [Open access]
Abstract
A validated open-source deep-learning algorithm called Sybil can accurately predict long-term lung cancer risk from a single low-dose chest computed tomography (LDCT). However, Sybil was trained on a majority-male cohort. Use of artificial intelligence algorithms trained on imbalanced cohorts may lead to inequitable outcomes in real-world settings. We aimed to study whether Sybil predicts lung cancer risk equally regardless of sex. We analyzed 10,573 LDCTs from 6127 consecutive lung cancer screening participants across a health system between 2015 and 2021. Sybil achieved AUCs of 0.89 (95% CI: 0.85-0.93) for females and 0.89 (95% CI: 0.85-0.94) for males at 1 year, p = 0.92. At 6 years, the AUC was 0.87 (95% CI: 0.83-0.93) for females and 0.79 (95% CI: 0.72-0.86) for males, p = 0.01. In conclusion, Sybil can accurately predict future lung cancer risk in females and males in a real-world setting and performs better in females than in males for predicting 6-year lung cancer risk.
Affiliation(s)
- Judit Simon
  - Division of Thoracic Imaging and Intervention, Department of Radiology, Massachusetts General Hospital, 55 Fruit Street, Boston, MA, 02114, USA
  - Harvard Medical School, Boston, MA, USA
- Peter Mikhael
  - Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA, USA
  - Jameel Clinic, Massachusetts Institute of Technology, Cambridge, MA, USA
- Ismail Tahir
  - Division of Thoracic Imaging and Intervention, Department of Radiology, Massachusetts General Hospital, 55 Fruit Street, Boston, MA, 02114, USA
  - Harvard Medical School, Boston, MA, USA
- Alexander Graur
  - Division of Thoracic Imaging and Intervention, Department of Radiology, Massachusetts General Hospital, 55 Fruit Street, Boston, MA, 02114, USA
- Stefan Ringer
  - Division of Thoracic Imaging and Intervention, Department of Radiology, Massachusetts General Hospital, 55 Fruit Street, Boston, MA, 02114, USA
- Amanda Fata
  - Department of Medicine, Brigham and Women's Hospital, Boston, MA, USA
- Yang Chi-Fu Jeffrey
  - Harvard Medical School, Boston, MA, USA
  - Department of Surgery, Massachusetts General Hospital, Boston, MA, USA
- Jo-Anne Shepard
  - Division of Thoracic Imaging and Intervention, Department of Radiology, Massachusetts General Hospital, 55 Fruit Street, Boston, MA, 02114, USA
  - Harvard Medical School, Boston, MA, USA
- Francine Jacobson
  - Harvard Medical School, Boston, MA, USA
  - Division of Thoracic Imaging, Department of Radiology, Brigham and Women's Hospital, Boston, MA, USA
- Regina Barzilay
  - Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA, USA
  - Jameel Clinic, Massachusetts Institute of Technology, Cambridge, MA, USA
- Lecia V Sequist
  - Harvard Medical School, Boston, MA, USA
  - Department of Medicine, Massachusetts General Hospital, Boston, MA, USA
- Lydia E Pace
  - Harvard Medical School, Boston, MA, USA
  - Department of Medicine, Brigham and Women's Hospital, Boston, MA, USA
- Florian J Fintelmann
  - Division of Thoracic Imaging and Intervention, Department of Radiology, Massachusetts General Hospital, 55 Fruit Street, Boston, MA, 02114, USA.
  - Harvard Medical School, Boston, MA, USA.
10
Panayi A, Ward K, Benhadji-Schaff A, Ibanez-Lopez AS, Xia A, Barzilay R. Evaluation of a prototype machine learning tool to semi-automate data extraction for systematic literature reviews. Syst Rev 2023; 12:187. [PMID: 37803451 PMCID: PMC10557215 DOI: 10.1186/s13643-023-02351-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/08/2023] [Accepted: 09/13/2023] [Indexed: 10/08/2023] [Open access]
Abstract
BACKGROUND Evidence-based medicine requires synthesis of research through rigorous and time-intensive systematic literature reviews (SLRs), with significant resource expenditure for data extraction from scientific publications. Machine learning may enable the timely completion of SLRs and reduce errors by automating data identification and extraction. METHODS We evaluated the use of machine learning to extract data from publications related to SLRs in oncology (SLR 1) and Fabry disease (SLR 2). SLR 1 predominantly contained interventional studies and SLR 2 observational studies. Predefined key terms and data were manually annotated to train and test bidirectional encoder representations from transformers (BERT) and bidirectional long-short-term memory machine learning models. Using human annotation as a reference, we assessed the ability of the models to identify biomedical terms of interest (entities) and their relations. We also pretrained BERT on a corpus of 100,000 open access clinical publications and/or enhanced context-dependent entity classification with a conditional random field (CRF) model. Performance was measured using the F1 score, a metric that combines precision and recall. We defined successful matches as partial overlap of entities of the same type. RESULTS For entity recognition, the pretrained BERT+CRF model had the best performance, with an F1 score of 73% in SLR 1 and 70% in SLR 2. Entity types identified with the highest accuracy were metrics for progression-free survival (SLR 1, F1 score 88%) or for patient age (SLR 2, F1 score 82%). Treatment arm dosage was identified less successfully (F1 scores 60% [SLR 1] and 49% [SLR 2]). The best-performing model for relation extraction, pretrained BERT relation classification, exhibited F1 scores higher than 90% in cases with at least 80 relation examples for a pair of related entity types. 
CONCLUSIONS The performance of BERT is enhanced by pretraining with biomedical literature and by combining with a CRF model. With refinement, machine learning may assist with manual data extraction for SLRs.
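The scoring rule described in the abstract (F1 over entities, counting a partial overlap between entities of the same type as a match) can be illustrated with a minimal sketch. The span representation `(start, end, type)`, the function names, and the choice to count precision over predictions and recall over gold annotations are assumptions made for illustration only; the abstract does not specify how the authors implemented their evaluation.

```python
from typing import List, Tuple

# Hypothetical span format: (start offset, end offset exclusive, entity type).
Entity = Tuple[int, int, str]


def overlaps(a: Entity, b: Entity) -> bool:
    """True when two spans share a type and at least one character."""
    return a[2] == b[2] and a[0] < b[1] and b[0] < a[1]


def partial_match_f1(gold: List[Entity], pred: List[Entity]) -> float:
    """F1 = 2PR / (P + R), with partial same-type overlap counted as a match."""
    # Precision: fraction of predicted entities that overlap some gold entity.
    matched_pred = sum(any(overlaps(p, g) for g in gold) for p in pred)
    # Recall: fraction of gold entities that overlap some predicted entity.
    matched_gold = sum(any(overlaps(g, p) for p in pred) for g in gold)
    precision = matched_pred / len(pred) if pred else 0.0
    recall = matched_gold / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


gold_spans = [(0, 5, "AGE"), (10, 20, "DOSE")]
pred_spans = [(3, 8, "AGE"), (30, 35, "DOSE")]
example_f1 = partial_match_f1(gold_spans, pred_spans)
```

In this toy case one of two predictions overlaps a gold span and one of two gold spans is recovered, so precision and recall are both 0.5.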
Affiliation(s)
- Antonia Panayi
  - Takeda Pharmaceuticals International AG, Thurgauerstrasse 130, 8152, Glattpark-Opfikon, Zurich, Switzerland.
- Andrew Xia
  - Takeda Pharmaceuticals International AG, Thurgauerstrasse 130, 8152, Glattpark-Opfikon, Zurich, Switzerland
11
Furuhama A, Kitazawa A, Yao J, Matos Dos Santos CE, Rathman J, Yang C, Ribeiro JV, Cross K, Myatt G, Raitano G, Benfenati E, Jeliazkova N, Saiakhov R, Chakravarti S, Foster RS, Bossa C, Battistelli CL, Benigni R, Sawada T, Wasada H, Hashimoto T, Wu M, Barzilay R, Daga PR, Clark RD, Mestres J, Montero A, Gregori-Puigjané E, Petkov P, Ivanova H, Mekenyan O, Matthews S, Guan D, Spicer J, Lui R, Uesawa Y, Kurosaki K, Matsuzaka Y, Sasaki S, Cronin MTD, Belfield SJ, Firman JW, Spînu N, Qiu M, Keca JM, Gini G, Li T, Tong W, Hong H, Liu Z, Igarashi Y, Yamada H, Sugiyama KI, Honma M. Evaluation of QSAR models for predicting mutagenicity: outcome of the Second Ames/QSAR international challenge project. SAR QSAR Environ Res 2023; 34:983-1001. [PMID: 38047445 DOI: 10.1080/1062936x.2023.2284902] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/11/2023] [Accepted: 11/13/2023] [Indexed: 12/05/2023]
Abstract
Quantitative structure-activity relationship (QSAR) models are powerful in silico tools for predicting the mutagenicity of unstable compounds, impurities and metabolites that are difficult to examine using the Ames test. Ideally, Ames/QSAR models for regulatory use should demonstrate high sensitivity, low false-negative rate and wide coverage of chemical space. To promote superior model development, the Division of Genetics and Mutagenesis, National Institute of Health Sciences, Japan (DGM/NIHS), conducted the Second Ames/QSAR International Challenge Project (2020-2022) as a successor to the First Project (2014-2017), with 21 teams from 11 countries participating. The DGM/NIHS provided a curated training dataset of approximately 12,000 chemicals and a trial dataset of approximately 1,600 chemicals, and each participating team predicted the Ames mutagenicity of each trial chemical using various Ames/QSAR models. The DGM/NIHS then provided the Ames test results for trial chemicals to assist in model improvement. Although overall model performance on the Second Project was not superior to that on the First, models from the eight teams participating in both projects achieved higher sensitivity than models from teams participating in only the Second Project. Thus, these evaluations have facilitated the development of QSAR models.
Affiliation(s)
- A Furuhama
- Division of Genetics and Mutagenesis (DGM), National Institute of Health Sciences (NIHS), Kawasaki, Japan
| | - A Kitazawa
- Division of Genetics and Mutagenesis (DGM), National Institute of Health Sciences (NIHS), Kawasaki, Japan
| | - J Yao
- Key Laboratory of Fluorine and Nitrogen Chemistry and Advanced Materials (Chinese Academy of Sciences), Shanghai Institute of Organic Chemistry, Chinese Academy of Sciences (SIOC, CAS), Shanghai, China
| | - C E Matos Dos Santos
- Department of Computational Toxicology and In Silico Innovations, Altox Ltd, São Paulo-SP, Brazil
| | - J Rathman
- MN-AM, Nuremberg, Germany/Columbus, OH, USA
| | - C Yang
- MN-AM, Nuremberg, Germany/Columbus, OH, USA
| | | | - K Cross
- In Silico Department, Instem, Conshohocken, PA, USA
| | - G Myatt
- In Silico Department, Instem, Conshohocken, PA, USA
| | - G Raitano
- Laboratory of Environmental Toxicology and Chemistry, Department of Environmental Health Sciences, Istituto di Ricerche Farmacologiche Mario Negri IRCCS (IRFMN), Milano, Italy
| | - E Benfenati
- Laboratory of Environmental Toxicology and Chemistry, Department of Environmental Health Sciences, Istituto di Ricerche Farmacologiche Mario Negri IRCCS (IRFMN), Milano, Italy
| | | | | | | | | | - C Bossa
- Environment and Health Department, Istituto Superiore di Sanità (ISS), Rome, Italy
| | - C Laura Battistelli
- Environment and Health Department, Istituto Superiore di Sanità (ISS), Rome, Italy
| | - R Benigni
- Environment and Health Department, Istituto Superiore di Sanità (ISS), Rome, Italy
- Alpha-PreTox, Rome, Italy
| | - T Sawada
- Faculty of Regional Studies, Gifu University, Gifu, Japan
- xenoBiotic Inc, Gifu, Japan
| | - H Wasada
- Faculty of Regional Studies, Gifu University, Gifu, Japan
| | - T Hashimoto
- Faculty of Regional Studies, Gifu University, Gifu, Japan
| | - M Wu
- Massachusetts Institute of Technology, Cambridge, MA, USA
| | - R Barzilay
- Massachusetts Institute of Technology, Cambridge, MA, USA
| | - P R Daga
- Simulations Plus, Lancaster, CA, USA
| | - R D Clark
- Simulations Plus, Lancaster, CA, USA
| | | | | | | | - P Petkov
- LMC - Bourgas University, Bourgas, Bulgaria
| | - H Ivanova
- LMC - Bourgas University, Bourgas, Bulgaria
| | - O Mekenyan
- LMC - Bourgas University, Bourgas, Bulgaria
| | - S Matthews
- Computational Pharmacology & Toxicology Laboratory, Discipline of Pharmacology, School of Pharmacy, Faculty of Medicine and Health, The University of Sydney, Sydney, Australia
| | - D Guan
- Computational Pharmacology & Toxicology Laboratory, Discipline of Pharmacology, School of Pharmacy, Faculty of Medicine and Health, The University of Sydney, Sydney, Australia
| | - J Spicer
- Computational Pharmacology & Toxicology Laboratory, Discipline of Pharmacology, School of Pharmacy, Faculty of Medicine and Health, The University of Sydney, Sydney, Australia
| | - R Lui
- Computational Pharmacology & Toxicology Laboratory, Discipline of Pharmacology, School of Pharmacy, Faculty of Medicine and Health, The University of Sydney, Sydney, Australia
| | - Y Uesawa
- Department of Medical Molecular Informatics, Meiji Pharmaceutical University, Tokyo, Japan
| | - K Kurosaki
- Department of Medical Molecular Informatics, Meiji Pharmaceutical University, Tokyo, Japan
| | - Y Matsuzaka
- Department of Medical Molecular Informatics, Meiji Pharmaceutical University, Tokyo, Japan
| | - S Sasaki
- Department of Medical Molecular Informatics, Meiji Pharmaceutical University, Tokyo, Japan
| | - M T D Cronin
- School of Pharmacy and Biomolecular Sciences, Liverpool John Moores University, Liverpool, UK
| | - S J Belfield
- School of Pharmacy and Biomolecular Sciences, Liverpool John Moores University, Liverpool, UK
| | - J W Firman
- School of Pharmacy and Biomolecular Sciences, Liverpool John Moores University, Liverpool, UK
| | - N Spînu
- School of Pharmacy and Biomolecular Sciences, Liverpool John Moores University, Liverpool, UK
| | - M Qiu
- Evergreen AI, Inc, Toronto, Canada
| | - J M Keca
- Evergreen AI, Inc, Toronto, Canada
| | - G Gini
- Department of Electronics, Information and Bioengineering (DEIB), Politecnico di Milano, Milano, Italy
| | - T Li
- Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, U.S. Food and Drug Administration (NCTR/FDA), Jefferson, AR, USA
| | - W Tong
- Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, U.S. Food and Drug Administration (NCTR/FDA), Jefferson, AR, USA
| | - H Hong
- Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, U.S. Food and Drug Administration (NCTR/FDA), Jefferson, AR, USA
| | - Z Liu
- Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, U.S. Food and Drug Administration (NCTR/FDA), Jefferson, AR, USA
- Integrative Toxicology, Nonclinical Drug Safety, Boehringer Ingelheim Pharmaceuticals, Inc, Ridgefield, CT, USA
| | - Y Igarashi
- Artificial Intelligence Center for Health and Biomedical Research, National Institutes of Biomedical Innovation, Health and Nutrition (NIBIOHN), Osaka, Japan
| | - H Yamada
- Artificial Intelligence Center for Health and Biomedical Research, National Institutes of Biomedical Innovation, Health and Nutrition (NIBIOHN), Osaka, Japan
| | - K-I Sugiyama
- Division of Genetics and Mutagenesis (DGM), National Institute of Health Sciences (NIHS), Kawasaki, Japan
| | - M Honma
- Division of Genetics and Mutagenesis (DGM), National Institute of Health Sciences (NIHS), Kawasaki, Japan
12
Watson JL, Juergens D, Bennett NR, Trippe BL, Yim J, Eisenach HE, Ahern W, Borst AJ, Ragotte RJ, Milles LF, Wicky BIM, Hanikel N, Pellock SJ, Courbet A, Sheffler W, Wang J, Venkatesh P, Sappington I, Torres SV, Lauko A, De Bortoli V, Mathieu E, Ovchinnikov S, Barzilay R, Jaakkola TS, DiMaio F, Baek M, Baker D. De novo design of protein structure and function with RFdiffusion. Nature 2023; 620:1089-1100. [PMID: 37433327 PMCID: PMC10468394 DOI: 10.1038/s41586-023-06415-8] [Citation(s) in RCA: 108] [Impact Index Per Article: 108.0] [Received: 12/14/2022] [Accepted: 07/07/2023] [Indexed: 07/13/2023]
Abstract
There has been considerable recent progress in designing new proteins using deep-learning methods [1-9]. Despite this progress, a general deep-learning framework for protein design that enables solution of a wide range of design challenges, including de novo binder design and design of higher-order symmetric architectures, has yet to be described. Diffusion models [10,11] have had considerable success in image and language generative modelling but limited success when applied to protein modelling, probably due to the complexity of protein backbone geometry and sequence-structure relationships. Here we show that by fine-tuning the RoseTTAFold structure prediction network on protein structure denoising tasks, we obtain a generative model of protein backbones that achieves outstanding performance on unconditional and topology-constrained protein monomer design, protein binder design, symmetric oligomer design, enzyme active site scaffolding and symmetric motif scaffolding for therapeutic and metal-binding protein design. We demonstrate the power and generality of the method, called RoseTTAFold diffusion (RFdiffusion), by experimentally characterizing the structures and functions of hundreds of designed symmetric assemblies, metal-binding proteins and protein binders. The accuracy of RFdiffusion is confirmed by the cryogenic electron microscopy structure of a designed binder in complex with influenza haemagglutinin that is nearly identical to the design model. In a manner analogous to networks that produce images from user-specified inputs, RFdiffusion enables the design of diverse functional proteins from simple molecular specifications.
Affiliation(s)
- Joseph L Watson
- Department of Biochemistry, University of Washington, Seattle, WA, USA
- Institute for Protein Design, University of Washington, Seattle, WA, USA
| | - David Juergens
- Department of Biochemistry, University of Washington, Seattle, WA, USA
- Institute for Protein Design, University of Washington, Seattle, WA, USA
- Graduate Program in Molecular Engineering, University of Washington, Seattle, WA, USA
| | - Nathaniel R Bennett
- Department of Biochemistry, University of Washington, Seattle, WA, USA
- Institute for Protein Design, University of Washington, Seattle, WA, USA
- Graduate Program in Molecular Engineering, University of Washington, Seattle, WA, USA
| | - Brian L Trippe
- Institute for Protein Design, University of Washington, Seattle, WA, USA
- Columbia University, Department of Statistics, New York, NY, USA
- Irving Institute for Cancer Dynamics, Columbia University, New York, NY, USA
| | - Jason Yim
- Institute for Protein Design, University of Washington, Seattle, WA, USA
- Massachusetts Institute of Technology, Cambridge, MA, USA
| | - Helen E Eisenach
- Department of Biochemistry, University of Washington, Seattle, WA, USA
- Institute for Protein Design, University of Washington, Seattle, WA, USA
| | - Woody Ahern
- Department of Biochemistry, University of Washington, Seattle, WA, USA
- Institute for Protein Design, University of Washington, Seattle, WA, USA
- Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, WA, USA
| | - Andrew J Borst
- Department of Biochemistry, University of Washington, Seattle, WA, USA
- Institute for Protein Design, University of Washington, Seattle, WA, USA
| | - Robert J Ragotte
- Department of Biochemistry, University of Washington, Seattle, WA, USA
- Institute for Protein Design, University of Washington, Seattle, WA, USA
| | - Lukas F Milles
- Department of Biochemistry, University of Washington, Seattle, WA, USA
- Institute for Protein Design, University of Washington, Seattle, WA, USA
| | - Basile I M Wicky
- Department of Biochemistry, University of Washington, Seattle, WA, USA
- Institute for Protein Design, University of Washington, Seattle, WA, USA
| | - Nikita Hanikel
- Department of Biochemistry, University of Washington, Seattle, WA, USA
- Institute for Protein Design, University of Washington, Seattle, WA, USA
| | - Samuel J Pellock
- Department of Biochemistry, University of Washington, Seattle, WA, USA
- Institute for Protein Design, University of Washington, Seattle, WA, USA
| | - Alexis Courbet
- Department of Biochemistry, University of Washington, Seattle, WA, USA
- Institute for Protein Design, University of Washington, Seattle, WA, USA
- National Centre for Scientific Research, École Normale Supérieure rue d'Ulm, Paris, France
| | - William Sheffler
- Department of Biochemistry, University of Washington, Seattle, WA, USA
- Institute for Protein Design, University of Washington, Seattle, WA, USA
| | - Jue Wang
- Department of Biochemistry, University of Washington, Seattle, WA, USA
- Institute for Protein Design, University of Washington, Seattle, WA, USA
| | - Preetham Venkatesh
- Department of Biochemistry, University of Washington, Seattle, WA, USA
- Institute for Protein Design, University of Washington, Seattle, WA, USA
- Graduate Program in Biological Physics, Structure and Design, University of Washington, Seattle, WA, USA
| | - Isaac Sappington
- Department of Biochemistry, University of Washington, Seattle, WA, USA
- Institute for Protein Design, University of Washington, Seattle, WA, USA
- Graduate Program in Biological Physics, Structure and Design, University of Washington, Seattle, WA, USA
| | - Susana Vázquez Torres
- Department of Biochemistry, University of Washington, Seattle, WA, USA
- Institute for Protein Design, University of Washington, Seattle, WA, USA
- Graduate Program in Biological Physics, Structure and Design, University of Washington, Seattle, WA, USA
| | - Anna Lauko
- Department of Biochemistry, University of Washington, Seattle, WA, USA
- Institute for Protein Design, University of Washington, Seattle, WA, USA
- Graduate Program in Biological Physics, Structure and Design, University of Washington, Seattle, WA, USA
| | - Valentin De Bortoli
- National Centre for Scientific Research, École Normale Supérieure rue d'Ulm, Paris, France
| | - Emile Mathieu
- Department of Engineering, University of Cambridge, Cambridge, UK
| | - Sergey Ovchinnikov
- Faculty of Applied Sciences, Harvard University, Cambridge, MA, USA
- John Harvard Distinguished Science Fellowship, Harvard University, Cambridge, MA, USA
| | | | | | - Frank DiMaio
- Department of Biochemistry, University of Washington, Seattle, WA, USA
- Institute for Protein Design, University of Washington, Seattle, WA, USA
| | - Minkyung Baek
- School of Biological Sciences, Seoul National University, Seoul, Republic of Korea
| | - David Baker
- Department of Biochemistry, University of Washington, Seattle, WA, USA.
- Institute for Protein Design, University of Washington, Seattle, WA, USA.
- Howard Hughes Medical Institute, University of Washington, Seattle, WA, USA.
13
Abstract
Reaction diagram parsing is the task of extracting reaction schemes from a diagram in the chemistry literature. The reaction diagrams can be arbitrarily complex; thus, robustly parsing them into structured data is an open challenge. In this paper, we present RxnScribe, a machine learning model for parsing reaction diagrams of varying styles. We formulate this structured prediction task with a sequence generation approach, which condenses the traditional pipeline into an end-to-end model. We train RxnScribe on a dataset of 1378 diagrams and evaluate it with cross validation, achieving an 80.0% soft match F1 score, with significant improvements over previous models. Our code and data are publicly available at https://github.com/thomas0809/RxnScribe.
Affiliation(s)
- Yujie Qian
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
| | - Jiang Guo
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
| | - Zhengkai Tu
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
| | - Connor W Coley
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
| | - Regina Barzilay
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
14
Mikhael PG, Wohlwend J, Yala A, Karstens L, Xiang J, Takigami AK, Bourgouin PP, Chan P, Mrah S, Amayri W, Juan YH, Yang CT, Wan YL, Lin G, Sequist LV, Fintelmann FJ, Barzilay R. Sybil: A Validated Deep Learning Model to Predict Future Lung Cancer Risk From a Single Low-Dose Chest Computed Tomography. J Clin Oncol 2023; 41:2191-2200. [PMID: 36634294 PMCID: PMC10419602 DOI: 10.1200/jco.22.01345] [Citation(s) in RCA: 29] [Impact Index Per Article: 29.0] [Received: 06/13/2022] [Revised: 10/10/2022] [Accepted: 11/29/2022] [Indexed: 01/13/2023] Open
Abstract
PURPOSE Low-dose computed tomography (LDCT) for lung cancer screening is effective, although most eligible people are not being screened. Tools that provide personalized future cancer risk assessment could focus approaches toward those most likely to benefit. We hypothesized that a deep learning model assessing the entire volumetric LDCT data could be built to predict individual risk without requiring additional demographic or clinical data.
METHODS We developed a model called Sybil using LDCTs from the National Lung Screening Trial (NLST). Sybil requires only one LDCT and does not require clinical data or radiologist annotations; it can run in real time in the background on a radiology reading station. Sybil was validated on three independent data sets: a heldout set of 6,282 LDCTs from NLST participants, 8,821 LDCTs from Massachusetts General Hospital (MGH), and 12,280 LDCTs from Chang Gung Memorial Hospital (CGMH, which included people with a range of smoking history including nonsmokers).
RESULTS Sybil achieved area under the receiver-operator curves for lung cancer prediction at 1 year of 0.92 (95% CI, 0.88 to 0.95) on NLST, 0.86 (95% CI, 0.82 to 0.90) on MGH, and 0.94 (95% CI, 0.91 to 1.00) on CGMH external validation sets. Concordance indices over 6 years were 0.75 (95% CI, 0.72 to 0.78), 0.81 (95% CI, 0.77 to 0.85), and 0.80 (95% CI, 0.75 to 0.86) for NLST, MGH, and CGMH, respectively.
CONCLUSION Sybil can accurately predict an individual's future lung cancer risk from a single LDCT scan to further enable personalized screening. Future study is required to understand Sybil's clinical applications. Our model and annotations are publicly available.
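As a reading aid for the area-under-the-curve figures reported above, the sketch below computes an AUC as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. This is the generic rank-based definition, not the authors' evaluation code; the function name and the toy scores are hypothetical.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison of scores.

    Assumption: label 1 means cancer diagnosed within the horizon, 0 means not.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: one positive is out-ranked by one negative.
auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```

The pairwise loop is quadratic in the number of cases, which is acceptable for illustration; a sort-based implementation would be the usual choice for data sets of the size quoted in the abstract.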
Affiliation(s)
- Peter G. Mikhael
- Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA
- Jameel Clinic, Massachusetts Institute of Technology, Cambridge, MA
| | - Jeremy Wohlwend
- Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA
- Jameel Clinic, Massachusetts Institute of Technology, Cambridge, MA
| | - Adam Yala
- Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA
- Jameel Clinic, Massachusetts Institute of Technology, Cambridge, MA
| | - Ludvig Karstens
- Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA
- Jameel Clinic, Massachusetts Institute of Technology, Cambridge, MA
| | - Justin Xiang
- Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA
- Jameel Clinic, Massachusetts Institute of Technology, Cambridge, MA
| | - Angelo K. Takigami
- Harvard Medical School, Boston, MA
- Department of Radiology, Massachusetts General Hospital, Boston, MA
| | - Patrick P. Bourgouin
- Harvard Medical School, Boston, MA
- Department of Radiology, Massachusetts General Hospital, Boston, MA
| | - PuiYee Chan
- Department of Medicine, Massachusetts General Hospital, Boston, MA
| | - Sofiane Mrah
- Department of Radiology, Massachusetts General Hospital, Boston, MA
| | - Wael Amayri
- Department of Radiology, Massachusetts General Hospital, Boston, MA
| | - Yu-Hsiang Juan
- Chang Gung University, Taoyuan, Taiwan
- Department of Medical Imaging and Intervention, Chang Gung Memorial Hospital, Taoyuan, Taiwan
| | - Cheng-Ta Yang
- Chang Gung University, Taoyuan, Taiwan
- Department of Thoracic Medicine, Chang Gung Memorial Hospital, Taoyuan, Taiwan
| | - Yung-Liang Wan
- Chang Gung University, Taoyuan, Taiwan
- Department of Medical Imaging and Intervention, Chang Gung Memorial Hospital, Taoyuan, Taiwan
| | - Gigin Lin
- Chang Gung University, Taoyuan, Taiwan
- Department of Medical Imaging and Intervention, Chang Gung Memorial Hospital, Taoyuan, Taiwan
| | - Lecia V. Sequist
- Harvard Medical School, Boston, MA
- Department of Medicine, Massachusetts General Hospital, Boston, MA
| | - Florian J. Fintelmann
- Harvard Medical School, Boston, MA
- Department of Radiology, Massachusetts General Hospital, Boston, MA
| | - Regina Barzilay
- Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA
- Jameel Clinic, Massachusetts Institute of Technology, Cambridge, MA
15
Qian Y, Guo J, Tu Z, Li Z, Coley CW, Barzilay R. MolScribe: Robust Molecular Structure Recognition with Image-to-Graph Generation. J Chem Inf Model 2023; 63:1925-1934. [PMID: 36971363 DOI: 10.1021/acs.jcim.2c01480] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/29/2023]
Abstract
Molecular structure recognition is the task of translating a molecular image into its graph structure. Significant variation in drawing styles and conventions exhibited in chemical literature poses a significant challenge for automating this task. In this paper, we propose MolScribe, a novel image-to-graph generation model that explicitly predicts atoms and bonds, along with their geometric layouts, to construct the molecular structure. Our model flexibly incorporates symbolic chemistry constraints to recognize chirality and expand abbreviated structures. We further develop data augmentation strategies to enhance the model robustness against domain shifts. In experiments on both synthetic and realistic molecular images, MolScribe significantly outperforms previous models, achieving 76-93% accuracy on public benchmarks. Chemists can also easily verify MolScribe's prediction, informed by its confidence estimation and atom-level alignment with the input image. MolScribe is publicly available through Python and web interfaces: https://github.com/thomas0809/MolScribe.
Affiliation(s)
- Yujie Qian
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
| | - Jiang Guo
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
| | - Zhengkai Tu
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
| | - Zhening Li
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
| | - Connor W Coley
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
| | - Regina Barzilay
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
16
Corso G, Jing B, Stark H, Barzilay R, Jaakkola T. Blind protein-ligand docking with diffusion-based deep generative models. Biophys J 2023; 122:143a. [PMID: 36782655 DOI: 10.1016/j.bpj.2022.11.937] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/12/2023] Open
Affiliation(s)
- Gabriele Corso
- Massachusetts Institute of Technology, Cambridge, MA, USA
| | - Bowen Jing
- Massachusetts Institute of Technology, Cambridge, MA, USA
| | - Hannes Stark
- Massachusetts Institute of Technology, Cambridge, MA, USA
| | | | - Tommi Jaakkola
- Massachusetts Institute of Technology, Cambridge, MA, USA
17
Yala A, Mikhael PG, Hughes K, Barzilay R. Reply to M. Eriksson et al and Z. Jin et al. J Clin Oncol 2022; 40:2281-2282. [PMID: 35452271 DOI: 10.1200/jco.22.00292] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/20/2022] Open
Affiliation(s)
- Adam Yala
- Adam Yala, ME, and Peter G. Mikhael, BS, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA, Jameel Clinic, Massachusetts Institute of Technology, Cambridge, MA; Kevin Hughes, MD, Department of Surgery, Medical University of South Carolina, Charleston, SC; and Regina Barzilay, PhD, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA, Jameel Clinic, Massachusetts Institute of Technology, Cambridge, MA
| | - Peter G Mikhael
- Adam Yala, ME, and Peter G. Mikhael, BS, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA, Jameel Clinic, Massachusetts Institute of Technology, Cambridge, MA; Kevin Hughes, MD, Department of Surgery, Medical University of South Carolina, Charleston, SC; and Regina Barzilay, PhD, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA, Jameel Clinic, Massachusetts Institute of Technology, Cambridge, MA
| | - Kevin Hughes
- Adam Yala, ME, and Peter G. Mikhael, BS, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA, Jameel Clinic, Massachusetts Institute of Technology, Cambridge, MA; Kevin Hughes, MD, Department of Surgery, Medical University of South Carolina, Charleston, SC; and Regina Barzilay, PhD, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA, Jameel Clinic, Massachusetts Institute of Technology, Cambridge, MA
| | - Regina Barzilay
- Adam Yala, ME, and Peter G. Mikhael, BS, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA, Jameel Clinic, Massachusetts Institute of Technology, Cambridge, MA; Kevin Hughes, MD, Department of Surgery, Medical University of South Carolina, Charleston, SC; and Regina Barzilay, PhD, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA, Jameel Clinic, Massachusetts Institute of Technology, Cambridge, MA
18
White LK, Crowley TB, Finucane B, Garcia-Minaur S, Repetto GM, van den Bree M, Fischer M, Jacquemont S, Barzilay R, Maillard AM, Donald KA, Gur RE, Bassett AS, Swillen A, McDonald-McGinn DM. The COVID-19 pandemic's impact on worry and medical disruptions reported by individuals with chromosome 22q11.2 copy number variants and their caregivers. J Intellect Disabil Res 2022; 66:313-322. [PMID: 35191118 PMCID: PMC9725107 DOI: 10.1111/jir.12918] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/10/2021] [Revised: 01/04/2022] [Accepted: 01/09/2022] [Indexed: 06/14/2023]
Abstract
BACKGROUND The world has suffered immeasurably during the COVID-19 pandemic. Increased distress and mental and medical health concerns are collateral consequences to the disease itself. The Genes to Mental Health (G2MH) Network consortium sought to understand how individuals affected by the rare copy number variations of 22q11.2 deletion and duplication syndrome, associated with neurodevelopmental/neuropsychiatric conditions, were coping. The article focuses on worry and disruptions in medical care caused by the pandemic.
METHODS The University of Pennsylvania COVID-19 Stressor List and care disruption questions were circulated by 22 advocacy groups in English and 11 other languages.
RESULTS A total of 512 people from 23 countries completed the survey; most were caregivers of affected individuals. Worry about family members acquiring COVID-19 had the highest average endorsed worry, whilst currently having COVID-19 had the lowest rated worry. Total COVID-19 worries were higher in individuals completing the survey towards the end of the study (later pandemic wave); 36% (n = 186) of the sample reported a significant effect on health due to care interruption during the pandemic; 44% of individuals (n = 111) receiving care for their genetic syndrome in a hospital setting reported delaying appointments due to COVID-19 fears; 12% (n = 59) of the sample reported disruptions to treatments; and of those reporting no current disruptions, 59% (n = 269) worried about future disruptions if the pandemic continued. Higher levels of care disruptions were related to higher COVID-19 worries (Ps < 0.005). Minimal differences by respondent type or copy number variation type emerged.
CONCLUSIONS Widespread medical care disruptions and pandemic-related worries were reported by individuals with 22q11.2 syndrome and their family members. Reported worries were broadly consistent with research results from prior reports in the general population. The long-term effects of COVID-19 worries, interruptions to care and hospital avoidance require further study.
Affiliation(s)
- L K White
- Lifespan Brain Institute, Children's Hospital of Philadelphia and Perelman School of Medicine at the University of Pennsylvania, Philadelphia, PA, USA
| | - T B Crowley
- Lifespan Brain Institute, Children's Hospital of Philadelphia and Perelman School of Medicine at the University of Pennsylvania, Philadelphia, PA, USA
| | - B Finucane
- Geisinger Autism & Developmental Medicine Institute, Geisinger Health System, Lewisburg, PA, USA
| | - S Garcia-Minaur
- Instituto de Genética Médica y Molecular, Hospital Universitario La Paz, Madrid, Spain
| | - G M Repetto
- Center for Genetics and Genomics, Facultad de Medicina Clínica Alemana - Universidad del Desarrollo, Santiago, Chile
| | - M van den Bree
- Medical Research Council Centre for Neuropsychiatric Genetics and Genomics, Division of Psychological Medicine and Clinical Neurosciences, Cardiff University, Cardiff, UK
| | - M Fischer
- Clinic and Policlinic for Psychiatry and Psychotherapy, University of Rostock, Rostock, Germany
| | - S Jacquemont
- Sainte Justine Research Center, University of Montreal, Montreal, Canada
| | - R Barzilay
- Lifespan Brain Institute, Children's Hospital of Philadelphia and Perelman School of Medicine at the University of Pennsylvania, Philadelphia, PA, USA
| | - A M Maillard
- Service des Troubles du Spectre de l'Autisme (STSA), Lausanne University Hospital, Lausanne, Switzerland
| | - K A Donald
- Red Cross War Memorial Children's Hospital, University of Cape Town, Cape Town, South Africa
| | - R E Gur
- Lifespan Brain Institute, Children's Hospital of Philadelphia and Perelman School of Medicine at the University of Pennsylvania, Philadelphia, PA, USA
| | - A S Bassett
- Centre for Addiction and Mental Health, University Health Network and Department of Psychiatry, University of Toronto, Toronto, Canada
| | - A Swillen
- Center for Human Genetics, University Hospital Leuven and Department of Human Genetics, KU Leuven, Leuven, Belgium
| | - D M McDonald-McGinn
- Lifespan Brain Institute, Children's Hospital of Philadelphia and Perelman School of Medicine at the University of Pennsylvania, Philadelphia, PA, USA
19
Bilodeau C, Jin W, Jaakkola T, Barzilay R, Jensen KF. Generative models for molecular discovery: Recent advances and challenges. WIREs Comput Mol Sci 2022. [DOI: 10.1002/wcms.1608] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Indexed: 12/15/2022]
Affiliation(s)
- Camille Bilodeau
- Department of Chemical Engineering Massachusetts Institute of Technology Cambridge Massachusetts USA
| | - Wengong Jin
- Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology Cambridge Massachusetts USA
| | - Tommi Jaakkola
- Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology Cambridge Massachusetts USA
| | - Regina Barzilay
- Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology Cambridge Massachusetts USA
| | - Klavs F. Jensen
- Department of Chemical Engineering Massachusetts Institute of Technology Cambridge Massachusetts USA
| |
|
20
|
Birnbaum R, Barzilay R, Brusilov M, Acharya P, Malinger G, Krajden Haratz K. Early second-trimester three-dimensional transvaginal neurosonography of fetal midbrain and hindbrain: normative data and technical aspects. Ultrasound Obstet Gynecol 2022; 59:317-324. [PMID: 34002885 DOI: 10.1002/uog.23691] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 02/06/2021] [Revised: 04/11/2021] [Accepted: 05/09/2021] [Indexed: 06/12/2023]
Abstract
OBJECTIVES To provide a detailed description of the sonographic appearance and development of various fetal structures of the midbrain and hindbrain (MBHB) during the early second trimester, and to evaluate the impact of the frequency of the transvaginal sonography (TVS) transducer on the early recognition of these structures. METHODS This was a retrospective analysis of three-dimensional volumetric datasets of the MBHB from apparently normal fetuses at 14-19 gestational weeks, acquired by TVS in the midsagittal view through the posterior fontanelle. Using a multiplanar approach, we measured the tectal thickness and length, aqueductal thickness, tegmental thickness and width and height of the Blake's pouch (BP) neck. In addition, we assessed the existence of early vermian fissures, the linear shape of the brainstem and the components of the fastigium. The correlation between gestational age according to last menstrual period and sonographic measurements of MBHB structures was evaluated using Pearson's correlation (r). A subanalysis was performed to assess the performance of a 5-9-MHz vs a 6-12-MHz TVS transducer in visualizing the MBHB structures in the early second trimester. RESULTS Sixty brain volumes were included in the study, obtained at a mean gestational age of 16.2 weeks (range, 14.1-19.0 weeks), with a transverse cerebellar diameter range of 13.0-19.8 mm. We found a strong correlation between gestational age and all MBHB measurements, with the exception of the tectal, tegmental and aqueductal thicknesses, for which the correlation was moderate. There was good-to-excellent intraobserver and moderate-to-good interobserver correlation for most MBHB measurements. We observed that the BP neck was patent in all fetuses between 14 and 18 weeks with decreasing diameter, and that the aqueductal thickness was significantly smaller at ≥ 18 weeks compared with at < 16 weeks. 
The early vermian fissures and the linear shape of the brainstem were present in all fetuses from 14 weeks. We found that, in the early second trimester, the horizontal arm of the presumed 'fastigium' evolves from the fourth ventricular choroid plexus and not the posterior vermis, indicating that this is not the fastigium. Standard- and high-resolution TVS transducers performed similarly in the assessment of MBHB anatomy. CONCLUSION Detailed early second-trimester assessment of the MBHB is feasible by transvaginal neurosonography and provides reference data which may help in the early detection of brain pathology involving the MBHB. © 2021 International Society of Ultrasound in Obstetrics and Gynecology.
Affiliation(s)
- R Birnbaum
- Ob-Gyn Ultrasound Unit, Lis Maternity Hospital, Tel Aviv Sourasky Medical Center, Tel Aviv, Israel
- Sackler Faculty of Medicine, Tel Aviv University, Tel Aviv, Israel
| | - R Barzilay
- Lifespan Brain Institute, Penn Medicine and Children's Hospital of Philadelphia (CHOP), Philadelphia, PA, USA
| | - M Brusilov
- Ob-Gyn Ultrasound Unit, Lis Maternity Hospital, Tel Aviv Sourasky Medical Center, Tel Aviv, Israel
- Sackler Faculty of Medicine, Tel Aviv University, Tel Aviv, Israel
| | - P Acharya
- Paras Advanced Center for Fetal Medicine, Ahmedabad, India
| | - G Malinger
- Ob-Gyn Ultrasound Unit, Lis Maternity Hospital, Tel Aviv Sourasky Medical Center, Tel Aviv, Israel
- Sackler Faculty of Medicine, Tel Aviv University, Tel Aviv, Israel
| | - K Krajden Haratz
- Ob-Gyn Ultrasound Unit, Lis Maternity Hospital, Tel Aviv Sourasky Medical Center, Tel Aviv, Israel
- Sackler Faculty of Medicine, Tel Aviv University, Tel Aviv, Israel
| |
|
21
|
Barzilay R. Abstract ES2-1: Everything you always wanted to know about AI but were afraid to ask. Cancer Res 2022. [DOI: 10.1158/1538-7445.sabcs21-es2-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/16/2022]
Abstract
In this talk, I will introduce the audience to the foundations of artificial intelligence and its application to breast cancer diagnostics and care. By nature, many of the traditional clinical tasks such as risk assessment, prediction of treatment efficacy and forecasting patient trajectory can be thought of as prediction problems. Given sufficient amounts of patient data with outcomes, a machine learning model can make predictions that often exceed those of human experts in accuracy. However, to make these tools more applicable in the clinical setting, we need to augment AI models with the ability to explain their decisions to humans and to assess their uncertainty. In my talk, I will give multiple examples of deployed AI applications, analyzing their strengths and weaknesses. First, I will describe a natural language processing system for extracting tumor information from pathology reports. Next, I will talk about image-based AI models for risk assessment and early detection of breast cancer. Finally, I will summarize AI drug discovery methods for personalized medicine.
Citation Format: R Barzilay. Everything you always wanted to know about AI but were afraid to ask [abstract]. In: Proceedings of the 2021 San Antonio Breast Cancer Symposium; 2021 Dec 7-10; San Antonio, TX. Philadelphia (PA): AACR; Cancer Res 2022;82(4 Suppl):Abstract nr ES2-1.
|
22
|
Bilodeau C, Jin W, Xu H, Emerson JA, Mukhopadhyay S, Kalantar TH, Jaakkola T, Barzilay R, Jensen KF. Generating molecules with optimized aqueous solubility using iterative graph translation. React Chem Eng 2022. [DOI: 10.1039/d1re00315a] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 12/14/2022]
Abstract
We present a generative modeling framework that can be used to discover new, optimal molecules. Our method involves iteratively 1) training a translation model, and 2) translating all molecules in the training dataset.
Affiliation(s)
- Camille Bilodeau
- Department of Chemical Engineering, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, MA 02139, USA
| | - Wengong Jin
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
| | - Hongyun Xu
- Dow Chemical Company, Midland, MI 48674, USA
| | - Tommi Jaakkola
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
| | - Regina Barzilay
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
| | - Klavs F. Jensen
- Department of Chemical Engineering, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, MA 02139, USA
| |
|
23
|
Yala A, Mikhael PG, Strand F, Lin G, Satuluru S, Kim T, Banerjee I, Gichoya J, Trivedi H, Lehman CD, Hughes K, Sheedy DJ, Matthis LM, Karunakaran B, Hegarty KE, Sabino S, Silva TB, Evangelista MC, Caron RF, Souza B, Mauad EC, Patalon T, Handelman-Gotlib S, Guindy M, Barzilay R. Multi-Institutional Validation of a Mammography-Based Breast Cancer Risk Model. J Clin Oncol 2021; 40:1732-1740. [PMID: 34767469 DOI: 10.1200/jco.21.01337] [Citation(s) in RCA: 61] [Impact Index Per Article: 20.3] [Indexed: 11/20/2022] Open
Abstract
PURPOSE Accurate risk assessment is essential for the success of population screening programs in breast cancer. Models with high sensitivity and specificity would enable programs to target more elaborate screening efforts to high-risk populations, while minimizing overtreatment for the rest. Artificial intelligence (AI)-based risk models have demonstrated a significant advance over risk models used today in clinical practice. However, the responsible deployment of novel AI requires careful validation across diverse populations. To this end, we validate our AI-based model, Mirai, across globally diverse screening populations. METHODS We collected screening mammograms and pathology-confirmed breast cancer outcomes from Massachusetts General Hospital, USA; Novant, USA; Emory, USA; Maccabi-Assuta, Israel; Karolinska, Sweden; Chang Gung Memorial Hospital, Taiwan; and Barretos, Brazil. We evaluated Uno's concordance-index for Mirai in predicting risk of breast cancer at one to five years from the mammogram. RESULTS A total of 128,793 mammograms from 62,185 patients were collected across the seven sites, of which 3,815 were followed by a cancer diagnosis within 5 years. Mirai obtained concordance indices of 0.75 (95% CI, 0.72 to 0.78), 0.75 (95% CI, 0.70 to 0.80), 0.77 (95% CI, 0.75 to 0.79), 0.77 (95% CI, 0.73 to 0.81), 0.81 (95% CI, 0.79 to 0.82), 0.79 (95% CI, 0.76 to 0.83), and 0.84 (95% CI, 0.81 to 0.88) at Massachusetts General Hospital, Novant, Emory, Maccabi-Assuta, Karolinska, Chang Gung Memorial Hospital, and Barretos, respectively. CONCLUSION Mirai, a mammography-based risk model, maintained its accuracy across globally diverse test sets from seven hospitals across five countries. This is the broadest validation to date of an AI-based breast cancer model and suggests that the technology can offer broad and equitable improvements in care.
Affiliation(s)
- Adam Yala
- Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA.,Jameel Clinic, Massachusetts Institute of Technology, Cambridge, MA
| | - Peter G Mikhael
- Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA.,Jameel Clinic, Massachusetts Institute of Technology, Cambridge, MA
| | - Fredrik Strand
- Breast Radiology Unit, Department of Imaging and Physiology, Karolinska University Hospital, Stockholm, Sweden.,Department of Oncology-Pathology, Karolinska Institute, Stockholm, Sweden
| | - Gigin Lin
- Department of Medical Imaging and Intervention, Chang Gung Memorial Hospital at Linkou, Taoyuan, Taiwan
| | - Siddharth Satuluru
- Department of Computer Science, University of California Los Angeles, Los Angeles, CA
| | - Thomas Kim
- Department of Computer Science, Georgia Institute of Technology, Atlanta, GA
| | - Imon Banerjee
- Department of Biomedical Informatics, Emory University, Atlanta, GA
| | - Judy Gichoya
- Department of Radiology, Emory University, Atlanta, GA
| | - Hari Trivedi
- Department of Radiology, Emory University, Atlanta, GA
| | - Constance D Lehman
- Department of Radiology, Massachusetts General Hospital, Harvard Medical School, Boston, MA
| | - Kevin Hughes
- Division of Surgical Oncology, Massachusetts General Hospital, Harvard Medical School, Boston, MA
| | - David J Sheedy
- Analytics and Informatics Department, Novant Health, Winston-Salem, NC
| | - Lisa M Matthis
- Analytics and Informatics Department, Novant Health, Winston-Salem, NC
| | - Bipin Karunakaran
- Analytics and Informatics Department, Novant Health, Winston-Salem, NC
| | - Karen E Hegarty
- Digital Product and Services, Novant Health, Winston-Salem, NC
| | - Silvia Sabino
- Department of Cancer Prevention, Barretos Cancer Hospital, Barretos, Brazil
| | - Thiago B Silva
- Department of Cancer Prevention, Barretos Cancer Hospital, Barretos, Brazil
| | - Renato F Caron
- Department of Cancer Prevention, Barretos Cancer Hospital, Barretos, Brazil
| | - Bruno Souza
- Department of Cancer Prevention, Barretos Cancer Hospital, Barretos, Brazil
| | - Edmundo C Mauad
- Department of Cancer Prevention, Barretos Cancer Hospital, Barretos, Brazil
| | - Tal Patalon
- Maccabitech, Maccabi Health Services, Tel Aviv, Israel
| | - Michal Guindy
- Department of Imaging, Assuta Medical Centers, Tel Aviv, Israel
| | - Regina Barzilay
- Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA.,Jameel Clinic, Massachusetts Institute of Technology, Cambridge, MA
| |
|
24
|
Santus E, Schuster T, Tahmasebi AM, Li C, Yala A, Lanahan CR, Prinsen P, Thompson SF, Coons S, Mynderse L, Barzilay R, Hughes K. Exploiting Rules to Enhance Machine Learning in Extracting Information From Multi-Institutional Prostate Pathology Reports. JCO Clin Cancer Inform 2021; 4:865-874. [PMID: 33006906 DOI: 10.1200/cci.20.00028] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 12/13/2022] Open
Abstract
PURPOSE Literature on clinical note mining has highlighted the superiority of machine learning (ML) over hand-crafted rules. Nevertheless, most studies assume the availability of large training sets, which is rarely the case. For this reason, in the clinical setting, rules are still common. We suggest 2 methods to leverage the knowledge encoded in pre-existing rules to inform ML decisions and obtain high performance, even with scarce annotations. METHODS We collected 501 prostate pathology reports from 6 American hospitals. Reports were split into 2,711 core segments, annotated with 20 attributes describing the histology, grade, extension, and location of tumors. The data set was split by institutions to generate a cross-institutional evaluation setting. We assessed 4 systems, namely a rule-based approach, an ML model, and 2 hybrid systems integrating the previous methods: a Rule as Feature model and a Classifier Confidence model. Several ML algorithms were tested, including logistic regression (LR), support vector machine (SVM), and eXtreme gradient boosting (XGB). RESULTS When training on data from a single institution, LR lags behind the rules by 3.5% (F1 score: 92.2% v 95.7%). Hybrid models, instead, obtain competitive results, with Classifier Confidence outperforming the rules by +0.5% (96.2%). When a larger amount of data from multiple institutions is used, LR improves by +1.5% over the rules (97.2%), whereas hybrid systems obtain +2.2% for Rule as Feature (97.7%) and +2.6% for Classifier Confidence (98.3%). Replacing LR with SVM or XGB yielded similar performance gains. CONCLUSION We developed methods to use pre-existing handcrafted rules to inform ML algorithms. These hybrid systems obtain better performance than either rules or ML models alone, even when training data are limited.
Affiliation(s)
- Enrico Santus
- Department of Electrical Engineering and Computer Science, CSAIL, MIT, Cambridge, MA
| | - Tal Schuster
- Department of Electrical Engineering and Computer Science, CSAIL, MIT, Cambridge, MA
| | - Clara Li
- Department of Electrical Engineering and Computer Science, CSAIL, MIT, Cambridge, MA
| | - Adam Yala
- Department of Electrical Engineering and Computer Science, CSAIL, MIT, Cambridge, MA
| | - Conor R Lanahan
- Department of Oncology, Massachusetts General Hospital, Boston, MA
| | - Regina Barzilay
- Department of Electrical Engineering and Computer Science, CSAIL, MIT, Cambridge, MA
| | - Kevin Hughes
- Department of Oncology, Massachusetts General Hospital, Boston, MA
| |
|
25
|
Guo J, Ibanez-Lopez AS, Gao H, Quach V, Coley CW, Jensen KF, Barzilay R. Correction to Automated Chemical Reaction Extraction from Scientific Literature. J Chem Inf Model 2021; 61:4124. [PMID: 34297557 DOI: 10.1021/acs.jcim.1c00834] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Indexed: 11/29/2022]
|
26
|
Yala A, Mikhael PG, Strand F, Lin G, Smith K, Wan YL, Lamb L, Hughes K, Lehman C, Barzilay R. Toward robust mammography-based models for breast cancer risk. Sci Transl Med 2021; 13(578):eaba4373. [PMID: 33504648 DOI: 10.1126/scitranslmed.aba4373] [Citation(s) in RCA: 72] [Impact Index Per Article: 24.0] [Received: 05/04/2020] [Revised: 07/24/2020] [Accepted: 12/21/2020] [Indexed: 12/14/2022]
Abstract
Improved breast cancer risk models enable targeted screening strategies that achieve earlier detection and less screening harm than existing guidelines. To bring deep learning risk models to clinical practice, we need to further refine their accuracy, validate them across diverse populations, and demonstrate their potential to improve clinical workflows. We developed Mirai, a mammography-based deep learning model designed to predict risk at multiple timepoints, leverage potentially missing risk factor information, and produce predictions that are consistent across mammography machines. Mirai was trained on a large dataset from Massachusetts General Hospital (MGH) in the United States and tested on held-out test sets from MGH, Karolinska University Hospital in Sweden, and Chang Gung Memorial Hospital (CGMH) in Taiwan, obtaining C-indices of 0.76 (95% confidence interval, 0.74 to 0.80), 0.81 (0.79 to 0.82), and 0.79 (0.79 to 0.83), respectively. Mirai obtained significantly higher 5-year ROC AUCs than the Tyrer-Cuzick model (P < 0.001) and prior deep learning models Hybrid DL (P < 0.001) and Image-Only DL (P < 0.001), trained on the same dataset. Mirai more accurately identified high-risk patients than prior methods across all datasets. On the MGH test set, 41.5% (34.4 to 48.5) of patients who would develop cancer within 5 years were identified as high risk, compared with 36.1% (29.1 to 42.9) by Hybrid DL (P = 0.02) and 22.9% (15.9 to 29.6) by the Tyrer-Cuzick model (P < 0.001).
Affiliation(s)
- Adam Yala
- Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA 02139, USA.,Jameel Clinic, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
| | - Peter G Mikhael
- Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA 02139, USA.,Jameel Clinic, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
| | - Fredrik Strand
- Breast Radiology Unit, Department of Imaging and Physiology, Karolinska University Hospital, 17164 Solna, Sweden.,Department of Oncology-Pathology, Karolinska Institute, 17164 Solna, Sweden
| | - Gigin Lin
- Department of Medical Imaging and Intervention, Chang Gung Memorial Hospital at Linkou, Taoyuan 333, Taiwan
| | - Kevin Smith
- School of Electrical Engineering and Computer, KTH Royal Institute of Technology, 10044 Stockholm, Sweden.,Science for Life Laboratory, 17165 Solna, Sweden
| | - Yung-Liang Wan
- Department of Medical Imaging and Intervention, Chang Gung Memorial Hospital at Linkou, Taoyuan 333, Taiwan
| | - Leslie Lamb
- Department of Radiology, Massachusetts General Hospital, Harvard Medical School, Boston, MA 02114, USA
| | - Kevin Hughes
- Division of Surgical Oncology, Massachusetts General Hospital, Harvard Medical School, Boston, MA 02114, USA
| | - Constance Lehman
- Department of Radiology, Massachusetts General Hospital, Harvard Medical School, Boston, MA 02114, USA
| | - Regina Barzilay
- Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA 02139, USA.,Jameel Clinic, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
| |
|
27
|
Soenksen LR, Kassis T, Conover ST, Marti-Fuster B, Birkenfeld JS, Tucker-Schwartz J, Naseem A, Stavert RR, Kim CC, Senna MM, Avilés-Izquierdo J, Collins JJ, Barzilay R, Gray ML. Using deep learning for dermatologist-level detection of suspicious pigmented skin lesions from wide-field images. Sci Transl Med 2021; 13(581):eabb3652. [PMID: 33597262 DOI: 10.1126/scitranslmed.abb3652] [Citation(s) in RCA: 45] [Impact Index Per Article: 15.0] [Received: 03/02/2020] [Revised: 08/17/2020] [Accepted: 01/08/2021] [Indexed: 11/03/2022]
Abstract
A reported 96,480 people were diagnosed with melanoma in the United States in 2019, leading to 7230 reported deaths. Early-stage identification of suspicious pigmented lesions (SPLs) in primary care settings can lead to improved melanoma prognosis and a possible 20-fold reduction in treatment cost. Despite this clinical and economic value, efficient tools for SPL detection are mostly absent. To bridge this gap, we developed an SPL analysis system for wide-field images using deep convolutional neural networks (DCNNs) and applied it to a 38,283 dermatological dataset collected from 133 patients and publicly available images. These images were obtained from a variety of consumer-grade cameras (15,244 nondermoscopy) and classified by three board-certified dermatologists. Our system achieved more than 90.3% sensitivity (95% confidence interval, 90 to 90.6) and 89.9% specificity (89.6 to 90.2%) in distinguishing SPLs from nonsuspicious lesions, skin, and complex backgrounds, avoiding the need for cumbersome individual lesion imaging. We also present a new method to extract intrapatient lesion saliency (ugly duckling criteria) on the basis of DCNN features from detected lesions. This saliency ranking was validated against three board-certified dermatologists using a set of 135 individual wide-field images from 68 dermatological patients not included in the DCNN training set, exhibiting 82.96% (67.88 to 88.26%) agreement with at least one of the top three lesions in the dermatological consensus ranking. This method could allow for rapid and accurate assessments of pigmented lesion suspiciousness within a primary care visit and could enable improved patient triaging, utilization of resources, and earlier treatment of melanoma.
Affiliation(s)
- Luis R Soenksen
- Department of Mechanical Engineering, Massachusetts Institute of Technology, 77 Massachusetts Ave, Cambridge, MA 02139, USA.,Institute for Medical Engineering and Science, Massachusetts Institute of Technology, 77 Massachusetts Ave, Cambridge, MA 02139, USA.,Wyss Institute for Biologically Inspired Engineering, Harvard University, 3 Blackfan Cir, Boston, MA 02115, USA.,Harvard-MIT Program in Health Sciences and Technology, Cambridge, MA 02139, USA.,MIT linQ, Massachusetts Institute of Technology Cambridge, MA 02148, USA
| | - Timothy Kassis
- Department of Biological Engineering, Massachusetts Institute of Technology, 77 Massachusetts Ave, Cambridge, 02139, MA, USA
| | - Susan T Conover
- Institute for Medical Engineering and Science, Massachusetts Institute of Technology, 77 Massachusetts Ave, Cambridge, MA 02139, USA
| | - Berta Marti-Fuster
- Institute for Medical Engineering and Science, Massachusetts Institute of Technology, 77 Massachusetts Ave, Cambridge, MA 02139, USA.,MIT linQ, Massachusetts Institute of Technology Cambridge, MA 02148, USA
| | - Judith S Birkenfeld
- Institute for Medical Engineering and Science, Massachusetts Institute of Technology, 77 Massachusetts Ave, Cambridge, MA 02139, USA.,MIT linQ, Massachusetts Institute of Technology Cambridge, MA 02148, USA
| | - Jason Tucker-Schwartz
- Institute for Medical Engineering and Science, Massachusetts Institute of Technology, 77 Massachusetts Ave, Cambridge, MA 02139, USA.,MIT linQ, Massachusetts Institute of Technology Cambridge, MA 02148, USA
| | - Asif Naseem
- Institute for Medical Engineering and Science, Massachusetts Institute of Technology, 77 Massachusetts Ave, Cambridge, MA 02139, USA.,MIT linQ, Massachusetts Institute of Technology Cambridge, MA 02148, USA
| | - Robert R Stavert
- Division of Dermatology, Cambridge Health Alliance, 1493 Cambridge Street, Cambridge, MA 02139, USA.,Department of Dermatology, Beth Israel Deaconess Medical Center, 330 Brookline Ave, Boston, MA 02215, USA.,Department of Dermatology, Harvard Medical School, 25 Shattuck St, Boston, MA 02115, USA
| | - Caroline C Kim
- Pigmented Lesion Program, Newton Wellesley Dermatology Associates, 65 Walnut Street Suite 520 Wellesley Hills, MA 02481, USA.,Department of Dermatology, Tufts Medical Center, 260 Tremont Street Biewend Building, Boston, MA 02116, USA
| | - Maryanne M Senna
- Department of Dermatology, Harvard Medical School, 25 Shattuck St, Boston, MA 02115, USA.,Department of Dermatology, Massachusetts General Hospital, 55 Fruit St, Boston, MA 02114
| | - José Avilés-Izquierdo
- Department of Dermatology, Hospital General Universitario Gregorio Marañón, Calle del Dr. Esquerdo 46, 28007 Madrid, Spain
| | - James J Collins
- Institute for Medical Engineering and Science, Massachusetts Institute of Technology, 77 Massachusetts Ave, Cambridge, MA 02139, USA.,Wyss Institute for Biologically Inspired Engineering, Harvard University, 3 Blackfan Cir, Boston, MA 02115, USA.,Harvard-MIT Program in Health Sciences and Technology, Cambridge, MA 02139, USA.,Department of Biological Engineering, Massachusetts Institute of Technology, 77 Massachusetts Ave, Cambridge, 02139, MA, USA.,Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA.,School of Engineering and Applied Sciences, Harvard University, Cambridge, MA 02138, USA
| | - Regina Barzilay
- Computer Science and Artificial Intelligence Lab, Massachusetts Institute of Technology, 77 Massachusetts Ave, Cambridge, MA 02139, USA.,Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology Cambridge, MA 02148, USA
| | - Martha L Gray
- Institute for Medical Engineering and Science, Massachusetts Institute of Technology, 77 Massachusetts Ave, Cambridge, MA 02139, USA.,Harvard-MIT Program in Health Sciences and Technology, Cambridge, MA 02139, USA.,MIT linQ, Massachusetts Institute of Technology Cambridge, MA 02148, USA.,Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology Cambridge, MA 02148, USA
| |
|
28
|
Birnbaum R, Barzilay R, Brusilov M, Wolman I, Malinger G. Normal cavum veli interpositi at 14-17 gestational weeks: three-dimensional and Doppler transvaginal neurosonographic study. Ultrasound Obstet Gynecol 2021; 58:19-25. [PMID: 32798260 DOI: 10.1002/uog.22176] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 06/04/2020] [Revised: 07/11/2020] [Accepted: 08/10/2020] [Indexed: 06/11/2023]
Abstract
OBJECTIVES To provide evidence to support the hypothesis that the midline cyst-like fluid collection that is frequently observed on fetal brain ultrasound (US) imaging during the early second trimester represents a normal transient cavum veli interpositi (CVI). METHODS This was a retrospective analysis of 89 three-dimensional normal fetal brain volumes, acquired by transvaginal US imaging in 87 pregnant women between 14 and 17 gestational weeks. The midsagittal view was studied using multiplanar imaging, and the maximum length of the fluid collection located over (dorsal to) the tela choroidea of the third ventricle was measured. We calculated the correlation of the transverse cerebellar diameter (TCD) and of the maximum length of the fluid collection with gestational age according to last menstrual period. Color Doppler images were analyzed to determine the location of the internal cerebral veins with respect to the location of the fluid collection. Reports of the second-trimester anatomy scan at 22-24 weeks were also reviewed. RESULTS Interhemispheric fluid collections of various sizes were found in 55% (49/89) of the volumes (mean length, 5 (range, 3.0-7.8) mm). There was a strong correlation between TCD and gestational age (Pearson's correlation, 0.862; P < 0.001). There was no correlation between maximum fluid length and gestational age (Pearson's correlation, -0.442; P = 0.773). Color Doppler images were retrieved in 32 of the 49 fetuses; in 100% of these, the internal cerebral veins coursed within the echogenic roof of the third ventricle. The midline structures were normal at the 22-24-week scan in all cases. CONCLUSIONS In approximately half of normal fetuses, during the evolution of the midline structures of the brain, various degrees of fluid accumulate transiently in the velum interpositum, giving rise to a physiologic CVI. 
Patients should be reassured that this is a normal phenomenon in the early second trimester that, if an isolated finding, has no influence on fetal brain development. © 2020 International Society of Ultrasound in Obstetrics and Gynecology.
Affiliation(s)
- R Birnbaum
- Ob-Gyn Ultrasound Unit, Lis Maternity Hospital, Tel Aviv Sourasky Medical Center, Tel Aviv, Israel
- Sackler Faculty of Medicine, Tel Aviv University, Tel Aviv, Israel
| | - R Barzilay
- Lifespan Brain Institute, Penn Medicine and Children's Hospital of Philadelphia (CHOP), Philadelphia, PA, USA
| | - M Brusilov
- Ob-Gyn Ultrasound Unit, Lis Maternity Hospital, Tel Aviv Sourasky Medical Center, Tel Aviv, Israel
- Sackler Faculty of Medicine, Tel Aviv University, Tel Aviv, Israel
| | - I Wolman
- Ob-Gyn Ultrasound Unit, Lis Maternity Hospital, Tel Aviv Sourasky Medical Center, Tel Aviv, Israel
- Sackler Faculty of Medicine, Tel Aviv University, Tel Aviv, Israel
| | - G Malinger
- Ob-Gyn Ultrasound Unit, Lis Maternity Hospital, Tel Aviv Sourasky Medical Center, Tel Aviv, Israel
- Sackler Faculty of Medicine, Tel Aviv University, Tel Aviv, Israel
| |
|
29
|
Guo J, Ibanez-Lopez AS, Gao H, Quach V, Coley CW, Jensen KF, Barzilay R. Automated Chemical Reaction Extraction from Scientific Literature. J Chem Inf Model 2021; 62:2035-2045. [PMID: 34115937 DOI: 10.1021/acs.jcim.1c00284] [Citation(s) in RCA: 17] [Impact Index Per Article: 5.7] [Indexed: 11/29/2022]
Abstract
Access to structured chemical reaction data is of key importance for chemists in performing bench experiments and in modern applications like computer-aided drug design. Existing reaction databases are generally populated by human curators through manual abstraction from published literature (e.g., patents and journals), which is time consuming and labor intensive, especially with the exponential growth of chemical literature in recent years. In this study, we focus on developing automated methods for extracting reactions from chemical literature. We consider journal publications as the target source of information, which are more comprehensive and better represent the latest developments in chemistry compared to patents; however, they are less formulaic in their descriptions of reactions. To implement the reaction extraction system, we first devised a chemical reaction schema, primarily including a central product, and a set of associated reaction roles such as reactants, catalyst, solvent, and so on. We formulate the task as a structure prediction problem and solve it with a two-stage deep learning framework consisting of product extraction and reaction role labeling. Both models are built upon Transformer-based encoders, which are adaptively pretrained using domain and task-relevant unlabeled data. Our models are shown to be both effective and data efficient, achieving an F1 score of 76.2% in product extraction and 78.7% in role extraction, with only hundreds of annotated reactions.
Affiliation(s)
- Jiang Guo
- Computer Science and Artificial Intelligence Laboratory, MIT, Cambridge, Massachusetts 02139, United States
| | - A Santiago Ibanez-Lopez
- Computer Science and Artificial Intelligence Laboratory, MIT, Cambridge, Massachusetts 02139, United States
| | - Hanyu Gao
- Department of Chemical Engineering, MIT, Cambridge, Massachusetts 02139, United States
| | - Victor Quach
- Computer Science and Artificial Intelligence Laboratory, MIT, Cambridge, Massachusetts 02139, United States
| | - Connor W Coley
- Department of Chemical Engineering, MIT, Cambridge, Massachusetts 02139, United States
| | - Klavs F Jensen
- Department of Chemical Engineering, MIT, Cambridge, Massachusetts 02139, United States
| | - Regina Barzilay
- Computer Science and Artificial Intelligence Laboratory, MIT, Cambridge, Massachusetts 02139, United States
| |
|
30
|
Abstract
Introduction: Artificial Intelligence (AI) has become a component of our everyday lives, with applications ranging from recommendations on what to buy to the analysis of radiology images. Many of the techniques originally developed for other fields such as language translation and computer vision are now being applied in drug discovery. AI has enabled multiple aspects of drug discovery including the analysis of high content screening data, and the design and synthesis of new molecules. Areas covered: This perspective provides an overview of the application of AI in several areas relevant to drug discovery including property prediction, molecule generation, image analysis, and organic synthesis planning. Expert opinion: While a variety of machine learning methods are now being routinely used to predict biological activity and ADME properties, methods of representing molecules continue to evolve. Molecule generation methods are relatively new and unproven but hold the potential to access new, unexplored areas of chemical space. The application of AI in drug discovery will continue to benefit from dedicated research, as well as AI developments in other fields. With this pairing of algorithmic advancements and high-quality data, the impact of AI in drug discovery will continue to grow in the coming years.
Affiliation(s)
| | - Regina Barzilay
- Computer Science and Artificial Intelligence Laboratory, MIT, Cambridge, MA, USA
| |
|
31
|
Dontchos BN, Yala A, Barzilay R, Xiang J, Lehman CD. External Validation of a Deep Learning Model for Predicting Mammographic Breast Density in Routine Clinical Practice. Acad Radiol 2021; 28:475-480. [PMID: 32089465 DOI: 10.1016/j.acra.2019.12.012] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 11/11/2019] [Revised: 12/11/2019] [Accepted: 12/12/2019] [Indexed: 11/29/2022]
Abstract
RATIONALE AND OBJECTIVES Federal legislation requires patient notification of dense mammographic breast tissue because increased density is a marker of breast cancer risk and can limit the sensitivity of mammography. As previously described, we clinically implemented our deep learning model at the academic breast imaging practice where the model was developed with high clinical acceptance. Our objective was to externally validate our deep learning model on radiologist breast density assessments in a community breast imaging practice. MATERIALS AND METHODS Our deep learning model was implemented at a dedicated breast imaging practice staffed by both academic and community breast imaging radiologists in October 2018. Deep learning model assessment of mammographic breast density was presented to the radiologist during routine clinical practice at the time of mammogram interpretation. We identified 2174 consecutive screening mammograms after implementation of the deep learning model. Radiologist agreement with the model's assessment was measured and compared across radiologist groups. RESULTS Both academic and community radiologists had high clinical acceptance of the deep learning model's density prediction, with 94.9% (academic) and 90.7% (community) acceptance for dense versus nondense categories (p < 0.001). The proportion of mammograms assessed as dense by all radiologists decreased from 47.0% before deep learning model implementation to 41.0% after deep learning model implementation (p < 0.001). CONCLUSION Our deep learning model had a high clinical acceptance rate among both academic and community radiologists and reduced the proportion of mammograms assessed as dense. This is an important step to validating our deep learning model prior to potential widespread implementation.
Affiliation(s)
- Brian N Dontchos
- Massachusetts General Hospital, 55 Fruit Street, WAC-240, Boston, MA 02114.
| | - Adam Yala
- Massachusetts Institute of Technology, Cambridge, Massachusetts
| | - Regina Barzilay
- Massachusetts Institute of Technology, Cambridge, Massachusetts
| | - Justin Xiang
- Massachusetts Institute of Technology, Cambridge, Massachusetts
| | - Constance D Lehman
- Massachusetts General Hospital, 55 Fruit Street, WAC-240, Boston, MA 02114
| |
|
32
|
Barzilay R, Yala A. Abstract IA-24: Towards robust image based models for cancer risk assessment. Clin Cancer Res 2021. [DOI: 10.1158/1557-3265.adi21-ia-24] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/16/2022]
Abstract
To enable targeted personalized screening, we need to advance cancer risk models. Despite significant research on this topic, statistical models used today in clinical practice routinely underperform. In this talk, I explore the potential of AI-based models for breast cancer prediction based on mammographic images. These models have already shown significant performance gains over their counterparts. However, to bring deep learning models to clinical practice, we need to further refine their accuracy, validate them across diverse populations, and demonstrate their potential to improve clinical workflows. To this end, we propose Mirai, a new risk algorithm designed to predict risk at multiple time points, leverage potentially missing risk-factor information, and produce predictions that are consistent across mammography machines. The architecture of the new model will be covered in detail in the talk. Mirai was trained on a large dataset from Massachusetts General Hospital (MGH) in the US and was tested on held-out test sets from MGH, Karolinska in Sweden and Chang Gung Memorial Hospital in Taiwan, obtaining C-indices of 0.76 (95% CI 0.74, 0.80), 0.81 (0.79, 0.82), 0.79 (0.79, 0.83), respectively. Mirai obtained significantly higher five-year ROC AUCs than the Tyrer-Cuzick model (p<0.001) and prior deep learning models, Hybrid DL (p<0.001) and ImageOnly DL (p<0.001), trained on the same MGH dataset.
Citation Format: Regina Barzilay, Adam Yala. Towards robust image based models for cancer risk assessment [abstract]. In: Proceedings of the AACR Virtual Special Conference on Artificial Intelligence, Diagnosis, and Imaging; 2021 Jan 13-14. Philadelphia (PA): AACR; Clin Cancer Res 2021;27(5_Suppl):Abstract nr IA-24.
|
33
|
Abstract
Innovative methods of risk assessment that leverage the strength of Artificial Intelligence (AI) are essential to propel the goals of precision prevention forward. Since the creation of the Gail model in 1989, risk models have supported risk-adjusted screening and prevention, and their continued evolution has been a central pillar of breast cancer research. Prior research has explored multiple risk factors related to hormonal and genetic information. One factor that has received substantial attention is mammographic breast density. Incorporating mammographic breast density into clinically used models such as the Gail and Tyrer-Cuzick risk models significantly improves prediction and discrimination. However, current risk models are limited in that they incorporate only a small fraction of data available on any given patient. Using breast density as a proxy for the detailed information embedded in the mammogram is extremely limited, as breast density assessment is subjective, varies widely across radiologists, and restricts the rich information contained in the digital images to a single crude value. Patients of the same age assigned the same density score can have mammogram images that appear drastically different and can have very different future risk profiles. While previous studies have explored automated methods to assess breast density, these efforts reduce the complex data contained in the mammogram into a few statistics, which are not sufficiently rich to distinguish patients who will and will not develop breast cancer. Deep learning models can operate over full resolution mammogram images to assess a patient’s future breast cancer risk. Rather than manually identifying discriminative image patterns, machine learning models can discover these patterns directly from the data. 
Specifically, models are trained with full resolution mammograms and the outcome of interest, namely whether the patient developed breast cancer within five years from the date of the examination. Our recent work demonstrates that application of novel artificial intelligence applications to imaging data can significantly improve breast cancer risk prediction. In addition, unlike traditional models, our DL model performs equally well across varied races, ages, and family histories and we have built a clinical platform which is currently in use to support implementation of our risk model into clinical care. The COVID-19 pandemic has revealed severe inequities in healthcare while providing opportunities for essential reform. In breast cancer care, preliminary, conservative estimates predict the disruption of breast cancer screening due to the COVID-19 pandemic will result in a significant upward stage shift of cancers diagnosed and more than 5,000 breast cancer deaths in the U.S. alone.
Due to severely limited healthcare resources during pandemics, and to protect patients and healthcare workers, state governments urge providers to focus cancer screening efforts on those patients at higher risk. These mandates are necessary responses to support fair allocation of scarce resources to maximize benefits for all patients across the full spectrum of healthcare needs. AI-based breast cancer risk models have the potential to support more effective and more equitable mammographic screening for breast cancer during these times of severely restricted access to screening.
ROC Area Under the Curve Analyses of Traditional vs AI Risk Models
Risk Model | Tyrer-Cuzick version 8 AUC | AI Image Only AUC
Race | |
African American | 0.58 (0.39, 0.79) | 0.74 (0.60, 0.90)
Asian | 0.53 (0.35, 0.74) | 0.79 (0.68, 0.94)
White | 0.64 (0.60, 0.68) | 0.77 (0.73, 0.80)
Age | |
<50 | 0.65 (0.57, 0.72) | 0.75 (0.68, 0.82)
50-70 | 0.64 (0.60, 0.69) | 0.76 (0.72, 0.79)
>70 | 0.52 (0.43, 0.60) | 0.77 (0.70, 0.84)
Density | |
Non-Dense | 0.63 (0.58, 0.68) | 0.77 (0.73, 0.81)
Dense | 0.63 (0.58, 0.69) | 0.77 (0.73, 0.81)
Citation Format: C Lehman, A Yala, L Lamb, R Barzilay. Hidden clues in the mammogram: How AI can improve early breast cancer detection [abstract]. In: Proceedings of the 2020 San Antonio Breast Cancer Virtual Symposium; 2020 Dec 8-11; San Antonio, TX. Philadelphia (PA): AACR; Cancer Res 2021;81(4 Suppl):Abstract nr SP080.
Affiliation(s)
- C Lehman
- Harvard Medical School Mass General Hospital, Boston, MA
| | | | - L Lamb
- Harvard Medical School Mass General Hospital, Boston, MA
| | | |
|
34
|
Abstract
Recent advances in computer hardware and software have led to a revolution in deep neural networks that has impacted fields ranging from language translation to computer vision. Deep learning has also impacted a number of areas in drug discovery, including the analysis of cellular images and the design of novel routes for the synthesis of organic molecules. While work in these areas has been impactful, a complete review of the applications of deep learning in drug discovery would be beyond the scope of a single Account. In this Account, we will focus on two key areas where deep learning has impacted molecular design: the prediction of molecular properties and the de novo generation of suggestions for new molecules. One of the most significant advances in the development of quantitative structure-activity relationships (QSARs) has come from the application of deep learning methods to the prediction of the biological activity and physical properties of molecules in drug discovery programs. Rather than employing the expert-derived chemical features typically used to build predictive models, researchers are now using deep learning to develop novel molecular representations. These representations, coupled with the ability of deep neural networks to uncover complex, nonlinear relationships, have led to state-of-the-art performance. While deep learning has changed the way that many researchers approach QSARs, it is not a panacea. As with any other machine learning task, the design of predictive models is dependent on the quality, quantity, and relevance of available data. Seemingly fundamental issues, such as optimal methods for creating a training set, are still open questions for the field. Another critical area that is still the subject of multiple research efforts is the development of methods for assessing the confidence in a model. Deep learning has also contributed to a renaissance in the application of de novo molecule generation.
Rather than relying on manually defined heuristics, deep learning methods learn to generate new molecules based on sets of existing molecules. Techniques that were originally developed for areas such as image generation and language translation have been adapted to the generation of molecules. These deep learning methods have been coupled with the predictive models described above and are being used to generate new molecules with specific predicted biological activity profiles. While these generative algorithms appear promising, there have been only a few reports on the synthesis and testing of molecules based on designs proposed by generative models. The evaluation of the diversity, quality, and ultimate value of molecules produced by generative models is still an open question. While the field has produced a number of benchmarks, it has yet to agree on how one should ultimately assess molecules "invented" by an algorithm.
Affiliation(s)
- W. Patrick Walters
- Relay Therapeutics, 399 Binney Street, Cambridge, Massachusetts 02142, United States
| | - Regina Barzilay
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
| |
|
35
|
Arbour KC, Luu AT, Luo J, Rizvi H, Plodkowski AJ, Sakhi M, Huang KB, Digumarthy SR, Ginsberg MS, Girshman J, Kris MG, Riely GJ, Yala A, Gainor JF, Barzilay R, Hellmann MD. Deep Learning to Estimate RECIST in Patients with NSCLC Treated with PD-1 Blockade. Cancer Discov 2020; 11:59-67. [DOI: 10.1158/2159-8290.cd-20-0419] [Citation(s) in RCA: 22] [Impact Index Per Article: 5.5] [Received: 04/03/2020] [Revised: 07/10/2020] [Accepted: 09/16/2020] [Indexed: 11/16/2022]
|
36
|
Wang X, Qian Y, Gao H, Coley CW, Mo Y, Barzilay R, Jensen KF. Towards efficient discovery of green synthetic pathways with Monte Carlo tree search and reinforcement learning. Chem Sci 2020; 11:10959-10972. [PMID: 34094345 PMCID: PMC8162445 DOI: 10.1039/d0sc04184j] [Citation(s) in RCA: 18] [Impact Index Per Article: 4.5] [Received: 07/30/2020] [Accepted: 09/11/2020] [Indexed: 12/25/2022] Open
Abstract
Computer aided synthesis planning of synthetic pathways with green process conditions has become of increasing importance in organic chemistry, but the large search space inherent in synthesis planning and the difficulty in predicting reaction conditions make it a significant challenge. We introduce a new Monte Carlo Tree Search (MCTS) variant that promotes balance between exploration and exploitation across the synthesis space. Together with a value network trained from reinforcement learning and a solvent-prediction neural network, our algorithm is comparable to the best MCTS variant (PUCT, similar to Google's Alpha Go) in finding valid synthesis pathways within a fixed searching time, and superior in identifying shorter routes with greener solvents under the same search conditions. In addition, with the same root compound visit count, our algorithm outperforms the PUCT MCTS by 16% in terms of determining successful routes. Overall the success rate is improved by 19.7% compared to the upper confidence bound applied to trees (UCT) MCTS method. Moreover, we improve 71.4% of the routes proposed by the PUCT MCTS variant in pathway length and choices of green solvents. The approach generally enables including Green Chemistry considerations in computer aided synthesis planning with potential applications in process development for fine chemicals or pharmaceuticals.
Affiliation(s)
- Xiaoxue Wang
- Department of Chemical Engineering, Massachusetts Institute of Technology Cambridge Massachusetts 02139 USA
- Department of Chemical and Biomolecular Engineering, The Ohio State University Columbus Ohio 43210 USA
| | - Yujie Qian
- Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology Cambridge Massachusetts 02139 USA
| | - Hanyu Gao
- Department of Chemical Engineering, Massachusetts Institute of Technology Cambridge Massachusetts 02139 USA
| | - Connor W Coley
- Department of Chemical Engineering, Massachusetts Institute of Technology Cambridge Massachusetts 02139 USA
| | - Yiming Mo
- Department of Chemical Engineering, Massachusetts Institute of Technology Cambridge Massachusetts 02139 USA
| | - Regina Barzilay
- Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology Cambridge Massachusetts 02139 USA
| | - Klavs F Jensen
- Department of Chemical Engineering, Massachusetts Institute of Technology Cambridge Massachusetts 02139 USA
| |
|
37
|
Stokes JM, Yang K, Swanson K, Jin W, Cubillos-Ruiz A, Donghia NM, MacNair CR, French S, Carfrae LA, Bloom-Ackermann Z, Tran VM, Chiappino-Pepe A, Badran AH, Andrews IW, Chory EJ, Church GM, Brown ED, Jaakkola TS, Barzilay R, Collins JJ. A Deep Learning Approach to Antibiotic Discovery. Cell 2020; 180:688-702.e13. [PMID: 32084340 DOI: 10.1016/j.cell.2020.01.021] [Citation(s) in RCA: 660] [Impact Index Per Article: 165.0] [Received: 08/27/2019] [Revised: 12/04/2019] [Accepted: 01/15/2020] [Indexed: 02/06/2023]
Abstract
Due to the rapid emergence of antibiotic-resistant bacteria, there is a growing need to discover new antibiotics. To address this challenge, we trained a deep neural network capable of predicting molecules with antibacterial activity. We performed predictions on multiple chemical libraries and discovered a molecule from the Drug Repurposing Hub, halicin, that is structurally divergent from conventional antibiotics and displays bactericidal activity against a wide phylogenetic spectrum of pathogens including Mycobacterium tuberculosis and carbapenem-resistant Enterobacteriaceae. Halicin also effectively treated Clostridioides difficile and pan-resistant Acinetobacter baumannii infections in murine models. Additionally, from a discrete set of 23 empirically tested predictions from >107 million molecules curated from the ZINC15 database, our model identified eight antibacterial compounds that are structurally distant from known antibiotics. This work highlights the utility of deep learning approaches to expand our antibiotic arsenal through the discovery of structurally distinct antibacterial molecules.
Affiliation(s)
- Jonathan M Stokes
- Department of Biological Engineering, Synthetic Biology Center, Institute for Medical Engineering and Science, Massachusetts Institute of Technology, Cambridge, MA 02139, USA; Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA; Machine Learning for Pharmaceutical Discovery and Synthesis Consortium, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
| | - Kevin Yang
- Machine Learning for Pharmaceutical Discovery and Synthesis Consortium, Massachusetts Institute of Technology, Cambridge, MA 02139, USA; Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
| | - Kyle Swanson
- Machine Learning for Pharmaceutical Discovery and Synthesis Consortium, Massachusetts Institute of Technology, Cambridge, MA 02139, USA; Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
| | - Wengong Jin
- Machine Learning for Pharmaceutical Discovery and Synthesis Consortium, Massachusetts Institute of Technology, Cambridge, MA 02139, USA; Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
| | - Andres Cubillos-Ruiz
- Department of Biological Engineering, Synthetic Biology Center, Institute for Medical Engineering and Science, Massachusetts Institute of Technology, Cambridge, MA 02139, USA; Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA; Wyss Institute for Biologically Inspired Engineering, Harvard University, Boston, MA 02115, USA
| | - Nina M Donghia
- Department of Biological Engineering, Synthetic Biology Center, Institute for Medical Engineering and Science, Massachusetts Institute of Technology, Cambridge, MA 02139, USA; Wyss Institute for Biologically Inspired Engineering, Harvard University, Boston, MA 02115, USA
| | - Craig R MacNair
- Department of Biochemistry and Biomedical Sciences, Michael G. DeGroote Institute for Infectious Disease Research, McMaster University, Hamilton, ON L8N 3Z5, Canada
| | - Shawn French
- Department of Biochemistry and Biomedical Sciences, Michael G. DeGroote Institute for Infectious Disease Research, McMaster University, Hamilton, ON L8N 3Z5, Canada
| | - Lindsey A Carfrae
- Department of Biochemistry and Biomedical Sciences, Michael G. DeGroote Institute for Infectious Disease Research, McMaster University, Hamilton, ON L8N 3Z5, Canada
| | - Zohar Bloom-Ackermann
- Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA; Department of Genetics, Harvard Medical School, Boston, MA 02115, USA
| | - Victoria M Tran
- Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
| | - Anush Chiappino-Pepe
- Wyss Institute for Biologically Inspired Engineering, Harvard University, Boston, MA 02115, USA; Department of Genetics, Harvard Medical School, Boston, MA 02115, USA
| | - Ahmed H Badran
- Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
| | - Ian W Andrews
- Department of Biological Engineering, Synthetic Biology Center, Institute for Medical Engineering and Science, Massachusetts Institute of Technology, Cambridge, MA 02139, USA; Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA; Wyss Institute for Biologically Inspired Engineering, Harvard University, Boston, MA 02115, USA
| | - Emma J Chory
- Department of Biological Engineering, Synthetic Biology Center, Institute for Medical Engineering and Science, Massachusetts Institute of Technology, Cambridge, MA 02139, USA; Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
| | - George M Church
- Wyss Institute for Biologically Inspired Engineering, Harvard University, Boston, MA 02115, USA; Department of Genetics, Harvard Medical School, Boston, MA 02115, USA; Harvard-MIT Program in Health Sciences and Technology, Cambridge, MA 02139, USA
| | - Eric D Brown
- Department of Biochemistry and Biomedical Sciences, Michael G. DeGroote Institute for Infectious Disease Research, McMaster University, Hamilton, ON L8N 3Z5, Canada
| | - Tommi S Jaakkola
- Machine Learning for Pharmaceutical Discovery and Synthesis Consortium, Massachusetts Institute of Technology, Cambridge, MA 02139, USA; Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
| | - Regina Barzilay
- Machine Learning for Pharmaceutical Discovery and Synthesis Consortium, Massachusetts Institute of Technology, Cambridge, MA 02139, USA; Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139, USA; Abdul Latif Jameel Clinic for Machine Learning in Health, Massachusetts Institute of Technology, Cambridge, MA 02139, USA.
| | - James J Collins
- Department of Biological Engineering, Synthetic Biology Center, Institute for Medical Engineering and Science, Massachusetts Institute of Technology, Cambridge, MA 02139, USA; Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA; Wyss Institute for Biologically Inspired Engineering, Harvard University, Boston, MA 02115, USA; Harvard-MIT Program in Health Sciences and Technology, Cambridge, MA 02139, USA; Abdul Latif Jameel Clinic for Machine Learning in Health, Massachusetts Institute of Technology, Cambridge, MA 02139, USA.
| |
|
38
|
Hirschfeld L, Swanson K, Yang K, Barzilay R, Coley CW. Uncertainty Quantification Using Neural Networks for Molecular Property Prediction. J Chem Inf Model 2020; 60:3770-3780. [PMID: 32702986 DOI: 10.1021/acs.jcim.0c00502] [Citation(s) in RCA: 72] [Impact Index Per Article: 18.0] [Indexed: 12/17/2022]
Abstract
Uncertainty quantification (UQ) is an important component of molecular property prediction, particularly for drug discovery applications where model predictions direct experimental design and where unanticipated imprecision wastes valuable time and resources. The need for UQ is especially acute for neural models, which are becoming increasingly standard yet are challenging to interpret. While several approaches to UQ have been proposed in the literature, there is no clear consensus on the comparative performance of these models. In this paper, we study this question in the context of regression tasks. We systematically evaluate several methods on five regression data sets using multiple complementary performance metrics. Our experiments show that none of the methods we tested is unequivocally superior to all others, and none produces a particularly reliable ranking of errors across multiple data sets. While we believe that these results show that existing UQ methods are not sufficient for all common use cases and further research is needed, we conclude with a practical recommendation as to which existing techniques seem to perform well relative to others.
Affiliation(s)
- Lior Hirschfeld
- Computer Science and Artificial Intelligence Laboratory, MIT, Cambridge, Massachusetts 02139, United States
| | - Kyle Swanson
- Department of Pure Mathematics and Mathematical Statistics, University of Cambridge, Cambridge CB3 0WB, U.K
| | - Kevin Yang
- Department of Electrical Engineering and Computer Sciences, University of California Berkeley, Berkeley, California 94720, United States
| | - Regina Barzilay
- Computer Science and Artificial Intelligence Laboratory, MIT, Cambridge, Massachusetts 02139, United States
| | - Connor W Coley
- Department of Chemical Engineering, MIT, Cambridge, Massachusetts 02139, United States
| |
|
39
|
Deng Z, Yin K, Bao Y, Armengol VD, Wang C, Tiwari A, Barzilay R, Parmigiani G, Braun D, Hughes KS. Validation of a Semiautomated Natural Language Processing-Based Procedure for Meta-Analysis of Cancer Susceptibility Gene Penetrance. JCO Clin Cancer Inform 2020; 3:1-9. [PMID: 31419182 DOI: 10.1200/cci.19.00043] [Citation(s) in RCA: 18] [Impact Index Per Article: 4.5] [Indexed: 11/20/2022] Open
Abstract
PURPOSE Quantifying the risk of cancer associated with pathogenic mutations in germline cancer susceptibility genes (that is, penetrance) enables the personalization of preventive management strategies. Conducting a meta-analysis is the best way to obtain robust risk estimates. We have previously developed a natural language processing (NLP)-based abstract classifier which classifies abstracts as relevant to penetrance, prevalence of mutations, both, or neither. In this work, we evaluate the performance of this NLP-based procedure. MATERIALS AND METHODS We compared the semiautomated NLP-based procedure, which involves automated abstract classification and text mining, followed by human review of identified studies, with the traditional procedure that requires human review of all studies. Ten high-quality gene-cancer penetrance meta-analyses spanning 16 gene-cancer associations were used as the gold standard by which to evaluate the performance of our procedure. For each meta-analysis, we evaluated the number of abstracts that required human review (workload) and the ability to identify the studies that were included by the authors in their quantitative analysis (coverage). RESULTS Compared with the traditional procedure, the semiautomated NLP-based procedure led to a lower workload across all 10 meta-analyses, with an overall 84% reduction (2,774 abstracts v 16,941 abstracts) in the amount of human review required. Overall coverage was 93% (we are able to identify 132 of 142 studies) before reviewing references of identified studies. Reasons for the 10 missed studies included blank and poorly written abstracts. After reviewing references, nine of the previously missed studies were identified and coverage improved to 99% (141 of 142 studies). CONCLUSION We demonstrated that an NLP-based procedure can significantly reduce the review workload without compromising the ability to identify relevant studies.
NLP algorithms have promising potential for reducing human efforts in the literature review process.
Affiliation(s)
| | - Kanhua Yin
- Massachusetts General Hospital, Boston, MA
| | - Yujia Bao
- Massachusetts Institute of Technology, Boston, MA
| | | | - Cathy Wang
- Harvard TH Chan School of Public Health, Boston, MA; Dana-Farber Cancer Institute, Boston, MA
| | | | | | - Giovanni Parmigiani
- Harvard TH Chan School of Public Health, Boston, MA; Dana-Farber Cancer Institute, Boston, MA
| | - Danielle Braun
- Harvard TH Chan School of Public Health, Boston, MA; Dana-Farber Cancer Institute, Boston, MA
| | - Kevin S Hughes
- Massachusetts General Hospital, Boston, MA; Harvard Medical School, Boston, MA
| |
|
40
|
Santus E, Li C, Yala A, Peck D, Soomro R, Faridi N, Mamshad I, Tang R, Lanahan CR, Barzilay R, Hughes K. Do Neural Information Extraction Algorithms Generalize Across Institutions? JCO Clin Cancer Inform 2020; 3:1-8. [PMID: 31310566 DOI: 10.1200/cci.18.00160] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Indexed: 11/20/2022] Open
Abstract
PURPOSE Natural language processing (NLP) techniques have been adopted to reduce the curation costs of electronic health records. However, studies have questioned whether such techniques can be applied to data from previously unseen institutions. We investigated the performance of a common neural NLP algorithm on data from both known and heldout (ie, institutions whose data were withheld from the training set and only used for testing) hospitals. We also explored how diversity in the training data affects the system's generalization ability. METHODS We collected 24,881 breast pathology reports from seven hospitals and manually annotated them with nine key attributes that describe types of atypia and cancer. We trained a convolutional neural network (CNN) on annotations from either only one (CNN1), only two (CNN2), or only four (CNN4) hospitals. The trained systems were tested on data from five organizations, including both known and heldout ones. For every setting, we provide the accuracy scores as well as the learning curves that show how much data are necessary to achieve good performance and generalizability. RESULTS The system achieved a cross-institutional accuracy of 93.87% when trained on reports from only one hospital (CNN1). Performance improved to 95.7% and 96%, respectively, when the system was trained on reports from two (CNN2) and four (CNN4) hospitals. The introduction of diversity during training did not lead to improvements on the known institutions, but it boosted performance on the heldout institutions. When tested on reports from heldout hospitals, CNN4 outperformed CNN1 and CNN2 by 2.13% and 0.3%, respectively. CONCLUSION Real-world scenarios require that neural NLP approaches scale to data from previously unseen institutions. We show that a common neural NLP algorithm for information extraction can achieve this goal, especially when diverse data are used during training.
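The known versus held-out evaluation described in this abstract depends on partitioning reports by their source institution. The sketch below shows one way such a split could be expressed; the record layout, the function name `split_by_institution`, and the toy data are assumptions made for illustration and are not taken from the study.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

# Hypothetical record layout: (institution id, report text).
Report = Tuple[str, str]


def split_by_institution(
    reports: Iterable[Report], train_institutions: Set[str]
) -> Tuple[List[Report], Dict[str, List[Report]]]:
    """Separate reports from training institutions from held-out institutions.

    Reports from institutions in `train_institutions` go to the training pool;
    all others are grouped per institution so each held-out site can be
    scored on its own.
    """
    train: List[Report] = []
    heldout: Dict[str, List[Report]] = defaultdict(list)
    for institution, text in reports:
        if institution in train_institutions:
            train.append((institution, text))
        else:
            heldout[institution].append((institution, text))
    return train, dict(heldout)


# Toy data standing in for annotated pathology reports.
reports = [
    ("A", "atypical ductal hyperplasia"),
    ("A", "benign breast tissue"),
    ("B", "invasive ductal carcinoma"),
    ("C", "lobular carcinoma in situ"),
    ("C", "benign breast tissue"),
]

# Analogue of the single-hospital setting: train on institution A only.
train, heldout = split_by_institution(reports, {"A"})
print(len(train), sorted(heldout))
```

Widening `train_institutions` to two or four sites would mirror the CNN2 and CNN4 settings, with the remaining sites serving as held-out test sets.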
Affiliation(s)
- Enrico Santus
- Massachusetts Institute of Technology, Cambridge, MA
- Clara Li
- Massachusetts Institute of Technology, Cambridge, MA
- Adam Yala
- Massachusetts Institute of Technology, Cambridge, MA
- Donald Peck
- Henry Ford Health System, Detroit, MI; Michigan Technological University, Houghton, MI
- Rufina Soomro
- Liaquat National Hospital & Medical College, Karachi, Pakistan
- Naveen Faridi
- Liaquat National Hospital & Medical College, Karachi, Pakistan
- Isra Mamshad
- Liaquat National Hospital & Medical College, Karachi, Pakistan
- Rong Tang
- Rochester General Hospital, Rochester, NY
41
Bao Y, Deng Z, Wang Y, Kim H, Armengol VD, Acevedo F, Ouardaoui N, Wang C, Parmigiani G, Barzilay R, Braun D, Hughes KS. Using Machine Learning and Natural Language Processing to Review and Classify the Medical Literature on Cancer Susceptibility Genes. JCO Clin Cancer Inform 2020; 3:1-9. [PMID: 31545655 DOI: 10.1200/cci.19.00042] [Citation(s) in RCA: 23] [Impact Index Per Article: 5.8] [Indexed: 12/15/2022]
Abstract
PURPOSE The medical literature relevant to germline genetics is growing exponentially. Clinicians need tools that help to monitor and prioritize the literature to understand the clinical implications of pathogenic genetic variants. We developed and evaluated two machine learning models to classify abstracts as relevant to the penetrance (risk of cancer for germline mutation carriers) or prevalence of germline genetic mutations. MATERIALS AND METHODS We conducted literature searches in PubMed and retrieved paper titles and abstracts to create an annotated data set for training and evaluating the two machine learning classification models. Our first model is a support vector machine (SVM) that learns a linear decision rule on the basis of the bag-of-ngrams representation of each title and abstract. Our second model is a convolutional neural network (CNN) that learns a complex nonlinear decision rule on the basis of the raw title and abstract. We evaluated the performance of the two models on the classification of papers as relevant to penetrance or prevalence. RESULTS For penetrance classification, we annotated 3,740 paper titles and abstracts and evaluated the two models using 10-fold cross-validation. The SVM model achieved 88.93% accuracy (percentage of papers that were correctly classified), whereas the CNN model achieved 88.53% accuracy. For prevalence classification, we annotated 3,753 paper titles and abstracts. The SVM model achieved 88.92% accuracy and the CNN model achieved 88.52% accuracy. CONCLUSION Our models achieve high accuracy in classifying abstracts as relevant to penetrance or prevalence. By facilitating literature review, this tool could help clinicians and researchers keep abreast of the burgeoning knowledge of gene-cancer associations and keep the knowledge bases for clinical decision support tools up to date.
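The bag-of-ngrams representation named in this abstract can be illustrated with a short sketch. The abstract does not state the tokenization or the n-gram range used, so lowercasing, whitespace splitting, and a maximum n of 2 are assumptions here, and `bag_of_ngrams` is a hypothetical helper name.

```python
from collections import Counter


def bag_of_ngrams(text: str, max_n: int = 2) -> Counter:
    """Count every contiguous n-gram of length 1..max_n in the text.

    Word order beyond each n-gram window is discarded, which is what makes
    the representation a "bag".
    """
    tokens = text.lower().split()  # assumed tokenization: lowercase + whitespace
    bag: Counter = Counter()
    for n in range(1, max_n + 1):
        for start in range(len(tokens) - n + 1):
            bag[" ".join(tokens[start:start + n])] += 1
    return bag


# Toy title standing in for a PubMed title and abstract.
features = bag_of_ngrams("BRCA1 mutation carriers and breast cancer risk in BRCA1 families")
print(features["brca1"], features["breast cancer"])
```

A linear classifier such as an SVM would then learn one weight per distinct n-gram, which is why the abstract describes its decision rule as linear over this representation.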
Affiliation(s)
- Yujia Bao
- Massachusetts Institute of Technology, Boston, MA
- Yan Wang
- Massachusetts General Hospital, Boston, MA
- Heeyoon Kim
- Massachusetts Institute of Technology, Boston, MA
- Cathy Wang
- Harvard T.H. Chan School of Public Health, Boston, MA; Dana-Farber Cancer Institute, Boston, MA
- Giovanni Parmigiani
- Harvard T.H. Chan School of Public Health, Boston, MA; Dana-Farber Cancer Institute, Boston, MA
- Danielle Braun
- Harvard T.H. Chan School of Public Health, Boston, MA; Dana-Farber Cancer Institute, Boston, MA
- Kevin S Hughes
- Massachusetts General Hospital, Boston, MA; Harvard Medical School, Boston, MA
42
Struble TJ, Alvarez JC, Brown SP, Chytil M, Cisar J, DesJarlais RL, Engkvist O, Frank SA, Greve DR, Griffin DJ, Hou X, Johannes JW, Kreatsoulas C, Lahue B, Mathea M, Mogk G, Nicolaou CA, Palmer AD, Price DJ, Robinson RI, Salentin S, Xing L, Jaakkola T, Green WH, Barzilay R, Coley CW, Jensen KF. Current and Future Roles of Artificial Intelligence in Medicinal Chemistry Synthesis. J Med Chem 2020; 63:8667-8682. [PMID: 32243158 PMCID: PMC7457232 DOI: 10.1021/acs.jmedchem.9b02120] [Citation(s) in RCA: 79] [Impact Index Per Article: 19.8] [Indexed: 12/20/2022]
Abstract
Artificial intelligence and machine learning have demonstrated their potential role in predictive chemistry and synthetic planning of small molecules; there are at least a few reports of companies employing in silico synthetic planning into their overall approach to accessing target molecules. A data-driven synthesis planning program is one component being developed and evaluated by the Machine Learning for Pharmaceutical Discovery and Synthesis (MLPDS) consortium, comprising MIT and 13 chemical and pharmaceutical company members. Together, we wrote this perspective to share how we think predictive models can be integrated into medicinal chemistry synthesis workflows, how they are currently used within MLPDS member companies, and the outlook for this field.
Affiliation(s)
- Thomas J Struble
- Department of Chemical Engineering, MIT, Cambridge, Massachusetts 02139, United States
- Juan C Alvarez
- Computational and Structural Chemistry, Merck & Co. Inc., Kenilworth, New Jersey 07033, United States
- Scott P Brown
- Sunovion Pharmaceuticals Inc., Marlborough, Massachusetts 01752, United States
- Milan Chytil
- Sunovion Pharmaceuticals Inc., Marlborough, Massachusetts 01752, United States
- Justin Cisar
- Janssen Research & Development LLC, Spring House, Pennsylvania 19477, United States
- Renee L DesJarlais
- Janssen Research & Development LLC, Spring House, Pennsylvania 19477, United States
- Ola Engkvist
- Hit Discovery, Discovery Sciences, R&D, AstraZeneca, 431 83 Mölndal, Sweden
- Scott A Frank
- Eli Lilly and Company, Indianapolis, Indiana 46285, United States
- Daniel R Greve
- LEO Pharma A/S, Industriparken 55, DK-2750 Ballerup, Denmark
- Xinjun Hou
- Pfizer Inc., Cambridge, Massachusetts 02139, United States
- Jeffrey W Johannes
- Medicinal Chemistry, Early Oncology, Oncology R&D, AstraZeneca, Boston, Massachusetts 02451, United States
- Brian Lahue
- Computational and Structural Chemistry, Merck & Co. Inc., Kenilworth, New Jersey 07033, United States
- Miriam Mathea
- BASF SE, Carl-Bosch-Strasse 38, 67056 Ludwigshafen am Rhein, Germany
- Andrew D Palmer
- BASF SE, Carl-Bosch-Strasse 38, 67056 Ludwigshafen am Rhein, Germany
- Daniel J Price
- GlaxoSmithKline, Collegeville, Pennsylvania 19426, United States
- Richard I Robinson
- Novartis Institutes for BioMedical Research, Cambridge, Massachusetts 02139, United States
- Li Xing
- WuXi AppTec, Cambridge, Massachusetts 02142, United States
- Tommi Jaakkola
- Computer Science and Artificial Intelligence Laboratory, MIT, Cambridge, Massachusetts 02139, United States
- William H Green
- Department of Chemical Engineering, MIT, Cambridge, Massachusetts 02139, United States
- Regina Barzilay
- Computer Science and Artificial Intelligence Laboratory, MIT, Cambridge, Massachusetts 02139, United States
- Connor W Coley
- Department of Chemical Engineering, MIT, Cambridge, Massachusetts 02139, United States
- Klavs F Jensen
- Department of Chemical Engineering, MIT, Cambridge, Massachusetts 02139, United States
43
Stokes JM, Yang K, Swanson K, Jin W, Cubillos-Ruiz A, Donghia NM, MacNair CR, French S, Carfrae LA, Bloom-Ackermann Z, Tran VM, Chiappino-Pepe A, Badran AH, Andrews IW, Chory EJ, Church GM, Brown ED, Jaakkola TS, Barzilay R, Collins JJ. A Deep Learning Approach to Antibiotic Discovery. Cell 2020; 181:475-483. [DOI: 10.1016/j.cell.2020.04.001] [Citation(s) in RCA: 57] [Impact Index Per Article: 14.3] [Indexed: 11/27/2022]
44
Yang K, Swanson K, Jin W, Coley C, Eiden P, Gao H, Guzman-Perez A, Hopper T, Kelley B, Mathea M, Palmer A, Settels V, Jaakkola T, Jensen K, Barzilay R. Correction to Analyzing Learned Molecular Representations for Property Prediction. J Chem Inf Model 2019; 59:5304-5305. [PMID: 31814400 PMCID: PMC8154261 DOI: 10.1021/acs.jcim.9b01076] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.0] [Indexed: 11/28/2022]
Affiliation(s)
- Kevin Yang
- Computer Science and Artificial Intelligence Laboratory, MIT, Cambridge, Massachusetts 02139, United States
- Kyle Swanson
- Computer Science and Artificial Intelligence Laboratory, MIT, Cambridge, Massachusetts 02139, United States
- Wengong Jin
- Computer Science and Artificial Intelligence Laboratory, MIT, Cambridge, Massachusetts 02139, United States
- Connor Coley
- Department of Chemical Engineering, MIT, Cambridge, Massachusetts 02139, United States
- Hua Gao
- Amgen Inc., Cambridge, Massachusetts 02141, United States
- Timothy Hopper
- Amgen Inc., Cambridge, Massachusetts 02141, United States
- Brian Kelley
- Novartis Institutes for BioMedical Research, Cambridge, Massachusetts 02139, United States
- Tommi Jaakkola
- Computer Science and Artificial Intelligence Laboratory, MIT, Cambridge, Massachusetts 02139, United States
- Klavs Jensen
- Department of Chemical Engineering, MIT, Cambridge, Massachusetts 02139, United States
- Regina Barzilay
- Computer Science and Artificial Intelligence Laboratory, MIT, Cambridge, Massachusetts 02139, United States
45
Hu SY, Santus E, Forsyth AW, Malhotra D, Haimson J, Chatterjee NA, Kramer DB, Barzilay R, Tulsky JA, Lindvall C. Can machine learning improve patient selection for cardiac resynchronization therapy? PLoS One 2019; 14:e0222397. [PMID: 31581234 PMCID: PMC6776390 DOI: 10.1371/journal.pone.0222397] [Citation(s) in RCA: 20] [Impact Index Per Article: 4.0] [Received: 06/21/2019] [Accepted: 08/28/2019] [Indexed: 12/25/2022]
Abstract
RATIONALE Multiple clinical trials support the effectiveness of cardiac resynchronization therapy (CRT); however, optimal patient selection remains challenging due to substantial treatment heterogeneity among patients who meet the clinical practice guidelines. OBJECTIVE To apply machine learning to create an algorithm that predicts CRT outcome using electronic health record (EHR) data available before the procedure. METHODS AND RESULTS We applied machine learning and natural language processing to the EHR of 990 patients who received CRT at two academic hospitals between 2004 and 2015. The primary outcome was reduced CRT benefit, defined as <0% improvement in left ventricular ejection fraction (LVEF) 6-18 months post-procedure or death by 18 months. Data regarding demographics, laboratory values, medications, clinical characteristics, and past health services utilization were extracted from the EHR available before the CRT procedure. Bigrams (i.e., two-word sequences) were also extracted from the clinical notes using natural language processing. Patients accrued on average 75 clinical notes (SD, 29) before the procedure, including data not captured anywhere else in the EHR. A machine learning model was built using 80% of the patient sample (training and validation dataset), and tested on a held-out 20% patient sample (test dataset). Among 990 patients receiving CRT, the mean age was 71.6 (SD, 11.8), 78.1% were male, 87.2% non-Hispanic white, and the mean baseline LVEF was 24.8% (SD, 7.69). Out of 990 patients, 403 (40.7%) were identified as having a reduced benefit from the CRT device (<0% LVEF improvement in 25.2%, death by 18 months in 15.6%). The final model identified 26% of these patients at a positive predictive value of 79% (model performance: Fβ (β = 0.1): 77%; recall 0.26; precision 0.79; accuracy 0.65). CONCLUSIONS A machine learning model that leveraged readily available EHR data and clinical notes identified a subset of CRT patients who may not benefit from CRT before the procedure.
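The Fβ score reported in this abstract follows from the stated precision and recall under the standard definition Fβ = (1 + β²)·P·R / (β²·P + R). The sketch below applies that general formula to the quoted values; the function name `f_beta` is illustrative, and the abstract itself does not spell out the formula.

```python
def f_beta(precision: float, recall: float, beta: float) -> float:
    """Weighted harmonic mean of precision and recall.

    beta < 1 weights precision more heavily; beta > 1 weights recall more.
    """
    beta_sq = beta * beta
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# Values quoted in the abstract: precision 0.79, recall 0.26, beta = 0.1.
score = f_beta(precision=0.79, recall=0.26, beta=0.1)
print(f"F-beta (beta=0.1): {score:.3f}")
```

With β = 0.1 the score sits close to the precision, which is consistent with the reported 77%. The same precision and recall would give a balanced F1 of roughly 0.39, so the choice of β signals that false positives were treated as far more costly than missed patients.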
Affiliation(s)
- Szu-Yeu Hu
- Department of Radiology, Massachusetts General Hospital, Boston, Massachusetts, United States of America
- Enrico Santus
- Department of Electrical Engineering and Computer Science, CSAIL, MIT, Cambridge, Massachusetts, United States of America
- Alexander W. Forsyth
- Department of Electrical Engineering and Computer Science, CSAIL, MIT, Cambridge, Massachusetts, United States of America
- Devvrat Malhotra
- Department of Health Policy and Management, Harvard School of Public Health, Boston, Massachusetts, United States of America
- Josh Haimson
- Department of Electrical Engineering and Computer Science, CSAIL, MIT, Cambridge, Massachusetts, United States of America
- Neal A. Chatterjee
- Division of Cardiology, Department of Medicine, University of Washington, Seattle, Washington, United States of America
- Daniel B. Kramer
- Richard A. and Susan F. Smith Center for Outcomes Research, Division of Cardiology, Beth Israel Deaconess Medical Center, Boston, Massachusetts, United States of America
- Regina Barzilay
- Department of Electrical Engineering and Computer Science, CSAIL, MIT, Cambridge, Massachusetts, United States of America
- James A. Tulsky
- Department of Psychosocial Oncology and Palliative Care, Dana-Farber Cancer Institute, Boston, Massachusetts, United States of America
- Division of Palliative Medicine, Department of Medicine, Brigham and Women’s Hospital, Boston, Massachusetts, United States of America
- Charlotta Lindvall
- Department of Psychosocial Oncology and Palliative Care, Dana-Farber Cancer Institute, Boston, Massachusetts, United States of America
- Division of Palliative Medicine, Department of Medicine, Brigham and Women’s Hospital, Boston, Massachusetts, United States of America
46
Yang K, Swanson K, Jin W, Coley C, Eiden P, Gao H, Guzman-Perez A, Hopper T, Kelley B, Mathea M, Palmer A, Settels V, Jaakkola T, Jensen K, Barzilay R. Analyzing Learned Molecular Representations for Property Prediction. J Chem Inf Model 2019; 59:3370-3388. [PMID: 31361484 PMCID: PMC6727618 DOI: 10.1021/acs.jcim.9b00237] [Citation(s) in RCA: 533] [Impact Index Per Article: 106.6] [Received: 03/18/2019] [Indexed: 12/23/2022]
Abstract
Advancements in neural machinery have led to a wide range of algorithmic solutions for molecular property prediction. Two classes of models in particular have yielded promising results: neural networks applied to computed molecular fingerprints or expert-crafted descriptors and graph convolutional neural networks that construct a learned molecular representation by operating on the graph structure of the molecule. However, recent literature has yet to clearly determine which of these two methods is superior when generalizing to new chemical space. Furthermore, prior research has rarely examined these new models in industry research settings in comparison to existing employed models. In this paper, we benchmark models extensively on 19 public and 16 proprietary industrial data sets spanning a wide variety of chemical end points. In addition, we introduce a graph convolutional model that consistently matches or outperforms models using fixed molecular descriptors as well as previous graph neural architectures on both public and proprietary data sets. Our empirical findings indicate that while approaches based on these representations have yet to reach the level of experimental reproducibility, our proposed model nevertheless offers significant improvements over models currently used in industrial workflows.
Affiliation(s)
- Kevin Yang
- Computer Science and Artificial Intelligence Laboratory, MIT, Cambridge, Massachusetts 02139, United States
- Kyle Swanson
- Computer Science and Artificial Intelligence Laboratory, MIT, Cambridge, Massachusetts 02139, United States
- Wengong Jin
- Computer Science and Artificial Intelligence Laboratory, MIT, Cambridge, Massachusetts 02139, United States
- Connor Coley
- Department of Chemical Engineering, MIT, Cambridge, Massachusetts 02139, United States
- Hua Gao
- Amgen Inc., Cambridge, Massachusetts 02141, United States
- Timothy Hopper
- Amgen Inc., Cambridge, Massachusetts 02141, United States
- Brian Kelley
- Novartis Institutes for BioMedical Research, Cambridge, Massachusetts 02139, United States
- Tommi Jaakkola
- Computer Science and Artificial Intelligence Laboratory, MIT, Cambridge, Massachusetts 02139, United States
- Klavs Jensen
- Department of Chemical Engineering, MIT, Cambridge, Massachusetts 02139, United States
- Regina Barzilay
- Computer Science and Artificial Intelligence Laboratory, MIT, Cambridge, Massachusetts 02139, United States
47
Abstract
Background Recent deep learning (DL) approaches have shown promise in improving sensitivity but have not addressed limitations in radiologist specificity or efficiency. Purpose To develop a DL model to triage a portion of mammograms as cancer free, improving performance and workflow efficiency. Materials and Methods In this retrospective study, 223 109 consecutive screening mammograms performed in 66 661 women from January 2009 to December 2016 were collected with cancer outcomes obtained through linkage to a regional tumor registry. This cohort was split by patient into 212 272, 25 999, and 26 540 mammograms from 56 831, 7021, and 7176 patients for training, validation, and testing, respectively. A DL model was developed to triage mammograms as cancer free and evaluated on the test set. A DL-triage workflow was simulated in which radiologists skipped mammograms triaged as cancer free (interpreting them as negative for cancer) and read mammograms not triaged as cancer free by using the original interpreting radiologists' assessments. Sensitivities, specificities, and percentage of mammograms read were calculated, with and without the DL-triage-simulated workflow. Statistics were computed across 5000 bootstrap samples to assess confidence intervals (CIs). Specificities were compared by using a two-tailed t test (P < .05) and sensitivities were compared by using a one-sided t test with a noninferiority margin of 5% (P < .05). Results The test set included 7176 women (mean age, 57.8 years ± 10.9 [standard deviation]). When reading all mammograms, radiologists obtained a sensitivity and specificity of 90.6% (173 of 191; 95% CI: 86.6%, 94.7%) and 93.5% (24 625 of 26 349; 95% CI: 93.3%, 93.9%). In the DL-simulated workflow, the radiologists obtained a sensitivity and specificity of 90.1% (172 of 191; 95% CI: 86.0%, 94.3%) and 94.2% (24 814 of 26 349; 95% CI: 94.0%, 94.6%) while reading 80.7% (21 420 of 26 540) of the mammograms. 
The simulated workflow improved specificity (P = .002) and obtained a noninferior sensitivity with a margin of 5% (P < .001). Conclusion This deep learning model has the potential to reduce radiologist workload and significantly improve specificity without harming sensitivity. © RSNA, 2019 Online supplemental material is available for this article. See also the editorial by Kontos and Conant in this issue.
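The abstract reports confidence intervals computed across 5000 bootstrap samples. As a rough illustration of how such an interval can be obtained for the all-mammograms sensitivity (173 of 191 cancers detected), the sketch below uses a percentile bootstrap. The percentile method, the fixed seed, and the helper name `bootstrap_ci` are assumptions; the abstract does not say which bootstrap variant the authors used.

```python
import random
from typing import List, Tuple


def bootstrap_ci(
    outcomes: List[int], n_resamples: int = 5000, alpha: float = 0.05, seed: int = 0
) -> Tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean of binary outcomes."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(outcomes)
    # Resample with replacement and record the mean of each resample.
    means = sorted(sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# 1 = cancer detected by the radiologist, 0 = cancer missed (counts from the abstract).
detected = [1] * 173 + [0] * 18
sensitivity = sum(detected) / len(detected)
lower, upper = bootstrap_ci(detected)
print(f"sensitivity {sensitivity:.3f}, 95% CI ({lower:.3f}, {upper:.3f})")
```

The resulting interval should land near the reported 86.6% to 94.7%, though the exact bounds depend on the random seed and on the bootstrap variant chosen.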
Affiliation(s)
- Adam Yala
- From the Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, Mass (A.Y., T.S., R.B.); and Department of Radiology, Massachusetts General Hospital, Harvard Medical School, 55 Fruit St, WAC 240, Boston, Mass 02114-2698 (R.M., C.L.)
- Tal Schuster
- From the Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, Mass (A.Y., T.S., R.B.); and Department of Radiology, Massachusetts General Hospital, Harvard Medical School, 55 Fruit St, WAC 240, Boston, Mass 02114-2698 (R.M., C.L.)
- Randy Miles
- From the Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, Mass (A.Y., T.S., R.B.); and Department of Radiology, Massachusetts General Hospital, Harvard Medical School, 55 Fruit St, WAC 240, Boston, Mass 02114-2698 (R.M., C.L.)
- Regina Barzilay
- From the Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, Mass (A.Y., T.S., R.B.); and Department of Radiology, Massachusetts General Hospital, Harvard Medical School, 55 Fruit St, WAC 240, Boston, Mass 02114-2698 (R.M., C.L.)
- Constance Lehman
- From the Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, Mass (A.Y., T.S., R.B.); and Department of Radiology, Massachusetts General Hospital, Harvard Medical School, 55 Fruit St, WAC 240, Boston, Mass 02114-2698 (R.M., C.L.)
48
Tang R, Acevedo F, Lanahan C, Coopey SB, Yala A, Barzilay R, Li C, Colwell A, Guidi AJ, Cetrulo C, Garber J, Smith BL, Gadd MA, Specht MC, Hughes KS. Incidental breast carcinoma: incidence, management, and outcomes in 4804 bilateral reduction mammoplasties. Breast Cancer Res Treat 2019; 177:741-748. [PMID: 31317348 DOI: 10.1007/s10549-019-05335-4] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Received: 06/17/2019] [Accepted: 06/18/2019] [Indexed: 11/28/2022]
Abstract
INTRODUCTION Bilateral reduction mammoplasty is one of the most common plastic surgery procedures performed in the U.S. This study examines the incidence, management, and prognosis of incidental breast cancer identified in reduction specimens from a large cohort of reduction mammoplasty patients. METHODS Breast pathology reports were retrospectively reviewed for evidence of incidental cancers in bilateral reduction mammoplasty specimens from five institutions between 1990 and 2017. RESULTS A total of 4804 women met the inclusion criteria of this study; incidental cancer was identified in 45 breasts of 39 (0.8%) patients. Six patients (15%) had bilateral cancer. Overall, the maximum diagnosis by breast was 16 invasive cancers and 29 ductal carcinomas in situ. Thirty-three patients had unilateral cancer, 15 (45.5%) of whom had high-risk lesions in the contralateral breast. Twenty-one patients underwent mastectomy (12 bilateral and nine unilateral); residual cancer was found in 10 of 25 (40%) therapeutic mastectomies. Seven patients who did not undergo mastectomy received breast radiation. The median follow-up was 92 months. No local recurrences were observed in the patients undergoing mastectomy or radiation. Three of 11 (27%) patients who did not undergo mastectomy or radiation developed a local recurrence. The overall survival rate was 87.2% and disease-free survival was 82.1%. CONCLUSIONS Patients undergoing reduction mammoplasty for macromastia have a small but definite risk of incidental breast cancer. The high rate of bilateral cancer, contralateral high-risk lesions, and residual disease at mastectomy mandates thorough pathologic evaluation and careful follow-up of these patients. Mastectomy or breast radiation is recommended for local control given the high likelihood of local recurrence without either.
Affiliation(s)
- Rong Tang
- Division of Surgical Oncology, Massachusetts General Hospital, Boston, MA, 02114, USA
- Francisco Acevedo
- Division of Surgical Oncology, Massachusetts General Hospital, Boston, MA, 02114, USA
- Conor Lanahan
- Division of Surgical Oncology, Massachusetts General Hospital, Boston, MA, 02114, USA
- Suzanne B Coopey
- Division of Surgical Oncology, Massachusetts General Hospital, Boston, MA, 02114, USA
- Adam Yala
- Department of Electrical Engineering and Computer Science, CSAIL MIT, Cambridge, 02142, USA
- Regina Barzilay
- Department of Electrical Engineering and Computer Science, CSAIL MIT, Cambridge, 02142, USA
- Clara Li
- Department of Electrical Engineering and Computer Science, CSAIL MIT, Cambridge, 02142, USA
- Amy Colwell
- Plastic Surgery, Massachusetts General Hospital, Boston, MA, 02114, USA
- Anthony J Guidi
- Department of Pathology, Newton-Wellesley Hospital, Newton, MA, 02462, USA
- Curtis Cetrulo
- Plastic Surgery, Massachusetts General Hospital, Boston, MA, 02114, USA
- Judy Garber
- Medical Oncology, Dana-Farber Cancer Institute, Boston, MA, 02215, USA
- Barbara L Smith
- Division of Surgical Oncology, Massachusetts General Hospital, Boston, MA, 02114, USA
- Michele A Gadd
- Division of Surgical Oncology, Massachusetts General Hospital, Boston, MA, 02114, USA
- Michelle C Specht
- Division of Surgical Oncology, Massachusetts General Hospital, Boston, MA, 02114, USA
- Kevin S Hughes
- Division of Surgical Oncology, Massachusetts General Hospital, Boston, MA, 02114, USA
- Harvard Medical School, Boston, MA, 02115, USA
49
Acevedo F, Armengol VD, Deng Z, Tang R, Coopey S, Mazzola E, Lanahan C, Braun D, Yala A, Barzilay R, Li C, Santus E, Colwell A, Guidi A, Cetrulo C, Garber JE, Smith BL, King TA, Hughes KS. Incidental atypical hyperplasia/LCIS in mammoplasty specimens and subsequent risk of breast cancer. J Clin Oncol 2019. [DOI: 10.1200/jco.2019.37.15_suppl.1561] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/20/2022]
Abstract
1561 Background: Proliferative breast lesions with atypia (atypical hyperplasia and lobular carcinoma in situ (LCIS)) increase the risk of breast cancer (BC). Most cases are diagnosed in the context of an abnormal mammogram. Little is known about BC risk for patients with these lesions who are asymptomatic. Mammoplasty specimens allow us to study breast tissue in asymptomatic healthy women. We previously published the rate of atypia in the largest reported mammoplasty cohort. The aim of this study is to examine the risk of BC in the atypia cohort. Methods: Breast pathology reports were retrospectively reviewed for evidence of atypical ductal hyperplasia (ADH), atypical lobular hyperplasia (ALH) or LCIS in bilateral reduction mammoplasty specimens from five institutions within a single healthcare system between 1990 and 2017. Patients with prior or concurrent BC or prior atypia were excluded. Data were extracted from electronic medical records using natural language processing and manual review to assess subsequent risk of BC. Results: From our mammoplasty cohort of 4771 patients, 295 patients were found to have atypia (6.2%) at baseline. Forty of these patients were lost to follow-up and excluded from the study. Of the remaining 255 patients, 13 had severe ADH bordering on ductal carcinoma in situ, 52 had LCIS, 119 had ALH, and 71 had ADH at baseline. The median age at baseline was 52.1 (range 17.9 – 74.3). With a median follow-up of 7.7 years, 9 of the 255 patients developed BC (8 invasive carcinomas, 1 ductal carcinoma in situ). 81.3% of the cohort did not receive chemoprevention. Only one patient out of the nine who developed BC received chemoprevention. The risk of developing BC among women with atypia at baseline was 0.5%, 2.9% and 4.1% at 3, 5 and 10 years, respectively. Conclusions: Patients with asymptomatic atypias found in reduction mammoplasty specimens appear to be at lower risk of developing BC than those diagnosed with atypia in the context of an abnormal mammogram. These results may provide guidance on how to manage this group of patients related to future screening and/or chemoprevention.
Affiliation(s)
- Rong Tang
- Massachusetts General Hospital, Boston, MA
- Adam Yala
- Massachusetts Institute of Technology, Cambridge, MA
- Clara Li
- Massachusetts General Hospital, Boston, MA
- Judy Ellen Garber
- Center for Cancer Genetics and Prevention, Dana-Farber Cancer Institute, Boston, MA
- Tari A. King
- Breast Oncology Program, Dana-Farber/Brigham and Women’s Cancer Center, Boston, MA
50
Arbour KC, Anh Tuan L, Rizvi H, Yala A, Hellmann MD, Barzilay R. ml-RECIST: Machine learning to estimate RECIST in patients with NSCLC treated with PD-(L)1 blockade. J Clin Oncol 2019. [DOI: 10.1200/jco.2019.37.15_suppl.9052] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Indexed: 11/20/2022]
Abstract
9052 Background: Real-world evidence (RWE) is increasingly important for discovery and may be an opportunity for regulatory approval. Effective use of RWE relies on determining treatment-specific outcomes, such as overall response rate (ORR) and progression-free survival (PFS), which are challenging to accurately evaluate retrospectively and at scale. We hypothesized that machine learning applied to text radiology reports from patients with NSCLC treated with PD-1 blockade could be used to train a model that estimates RECIST-defined outcomes. Methods: 2753 imaging reports from 453 patients with advanced NSCLC treated with PD-1 blockade were collected and separated into independent training (80%, n = 362) and validation (20%, n = 92) cohorts. Reports were limited to the interval of PD-1 blockade. RECIST reads performed by thoracic radiologists on all patients served as the “gold standard” to determine ORR, occurrence of progression, and date of progression. Baseline reports were compared to all follow-up reports to determine machine-learning RECIST (ml-RECIST). A four-layer neural network classification model was proposed to solve these three tasks. Results: In the training cohort, ml-RECIST best estimated ORR by RECIST (accuracy CR/PR 84%, SD 82%, POD 91%). ml-RECIST estimated PFS by RECIST, accurately predicting whether progression occurred at any time (86%) and the exact progression date (65%). Date of progression was closely correlated (Pearson’s r coefficient = 0.91, 95% CI: 0.89-0.94, p < 0.001) in patients determined to have progressed by both methods. Similar accuracy of ml-RECIST was observed in the validation cohort (accuracy CR/PR 84%, SD 80%, POD 90%; progression occurred 86%, progression date 72%). Accuracy was consistent when RECIST reads were performed prospectively as part of clinical trials vs retrospectively for standard of care treatment (e.g. CR/PR 82% vs 88%, respectively). ml-RECIST-defined response similarly determined improvement in overall survival compared to RECIST (HR = 0.19, p < 0.001 vs HR = 0.26, p < 0.001 respectively). Conclusions: Machine learning-RECIST ("ml-RECIST") accurately estimates outcomes using imaging text reports. ml-RECIST may be a tool to determine outcomes expeditiously and at scale for use in RWE studies, enabling more useful and reliable applications of large clinical databases.
Affiliation(s)
- Luu Anh Tuan
- Massachusetts Institute of Technology, Cambridge, MA
- Hira Rizvi
- Memorial Sloan Kettering Cancer Center, New York, NY
- Adam Yala
- Massachusetts Institute of Technology, Cambridge, MA