1. Coroller T, Sahiner B, Amatya A, Gossmann A, Karagiannis K, Moloney C, Samala RK, Santana-Quintero L, Solovieff N, Wang C, Amiri-Kordestani L, Cao Q, Cha KH, Charlab R, Cross FH, Hu T, Huang R, Kraft J, Krusche P, Li Y, Li Z, Mazo I, Paul R, Schnakenberg S, Serra P, Smith S, Song C, Su F, Tiwari M, Vechery C, Xiong X, Zarate JP, Zhu H, Chakravartty A, Liu Q, Ohlssen D, Petrick N, Schneider JA, Walderhaug M, Zuber E. Methodology for Good Machine Learning with Multi-Omics Data. Clin Pharmacol Ther 2024; 115:745-757. PMID: 37965805. DOI: 10.1002/cpt.3105. Received August 18, 2023; accepted October 20, 2023.
Abstract
In 2020, Novartis Pharmaceuticals Corporation and the U.S. Food and Drug Administration (FDA) began a 4-year scientific collaboration on complex new data modalities and advanced analytics. The scientific aim, pursued under a Research Collaboration Agreement, was to identify novel radio-genomics-based prognostic and predictive factors for HR+/HER2- metastatic breast cancer. The collaboration has provided valuable insights for successfully implementing future scientific projects, particularly those using artificial intelligence and machine learning. This tutorial offers tangible guidelines for a multi-omics project involving multidisciplinary expert teams spanning different institutions. We cover key ideas, such as "maintaining effective communication" and "following good data science practices," followed by the four steps of exploratory projects: (1) plan, (2) design, (3) develop, and (4) disseminate. We break each step into smaller concepts with strategies for implementation and provide illustrations from our collaboration to give readers actionable guidance.
Affiliation(s)
- Berkman Sahiner, Center for Devices and Radiological Health, U.S. Food and Drug Administration, Silver Spring, Maryland, USA
- Anup Amatya, Center for Drug Evaluation and Research, U.S. Food and Drug Administration, Silver Spring, Maryland, USA
- Alexej Gossmann, Center for Devices and Radiological Health, U.S. Food and Drug Administration, Silver Spring, Maryland, USA
- Konstantinos Karagiannis, Center for Biologics Evaluation and Research, U.S. Food and Drug Administration, Silver Spring, Maryland, USA
- Ravi K Samala, Center for Devices and Radiological Health, U.S. Food and Drug Administration, Silver Spring, Maryland, USA
- Luis Santana-Quintero, Center for Biologics Evaluation and Research, U.S. Food and Drug Administration, Silver Spring, Maryland, USA
- Nadia Solovieff, Novartis Pharmaceutical Company, East Hanover, New Jersey, USA
- Craig Wang, Novartis Pharma AG, Rotkreuz, Switzerland
- Laleh Amiri-Kordestani, Center for Drug Evaluation and Research, U.S. Food and Drug Administration, Silver Spring, Maryland, USA
- Qian Cao, Center for Devices and Radiological Health, U.S. Food and Drug Administration, Silver Spring, Maryland, USA
- Kenny H Cha, Center for Devices and Radiological Health, U.S. Food and Drug Administration, Silver Spring, Maryland, USA
- Rosane Charlab, Center for Drug Evaluation and Research, U.S. Food and Drug Administration, Silver Spring, Maryland, USA
- Frank H Cross, Center for Drug Evaluation and Research, U.S. Food and Drug Administration, Silver Spring, Maryland, USA
- Tingting Hu, Center for Devices and Radiological Health, U.S. Food and Drug Administration, Silver Spring, Maryland, USA
- Ruihao Huang, Center for Drug Evaluation and Research, U.S. Food and Drug Administration, Silver Spring, Maryland, USA
- Jeffrey Kraft, Center for Drug Evaluation and Research, U.S. Food and Drug Administration, Silver Spring, Maryland, USA
- Yutong Li, Novartis Pharmaceutical Company, East Hanover, New Jersey, USA
- Zheng Li, Novartis Pharmaceutical Company, East Hanover, New Jersey, USA
- Ilya Mazo, Center for Biologics Evaluation and Research, U.S. Food and Drug Administration, Silver Spring, Maryland, USA
- Rahul Paul, Center for Biologics Evaluation and Research, U.S. Food and Drug Administration, Silver Spring, Maryland, USA
- Paolo Serra, Novartis Pharmaceutical Company, East Hanover, New Jersey, USA
- Sean Smith, Center for Biologics Evaluation and Research, U.S. Food and Drug Administration, Silver Spring, Maryland, USA
- Chi Song, Center for Drug Evaluation and Research, U.S. Food and Drug Administration, Silver Spring, Maryland, USA
- Fei Su, Novartis Pharmaceutical Company, East Hanover, New Jersey, USA
- Mohit Tiwari, Center for Drug Evaluation and Research, U.S. Food and Drug Administration, Silver Spring, Maryland, USA
- Colin Vechery, Novartis Pharmaceutical Company, East Hanover, New Jersey, USA
- Xin Xiong, Center for Devices and Radiological Health, U.S. Food and Drug Administration, Silver Spring, Maryland, USA
- Hao Zhu, Center for Drug Evaluation and Research, U.S. Food and Drug Administration, Silver Spring, Maryland, USA
- Qi Liu, Center for Drug Evaluation and Research, U.S. Food and Drug Administration, Silver Spring, Maryland, USA
- David Ohlssen, Novartis Pharmaceutical Company, East Hanover, New Jersey, USA
- Nicholas Petrick, Center for Devices and Radiological Health, U.S. Food and Drug Administration, Silver Spring, Maryland, USA
- Julie A Schneider, Oncology Center of Excellence, U.S. Food and Drug Administration, Silver Spring, Maryland, USA
- Mark Walderhaug, Center for Biologics Evaluation and Research, U.S. Food and Drug Administration, Silver Spring, Maryland, USA
2. Drukker K, Sahiner B, Hu T, Kim GH, Whitney HM, Baughan N, Myers KJ, Giger ML, McNitt-Gray M. MIDRC-MetricTree: a decision tree-based tool for recommending performance metrics in artificial intelligence-assisted medical image analysis. J Med Imaging (Bellingham) 2024; 11:024504. PMID: 38576536. PMCID: PMC10990563. DOI: 10.1117/1.jmi.11.2.024504. Received August 30, 2023; revised February 16, 2024; accepted March 18, 2024.
Abstract
Purpose The Medical Imaging and Data Resource Center (MIDRC) was created to facilitate medical imaging machine learning (ML) research for tasks including early detection, diagnosis, prognosis, and assessment of treatment response related to the coronavirus disease 2019 pandemic and beyond. The purpose of this work was to create a publicly available metrology resource to assist researchers in evaluating the performance of their medical image analysis ML algorithms. Approach An interactive decision tree, called MIDRC-MetricTree, was developed, organized by the type of task that the ML algorithm was trained to perform. The tree was designed so that (1) users select information such as the type of task, the nature of the reference standard, and the type of algorithm output, and (2) based on this input, it recommends appropriate performance evaluation approaches and metrics, including literature references and, when possible, links to publicly available software/code as well as short tutorial videos. Results Five types of tasks were identified for the decision tree: (a) classification, (b) detection/localization, (c) segmentation, (d) time-to-event (TTE) analysis, and (e) estimation. As an example, the classification branch covers two-class (binary) and multiclass classification tasks and suggests methods, metrics, software/code, and literature references for algorithms that produce either binary or non-binary (e.g., continuous) output and for reference standards with negligible or non-negligible variability and unreliability. Conclusions The publicly available decision tree is a resource to assist researchers in conducting task-specific performance evaluations for classification, detection/localization, segmentation, TTE, and estimation tasks.
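The branch-and-recommend logic described above can be pictured as a lookup keyed on the user's selections. The following is a minimal sketch only: the task/output combinations and metric lists are illustrative placeholders, not the actual MIDRC-MetricTree content.

```python
# Hypothetical MetricTree-style lookup: (task, output type) -> suggested metrics.
METRIC_TREE = {
    ("classification", "binary"): ["sensitivity/specificity", "ROC AUC"],
    ("classification", "continuous"): ["ROC AUC", "calibration"],
    ("segmentation", "mask"): ["Dice coefficient", "Hausdorff distance"],
    ("time-to-event", "risk score"): ["concordance index (c-index)"],
    ("estimation", "continuous"): ["bias", "root-mean-square error"],
}

def recommend(task: str, output_type: str) -> list[str]:
    """Return suggested evaluation metrics for a task/output combination."""
    return METRIC_TREE.get((task, output_type), ["no recommendation found"])

print(recommend("segmentation", "mask"))
```

In the real tool, each recommendation would also carry literature references and software links; here a flat dictionary stands in for the interactive tree.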
Affiliation(s)
- Karen Drukker, University of Chicago, Department of Radiology, Chicago, Illinois, United States
- Berkman Sahiner, U.S. Food and Drug Administration, Bethesda, Maryland, United States
- Tingting Hu, U.S. Food and Drug Administration, Bethesda, Maryland, United States
- Grace Hyun Kim, University of California Los Angeles, Los Angeles, California, United States
- Heather M. Whitney, University of Chicago, Department of Radiology, Chicago, Illinois, United States
- Natalie Baughan, University of Chicago, Department of Radiology, Chicago, Illinois, United States
- Maryellen L. Giger, University of Chicago, Department of Radiology, Chicago, Illinois, United States
- Michael McNitt-Gray, University of California Los Angeles, Los Angeles, California, United States
3. Mahmood U, Shukla-Dave A, Chan HP, Drukker K, Samala RK, Chen Q, Vergara D, Greenspan H, Petrick N, Sahiner B, Huo Z, Summers RM, Cha KH, Tourassi G, Deserno TM, Grizzard KT, Näppi JJ, Yoshida H, Regge D, Mazurchuk R, Suzuki K, Morra L, Huisman H, Armato SG, Hadjiiski L. Artificial intelligence in medicine: mitigating risks and maximizing benefits via quality assurance, quality control, and acceptance testing. BJR Artif Intell 2024; 1:ubae003. PMID: 38476957. PMCID: PMC10928809. DOI: 10.1093/bjrai/ubae003. Received October 17, 2023; revised January 8, 2024; accepted January 12, 2024.
Abstract
The adoption of artificial intelligence (AI) tools in medicine poses challenges to existing clinical workflows. This commentary discusses the necessity of context-specific quality assurance (QA), emphasizing the need for robust QA measures with quality control (QC) procedures that encompass (1) acceptance testing (AT) before clinical use, (2) continuous QC monitoring, and (3) adequate user training. The discussion also covers essential components of AT and QA, illustrated with real-world examples. We also highlight what we see as the shared responsibility of manufacturers or vendors, regulators, healthcare systems, medical physicists, and clinicians to enact appropriate testing and oversight to ensure a safe and equitable transformation of medicine through AI.
Affiliation(s)
- Usman Mahmood, Department of Medical Physics, Memorial Sloan-Kettering Cancer Center, New York, NY, 10065, United States
- Amita Shukla-Dave, Departments of Medical Physics and Radiology, Memorial Sloan-Kettering Cancer Center, New York, NY, 10065, United States
- Heang-Ping Chan, Department of Radiology, University of Michigan, Ann Arbor, MI, 48109, United States
- Karen Drukker, Department of Radiology, University of Chicago, Chicago, IL, 60637, United States
- Ravi K Samala, Office of Science and Engineering Laboratories, Center for Devices and Radiological Health, U.S. Food and Drug Administration, Silver Spring, MD, 20993, United States
- Quan Chen, Department of Radiation Oncology, Mayo Clinic Arizona, Phoenix, AZ, 85054, United States
- Daniel Vergara, Department of Radiology, University of Washington, Seattle, WA, 98195, United States
- Hayit Greenspan, Biomedical Engineering and Imaging Institute, Department of Radiology, Icahn School of Medicine at Mt Sinai, New York, NY, 10029, United States
- Nicholas Petrick, Office of Science and Engineering Laboratories, Center for Devices and Radiological Health, U.S. Food and Drug Administration, Silver Spring, MD, 20993, United States
- Berkman Sahiner, Office of Science and Engineering Laboratories, Center for Devices and Radiological Health, U.S. Food and Drug Administration, Silver Spring, MD, 20993, United States
- Zhimin Huo, Tencent America, Palo Alto, CA, 94306, United States
- Ronald M Summers, Radiology and Imaging Sciences, National Institutes of Health Clinical Center, Bethesda, MD, 20892, United States
- Kenny H Cha, Office of Science and Engineering Laboratories, Center for Devices and Radiological Health, U.S. Food and Drug Administration, Silver Spring, MD, 20993, United States
- Georgia Tourassi, Computing and Computational Sciences Directorate, Oak Ridge National Laboratory, Oak Ridge, TN, 37830, United States
- Thomas M Deserno, Peter L. Reichertz Institute for Medical Informatics, TU Braunschweig and Hannover Medical School, Braunschweig, Niedersachsen, 38106, Germany
- Kevin T Grizzard, Department of Radiology and Biomedical Imaging, Yale University School of Medicine, New Haven, CT, 06510, United States
- Janne J Näppi, 3D Imaging Research, Department of Radiology, Massachusetts General Hospital and Harvard Medical School, Boston, MA, 02114, United States
- Hiroyuki Yoshida, 3D Imaging Research, Department of Radiology, Massachusetts General Hospital and Harvard Medical School, Boston, MA, 02114, United States
- Daniele Regge, Radiology Unit, Candiolo Cancer Institute, FPO-IRCCS, Candiolo, 10060, Italy; Department of Translational Research and of New Surgical and Medical Technologies, University of Pisa, Pisa, 56126, Italy
- Richard Mazurchuk, Division of Cancer Prevention, National Cancer Institute, National Institutes of Health, Bethesda, MD, 20892, United States
- Kenji Suzuki, Institute of Innovative Research, Tokyo Institute of Technology, Midori-ku, Yokohama, Kanagawa, 226-8503, Japan
- Lia Morra, Department of Control and Computer Engineering, Politecnico di Torino, Torino, Piemonte, 10129, Italy
- Henkjan Huisman, Radboud Institute for Health Sciences, Radboud University Medical Center, Nijmegen, Gelderland, 6525 GA, Netherlands
- Samuel G Armato, Department of Radiology, University of Chicago, Chicago, IL, 60637, United States
- Lubomir Hadjiiski, Department of Radiology, University of Michigan, Ann Arbor, MI, 48109, United States
4. Burgon A, Sahiner B, Petrick N, Pennello G, Cha KH, Samala RK. Decision region analysis for generalizability of artificial intelligence models: estimating model generalizability in the case of cross-reactivity and population shift. J Med Imaging (Bellingham) 2024; 11:014501. PMID: 38283653. PMCID: PMC10810180. DOI: 10.1117/1.jmi.11.1.014501. Received June 30, 2023; revised December 14, 2023; accepted December 28, 2023.
Abstract
Purpose Understanding an artificial intelligence (AI) model's ability to generalize to its target population is critical to ensuring the safe and effective usage of AI in medical devices. A traditional generalizability assessment relies on the availability of large, diverse datasets, which are difficult to obtain in many medical imaging applications. We present an approach for enhanced generalizability assessment by examining the decision space beyond the available testing data distribution. Approach Vicinal distributions of virtual samples are generated by interpolating between triplets of test images. The generated virtual samples leverage the characteristics already in the test set, increasing the sample diversity while remaining close to the AI model's data manifold. We demonstrate the generalizability assessment approach on the non-clinical tasks of classifying patient sex, race, COVID status, and age group from chest x-rays. Results Decision region composition analysis for generalizability indicated that a disproportionately large portion of the decision space belonged to a single "preferred" class for each task, despite comparable performance on the evaluation dataset. Evaluation using cross-reactivity and population shift strategies indicated a tendency to overpredict samples as belonging to the preferred class (e.g., COVID negative) for patients whose subgroup was not represented in the model development data. Conclusions An analysis of an AI model's decision space has the potential to provide insight into model generalizability. Our approach uses the analysis of composition of the decision space to obtain an improved assessment of model generalizability in the case of limited test data.
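The vicinal-distribution idea above, interpolating between triplets of test images to create virtual samples near the data manifold, can be sketched with random convex combinations. This is a simplified illustration: the Dirichlet weighting below is an assumed, illustrative choice, not necessarily the exact interpolation scheme used in the paper.

```python
import numpy as np

def virtual_samples(x1, x2, x3, n=5, seed=None):
    """Generate n virtual images inside the triangle spanned by a triplet of
    test images, using random convex combinations of the three images."""
    rng = np.random.default_rng(seed)
    w = rng.dirichlet(np.ones(3), size=n)          # (n, 3); each row sums to 1
    stack = np.stack([x1, x2, x3]).astype(float)   # (3, H, W)
    # Weighted sum over the triplet axis -> (n, H, W)
    return np.tensordot(w, stack, axes=(1, 0))
```

Because each virtual sample is a convex combination, its pixel values stay within the range spanned by the three source images, which keeps the probes close to the characteristics already present in the test set.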
Affiliation(s)
- Alexis Burgon, U.S. Food and Drug Administration, Center for Devices and Radiological Health, Office of Science and Engineering Laboratories, Division of Imaging, Diagnostics, and Software Reliability, Silver Spring, Maryland, United States
- Berkman Sahiner, U.S. Food and Drug Administration, Center for Devices and Radiological Health, Office of Science and Engineering Laboratories, Division of Imaging, Diagnostics, and Software Reliability, Silver Spring, Maryland, United States
- Nicholas Petrick, U.S. Food and Drug Administration, Center for Devices and Radiological Health, Office of Science and Engineering Laboratories, Division of Imaging, Diagnostics, and Software Reliability, Silver Spring, Maryland, United States
- Gene Pennello, U.S. Food and Drug Administration, Center for Devices and Radiological Health, Office of Science and Engineering Laboratories, Division of Imaging, Diagnostics, and Software Reliability, Silver Spring, Maryland, United States
- Kenny H. Cha, U.S. Food and Drug Administration, Center for Devices and Radiological Health, Office of Science and Engineering Laboratories, Division of Imaging, Diagnostics, and Software Reliability, Silver Spring, Maryland, United States
- Ravi K. Samala, U.S. Food and Drug Administration, Center for Devices and Radiological Health, Office of Science and Engineering Laboratories, Division of Imaging, Diagnostics, and Software Reliability, Silver Spring, Maryland, United States
5. Whitney HM, Baughan N, Myers KJ, Drukker K, Gichoya J, Bower B, Chen W, Gruszauskas N, Kalpathy-Cramer J, Koyejo S, Sá RC, Sahiner B, Zhang Z, Giger ML. Longitudinal assessment of demographic representativeness in the Medical Imaging and Data Resource Center open data commons. J Med Imaging (Bellingham) 2023; 10:061105. PMID: 37469387. PMCID: PMC10353566. DOI: 10.1117/1.jmi.10.6.061105. Received January 31, 2023; revised June 21, 2023; accepted June 23, 2023.
Abstract
Purpose The Medical Imaging and Data Resource Center (MIDRC) open data commons was launched to accelerate the development of artificial intelligence (AI) algorithms to help address the COVID-19 pandemic. The purpose of this study was to quantify the longitudinal representativeness of the demographic characteristics of the primary MIDRC dataset compared with the United States general population (US Census) and with COVID-19 positive case counts from the Centers for Disease Control and Prevention (CDC). Approach The Jensen-Shannon distance (JSD), a measure of the similarity of two distributions, was used to longitudinally measure the representativeness of the distribution of (1) all unique patients in the MIDRC data relative to the 2020 US Census and (2) all unique COVID-19 positive patients in the MIDRC data relative to the case counts reported by the CDC. The distributions were evaluated in the demographic categories of age at index, sex, race, ethnicity, and the combination of race and ethnicity. Results Representativeness of the MIDRC data by ethnicity and by the combination of race and ethnicity was impacted by the percentage of CDC case counts for which this information was not reported. The distributions by sex and race have retained their level of representativeness over time. Conclusion The representativeness of the open medical imaging datasets in the curated public data commons at MIDRC has evolved over time as the number of contributing institutions and the overall number of subjects have grown. The use of metrics such as the JSD to measure representativeness is one step needed for fair and generalizable AI algorithm development.
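The Jensen-Shannon distance used in this study can be computed directly from two discrete distributions (it is the square root of the Jensen-Shannon divergence). A minimal sketch follows; the demographic proportions are hypothetical placeholders, whereas the actual study compared MIDRC distributions to Census and CDC data.

```python
from math import log2

def jsd(p, q):
    """Jensen-Shannon distance between two discrete distributions.
    Uses base-2 logarithms, so the result lies in [0, 1]."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]  # pointwise mixture distribution

    def kl(a, b):
        # Kullback-Leibler divergence, skipping zero-probability bins
        return sum(ai * log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    return ((kl(p, m) + kl(q, m)) / 2) ** 0.5

# Hypothetical example: dataset demographic proportions vs. a reference population
census = [0.60, 0.18, 0.13, 0.09]   # illustrative reference proportions
midrc = [0.55, 0.20, 0.15, 0.10]    # illustrative dataset proportions
print(round(jsd(census, midrc), 4))
```

A JSD of 0 indicates identical distributions and 1 indicates maximal dissimilarity, which is what makes it convenient for tracking representativeness over time.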
Affiliation(s)
- Heather M. Whitney, University of Chicago, Chicago, Illinois, United States; The Medical Imaging and Data Resource Center (midrc.org)
- Natalie Baughan, University of Chicago, Chicago, Illinois, United States; The Medical Imaging and Data Resource Center (midrc.org)
- Kyle J. Myers, The Medical Imaging and Data Resource Center (midrc.org); Puente Solutions LLC, Phoenix, Arizona, United States
- Karen Drukker, University of Chicago, Chicago, Illinois, United States; The Medical Imaging and Data Resource Center (midrc.org)
- Judy Gichoya, The Medical Imaging and Data Resource Center (midrc.org); Emory University, Atlanta, Georgia, United States
- Brad Bower, The Medical Imaging and Data Resource Center (midrc.org); National Institutes of Health, Bethesda, Maryland, United States
- Weijie Chen, The Medical Imaging and Data Resource Center (midrc.org); United States Food and Drug Administration, Silver Spring, Maryland, United States
- Nicholas Gruszauskas, University of Chicago, Chicago, Illinois, United States; The Medical Imaging and Data Resource Center (midrc.org)
- Jayashree Kalpathy-Cramer, The Medical Imaging and Data Resource Center (midrc.org); University of Colorado Anschutz Medical Campus, Aurora, Colorado, United States
- Sanmi Koyejo, The Medical Imaging and Data Resource Center (midrc.org); Stanford University, Stanford, California, United States
- Rui C. Sá, The Medical Imaging and Data Resource Center (midrc.org); National Institutes of Health, Bethesda, Maryland, United States; University of California, San Diego, La Jolla, California, United States
- Berkman Sahiner, The Medical Imaging and Data Resource Center (midrc.org); United States Food and Drug Administration, Silver Spring, Maryland, United States
- Zi Zhang, The Medical Imaging and Data Resource Center (midrc.org); Jefferson Health, Philadelphia, Pennsylvania, United States
- Maryellen L. Giger, University of Chicago, Chicago, Illinois, United States; The Medical Imaging and Data Resource Center (midrc.org)
6. Drukker K, Chen W, Gichoya J, Gruszauskas N, Kalpathy-Cramer J, Koyejo S, Myers K, Sá RC, Sahiner B, Whitney H, Zhang Z, Giger M. Toward fairness in artificial intelligence for medical image analysis: identification and mitigation of potential biases in the roadmap from data collection to model deployment. J Med Imaging (Bellingham) 2023; 10:061104. PMID: 37125409. PMCID: PMC10129875. DOI: 10.1117/1.jmi.10.6.061104. Received January 30, 2023; accepted April 3, 2023.
Abstract
Purpose There is increasing interest in developing medical imaging-based machine learning methods, also known as medical imaging artificial intelligence (AI), for the detection, diagnosis, prognosis, and risk assessment of disease, with the goal of clinical implementation. These tools are intended to improve on traditional human decision-making in medical imaging. However, biases introduced in the steps toward clinical deployment may impede their intended function, potentially exacerbating inequities: medical imaging AI can propagate or amplify biases introduced in the many steps from model inception to deployment, resulting in systematic differences in the treatment of different groups. Recognizing and addressing these sources of bias is essential for algorithmic fairness and trustworthiness and contributes to a just and equitable deployment of AI in medical imaging. Approach Our multi-institutional team included medical physicists, medical imaging artificial intelligence/machine learning (AI/ML) researchers, experts in AI/ML bias, statisticians, physicians, and scientists from regulatory bodies. We identified sources of bias in AI/ML and mitigation strategies for these biases, and we developed recommendations for best practices in medical imaging AI/ML development. Results Five main steps along the roadmap of medical imaging AI/ML were identified: (1) data collection, (2) data preparation and annotation, (3) model development, (4) model evaluation, and (5) model deployment. Within these steps, or bias categories, we identified 29 sources of potential bias, many of which can impact multiple steps, as well as mitigation strategies. Conclusions Our findings provide a valuable resource to researchers, clinicians, and the public at large.
Affiliation(s)
- Karen Drukker, The University of Chicago, Department of Radiology, Chicago, Illinois, United States
- Weijie Chen, US Food and Drug Administration, Division of Imaging, Diagnostics, and Software Reliability, Office of Science and Engineering Laboratories, Center for Devices and Radiological Health, Silver Spring, Maryland, United States
- Judy Gichoya, Emory University, Department of Radiology, Atlanta, Georgia, United States
- Nicholas Gruszauskas, The University of Chicago, Department of Radiology, Chicago, Illinois, United States
- Sanmi Koyejo, Stanford University, Department of Computer Science, Stanford, California, United States
- Kyle Myers, Puente Solutions LLC, Phoenix, Arizona, United States
- Rui C. Sá, National Institutes of Health, Bethesda, Maryland, United States; University of California, San Diego, La Jolla, California, United States
- Berkman Sahiner, US Food and Drug Administration, Division of Imaging, Diagnostics, and Software Reliability, Office of Science and Engineering Laboratories, Center for Devices and Radiological Health, Silver Spring, Maryland, United States
- Heather Whitney, The University of Chicago, Department of Radiology, Chicago, Illinois, United States
- Zi Zhang, Jefferson Health, Philadelphia, Pennsylvania, United States
- Maryellen Giger, The University of Chicago, Department of Radiology, Chicago, Illinois, United States
7. Baughan N, Whitney HM, Drukker K, Sahiner B, Hu T, Kim GH, McNitt-Gray M, Myers KJ, Giger ML. Sequestration of imaging studies in MIDRC: stratified sampling to balance demographic characteristics of patients in a multi-institutional data commons. J Med Imaging (Bellingham) 2023; 10:064501. PMID: 38074627. PMCID: PMC10704184. DOI: 10.1117/1.jmi.10.6.064501. Received January 25, 2023; revised October 23, 2023; accepted October 25, 2023.
Abstract
Purpose The Medical Imaging and Data Resource Center (MIDRC) is a multi-institutional effort to accelerate medical imaging machine intelligence research and create a publicly available image repository/commons as well as a sequestered commons for performance evaluation and benchmarking of algorithms. After de-identification, approximately 80% of the medical images and associated metadata become part of the open commons and 20% are sequestered from the open commons. To ensure that both commons are representative of the population available, we introduced a stratified sampling method to balance the demographic characteristics across the two datasets. Approach Our method uses multi-dimensional stratified sampling where several demographic variables of interest are sequentially used to separate the data into individual strata, each representing a unique combination of variables. Within each resulting stratum, patients are assigned to the open or sequestered commons. This algorithm was used on an example dataset containing 5000 patients using the variables of race, age, sex at birth, ethnicity, COVID-19 status, and image modality and compared resulting demographic distributions to naïve random sampling of the dataset over 2000 independent trials. Results Resulting prevalence of each demographic variable matched the prevalence from the input dataset within one standard deviation. Mann-Whitney U test results supported the hypothesis that sequestration by stratified sampling provided more balanced subsets than naïve randomization, except for demographic subcategories with very low prevalence. Conclusions The developed multi-dimensional stratified sampling algorithm can partition a large dataset while maintaining balance across several variables, superior to the balance achieved from naïve randomization.
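The multi-dimensional stratification described above can be sketched as: form one stratum per unique combination of the demographic variables, then split each stratum 80/20 between the open and sequestered commons. The following is a simplified sketch, not the MIDRC production algorithm, and the field names are hypothetical.

```python
import random
from collections import defaultdict

def stratified_split(patients, keys, open_frac=0.8, seed=0):
    """Partition a list of patient records (dicts) into open and sequestered
    sets, stratified on the demographic variables named in `keys`."""
    # Group patients into strata: one per unique combination of key values
    strata = defaultdict(list)
    for p in patients:
        strata[tuple(p[k] for k in keys)].append(p)

    rng = random.Random(seed)
    open_set, sequestered = [], []
    for group in strata.values():
        rng.shuffle(group)                    # randomize within the stratum
        cut = round(len(group) * open_frac)   # ~80% of each stratum goes open
        open_set.extend(group[:cut])
        sequestered.extend(group[cut:])
    return open_set, sequestered
```

Because the split is applied within every stratum, the prevalence of each variable combination is preserved in both subsets, which is the property the paper compares against naïve random sampling.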
Affiliation(s)
- Natalie Baughan, University of Chicago, Department of Radiology, Chicago, Illinois, United States
- Heather M. Whitney, University of Chicago, Department of Radiology, Chicago, Illinois, United States
- Karen Drukker, University of Chicago, Department of Radiology, Chicago, Illinois, United States
- Berkman Sahiner, US Food and Drug Administration, Bethesda, Maryland, United States
- Tingting Hu, US Food and Drug Administration, Bethesda, Maryland, United States
- Grace Hyun Kim, University of California, Los Angeles, Los Angeles, California, United States
- Michael McNitt-Gray, University of California, Los Angeles, Los Angeles, California, United States
- Maryellen L. Giger, University of Chicago, Department of Radiology, Chicago, Illinois, United States
8. Sahiner B, Chen W, Samala RK, Petrick N. Data drift in medical machine learning: implications and potential remedies. Br J Radiol 2023; 96:20220878. PMID: 36971405. PMCID: PMC10546450. DOI: 10.1259/bjr.20220878. Received September 14, 2022; revised February 16, 2023; accepted February 20, 2023.
Abstract
Data drift refers to differences between the data used in training a machine learning (ML) model and that applied to the model in real-world operation. Medical ML systems can be exposed to various forms of data drift, including differences between the data sampled for training and used in clinical operation, differences between medical practices or context of use between training and clinical use, and time-related changes in patient populations, disease patterns, and data acquisition, to name a few. In this article, we first review the terminology used in ML literature related to data drift, define distinct types of drift, and discuss in detail potential causes within the context of medical applications with an emphasis on medical imaging. We then review the recent literature regarding the effects of data drift on medical ML systems, which overwhelmingly show that data drift can be a major cause for performance deterioration. We then discuss methods for monitoring data drift and mitigating its effects with an emphasis on pre- and post-deployment techniques. Some of the potential methods for drift detection and issues around model retraining when drift is detected are included. Based on our review, we find that data drift is a major concern in medical ML deployment and that more research is needed so that ML models can identify drift early, incorporate effective mitigation strategies and resist performance decay.
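One common post-deployment monitoring technique of the kind this review surveys is to compare binned input distributions between training and clinical operation. The Population Stability Index sketch below is an illustrative choice of drift statistic, and the 0.2 alert threshold is a conventional rule of thumb, not a recommendation drawn from the article.

```python
from math import log

def population_stability_index(expected, observed, eps=1e-6):
    """Population Stability Index between two binned distributions
    (bin fractions summing to ~1). Larger values indicate greater drift
    of the observed distribution from the expected one."""
    psi = 0.0
    for e, o in zip(expected, observed):
        e, o = max(e, eps), max(o, eps)  # guard against empty bins
        psi += (o - e) * log(o / e)
    return psi

baseline = [0.25, 0.25, 0.25, 0.25]  # feature bin fractions at training time
current = [0.10, 0.20, 0.30, 0.40]   # bin fractions observed in deployment
psi = population_stability_index(baseline, current)
print("drift suspected" if psi > 0.2 else "stable")
```

In practice such a statistic would be computed per input feature (or per model-output bin) on a rolling window, with alerts feeding the retraining decisions the article discusses.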
Affiliation(s)
- Berkman Sahiner: Center for Devices and Radiological Health, U.S. Food and Drug Administration, 10903 New Hampshire Avenue, Silver Spring, MD 20993-0002
- Weijie Chen: Center for Devices and Radiological Health, U.S. Food and Drug Administration, 10903 New Hampshire Avenue, Silver Spring, MD 20993-0002
- Ravi K. Samala: Center for Devices and Radiological Health, U.S. Food and Drug Administration, 10903 New Hampshire Avenue, Silver Spring, MD 20993-0002
- Nicholas Petrick: Center for Devices and Radiological Health, U.S. Food and Drug Administration, 10903 New Hampshire Avenue, Silver Spring, MD 20993-0002

9
Petrick N, Chen W, Delfino JG, Gallas BD, Kang Y, Krainak D, Sahiner B, Samala RK. Regulatory considerations for medical imaging AI/ML devices in the United States: concepts and challenges. J Med Imaging (Bellingham) 2023; 10:051804. [PMID: 37361549] [PMCID: PMC10289177] [DOI: 10.1117/1.jmi.10.5.051804]
Abstract
Purpose To introduce developers to medical device regulatory processes and data considerations in artificial intelligence and machine learning (AI/ML) device submissions and to discuss ongoing AI/ML-related regulatory challenges and activities. Approach AI/ML technologies are being used in an increasing number of medical imaging devices, and the fast evolution of these technologies presents novel regulatory challenges. We provide AI/ML developers with an introduction to U.S. Food and Drug Administration (FDA) regulatory concepts, processes, and fundamental assessments for a wide range of medical imaging AI/ML device types. Results The device type and the appropriate premarket regulatory pathway for an AI/ML device are based on the level of risk associated with the device and are informed by both its technological characteristics and its intended use. AI/ML device submissions contain a wide array of information and testing to facilitate the review process; the model description, data, nonclinical testing, and multi-reader multi-case testing are critical aspects of review for many submissions. The agency is also involved in AI/ML-related activities that support guidance document development, good machine learning practice development, AI/ML transparency, AI/ML regulatory research, and real-world performance assessment. Conclusion FDA's AI/ML regulatory and scientific efforts support the joint goals of ensuring that patients have access to safe and effective AI/ML devices over the entire device lifecycle and stimulating medical AI/ML innovation.
Affiliation(s)
- Nicholas Petrick: U.S. Food and Drug Administration, Center for Devices and Radiological Health, Office of Science and Engineering Labs, Silver Spring, Maryland, United States
- Weijie Chen: U.S. Food and Drug Administration, Center for Devices and Radiological Health, Office of Science and Engineering Labs, Silver Spring, Maryland, United States
- Jana G. Delfino: U.S. Food and Drug Administration, Center for Devices and Radiological Health, Office of Science and Engineering Labs, Silver Spring, Maryland, United States
- Brandon D. Gallas: U.S. Food and Drug Administration, Center for Devices and Radiological Health, Office of Science and Engineering Labs, Silver Spring, Maryland, United States
- Yanna Kang: U.S. Food and Drug Administration, Center for Devices and Radiological Health, Office of Product Evaluation and Quality, Silver Spring, Maryland, United States
- Daniel Krainak: U.S. Food and Drug Administration, Center for Devices and Radiological Health, Office of Product Evaluation and Quality, Silver Spring, Maryland, United States
- Berkman Sahiner: U.S. Food and Drug Administration, Center for Devices and Radiological Health, Office of Science and Engineering Labs, Silver Spring, Maryland, United States
- Ravi K. Samala: U.S. Food and Drug Administration, Center for Devices and Radiological Health, Office of Science and Engineering Labs, Silver Spring, Maryland, United States

10
Wang X, Sahiner B, Scully CG, Cha KH. AFE-GAN: Synthesizing Electrocardiograms with Atrial Fibrillation Characteristics Using Generative Adversarial Networks. Annu Int Conf IEEE Eng Med Biol Soc 2023; 2023:1-5. [PMID: 38083445] [DOI: 10.1109/embc40787.2023.10340565]
Abstract
Labeled ECG data in diseased states are relatively scarce due to various concerns, including patient privacy and low prevalence. We propose the first study of its kind that synthesizes atrial fibrillation (AF)-like ECG signals from normal ECG signals using the AFE-GAN, a generative adversarial network. Our AFE-GAN adjusts both beat morphology and rhythm variability when generating the AF-like ECG signals. Two publicly available arrhythmia detectors classified 72.4% and 77.2% of our generated signals as AF in a four-class (normal, AF, other abnormal, noisy) classification. This work shows the feasibility of synthesizing abnormal ECG signals from normal ECG signals. Clinical significance: The AF ECG signals generated with our AFE-GAN have the potential to be used as training material for health practitioners or as class-balance supplements for training automatic AF detectors.
11
Hadjiiski L, Cha K, Chan HP, Drukker K, Morra L, Näppi JJ, Sahiner B, Yoshida H, Chen Q, Deserno TM, Greenspan H, Huisman H, Huo Z, Mazurchuk R, Petrick N, Regge D, Samala R, Summers RM, Suzuki K, Tourassi G, Vergara D, Armato SG. AAPM task group report 273: Recommendations on best practices for AI and machine learning for computer-aided diagnosis in medical imaging. Med Phys 2023; 50:e1-e24. [PMID: 36565447] [DOI: 10.1002/mp.16188]
Abstract
Rapid advances in artificial intelligence (AI) and machine learning, and specifically in deep learning (DL) techniques, have enabled broad application of these methods in health care. The promise of the DL approach has spurred further interest in computer-aided diagnosis (CAD) development and applications using both "traditional" machine learning methods and newer DL-based methods. We use the term CAD-AI to refer to this expanded clinical decision support environment that uses traditional and DL-based AI methods. Numerous studies have been published to date on the development of machine learning tools for computer-aided, or AI-assisted, clinical tasks. However, most of these machine learning models are not ready for clinical deployment. It is of paramount importance to ensure that a clinical decision support tool undergoes proper training and rigorous validation of its generalizability and robustness before adoption for patient care in the clinic. To address these important issues, the American Association of Physicists in Medicine (AAPM) Computer-Aided Image Analysis Subcommittee (CADSC) is charged, in part, to develop recommendations on practices and standards for the development and performance assessment of computer-aided decision support systems. The committee has previously published two opinion papers on the evaluation of CAD systems and issues associated with user training and quality assurance of these systems in the clinic. With machine learning techniques continuing to evolve and CAD applications expanding to new stages of the patient care process, the current task group report considers the broader issues common to the development of most, if not all, CAD-AI applications and their translation from the bench to the clinic. The goal is to bring attention to the proper training and validation of machine learning algorithms that may improve their generalizability and reliability and accelerate the adoption of CAD-AI systems for clinical decision support.
Affiliation(s)
- Lubomir Hadjiiski: Department of Radiology, University of Michigan, Ann Arbor, Michigan, USA
- Kenny Cha: U.S. Food and Drug Administration, Silver Spring, Maryland, USA
- Heang-Ping Chan: Department of Radiology, University of Michigan, Ann Arbor, Michigan, USA
- Karen Drukker: Department of Radiology, University of Chicago, Chicago, Illinois, USA
- Lia Morra: Department of Control and Computer Engineering, Politecnico di Torino, Torino, Italy
- Janne J Näppi: 3D Imaging Research, Department of Radiology, Massachusetts General Hospital and Harvard Medical School, Boston, Massachusetts, USA
- Berkman Sahiner: U.S. Food and Drug Administration, Silver Spring, Maryland, USA
- Hiroyuki Yoshida: 3D Imaging Research, Department of Radiology, Massachusetts General Hospital and Harvard Medical School, Boston, Massachusetts, USA
- Quan Chen: Department of Radiation Medicine, University of Kentucky, Lexington, Kentucky, USA
- Thomas M Deserno: Peter L. Reichertz Institute for Medical Informatics of TU Braunschweig and Hannover Medical School, Braunschweig, Germany
- Hayit Greenspan: Department of Biomedical Engineering, Faculty of Engineering, Tel Aviv University, Tel Aviv, Israel; Department of Radiology, Icahn School of Medicine at Mount Sinai, New York, New York, USA
- Henkjan Huisman: Radboud Institute for Health Sciences, Radboud University Medical Center, Nijmegen, The Netherlands
- Zhimin Huo: Tencent America, Palo Alto, California, USA
- Richard Mazurchuk: Division of Cancer Prevention, National Cancer Institute, National Institutes of Health, Bethesda, Maryland, USA
- Daniele Regge: Radiology Unit, Candiolo Cancer Institute, FPO-IRCCS, Candiolo, Italy; Department of Surgical Sciences, University of Turin, Turin, Italy
- Ravi Samala: U.S. Food and Drug Administration, Silver Spring, Maryland, USA
- Ronald M Summers: Radiology and Imaging Sciences, National Institutes of Health Clinical Center, Bethesda, Maryland, USA
- Kenji Suzuki: Institute of Innovative Research, Tokyo Institute of Technology, Tokyo, Japan
- Daniel Vergara: Department of Radiology, Yale New Haven Hospital, New Haven, Connecticut, USA
- Samuel G Armato: Department of Radiology, University of Chicago, Chicago, Illinois, USA

12
Maynord M, Farhangi MM, Fermüller C, Aloimonos Y, Levine G, Petrick N, Sahiner B, Pezeshk A. Semi-supervised training using cooperative labeling of weakly annotated data for nodule detection in chest CT. Med Phys 2023. [PMID: 36630691] [DOI: 10.1002/mp.16219]
Abstract
PURPOSE Machine learning algorithms are best trained with large quantities of accurately annotated samples. While natural scene images can often be labeled relatively cheaply and at large scale, obtaining accurate annotations for medical images is both time consuming and expensive. In this study, we propose a cooperative labeling method that allows us to make use of weakly annotated medical imaging data for the training of a machine learning algorithm. As most clinically produced data are weakly-annotated - produced for use by humans rather than machines and lacking information machine learning depends upon - this approach allows us to incorporate a wider range of clinical data and thereby increase the training set size. METHODS Our pseudo-labeling method consists of multiple stages. In the first stage, a previously established network is trained using a limited number of samples with high-quality expert-produced annotations. This network is used to generate annotations for a separate larger dataset that contains only weakly annotated scans. In the second stage, by cross-checking the two types of annotations against each other, we obtain higher-fidelity annotations. In the third stage, we extract training data from the weakly annotated scans, and combine it with the fully annotated data, producing a larger training dataset. We use this larger dataset to develop a computer-aided detection (CADe) system for nodule detection in chest CT. RESULTS We evaluated the proposed approach by presenting the network with different numbers of expert-annotated scans in training and then testing the CADe using an independent expert-annotated dataset. We demonstrate that when availability of expert annotations is severely limited, the inclusion of weakly-labeled data leads to a 5% improvement in the competitive performance metric (CPM), defined as the average of sensitivities at different false-positive rates. 
CONCLUSIONS Our proposed approach can effectively merge a weakly-annotated dataset with a small, well-annotated dataset for algorithm training. This approach can help enlarge limited training data by leveraging the large amount of weakly labeled data typically generated in clinical image interpretation.
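The competitive performance metric (CPM) reported above is simply the mean sensitivity over a set of false-positives-per-scan operating points on the FROC curve. A minimal sketch (the seven operating points and the sensitivity values are illustrative assumptions, not numbers from the study):

```python
def cpm(sensitivity_at_fp, fp_rates=(0.125, 0.25, 0.5, 1, 2, 4, 8)):
    """Competitive performance metric: average sensitivity over the
    chosen false-positives-per-scan operating points of a FROC curve."""
    return sum(sensitivity_at_fp[r] for r in fp_rates) / len(fp_rates)


# Hypothetical FROC readout for a nodule CADe system.
froc = {0.125: 0.60, 0.25: 0.70, 0.5: 0.78, 1: 0.84, 2: 0.88, 4: 0.91, 8: 0.93}
print(round(cpm(froc), 3))  # 0.806
```

A "5% improvement in CPM" as in the abstract then means this average sensitivity rose by five percentage points once the weakly labeled data were added.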
Affiliation(s)
- Michael Maynord: University of Maryland, Computer Science Department, Iribe Center for Computer Science and Engineering, College Park, Maryland, USA; Division of Imaging, Diagnostics, and Software Reliability (DIDSR), OSEL, CDRH, FDA, Silver Spring, Maryland, USA
- M Mehdi Farhangi: Division of Imaging, Diagnostics, and Software Reliability (DIDSR), OSEL, CDRH, FDA, Silver Spring, Maryland, USA
- Cornelia Fermüller: University of Maryland, Institute for Advanced Computer Studies, Iribe Center for Computer Science and Engineering, College Park, Maryland, USA
- Yiannis Aloimonos: University of Maryland, Computer Science Department, Iribe Center for Computer Science and Engineering, College Park, Maryland, USA
- Gary Levine: Division of Radiological Imaging Devices and Electronic Products, CDRH, FDA, Silver Spring, Maryland, USA
- Nicholas Petrick: Division of Imaging, Diagnostics, and Software Reliability (DIDSR), OSEL, CDRH, FDA, Silver Spring, Maryland, USA
- Berkman Sahiner: Division of Imaging, Diagnostics, and Software Reliability (DIDSR), OSEL, CDRH, FDA, Silver Spring, Maryland, USA
- Aria Pezeshk: Division of Imaging, Diagnostics, and Software Reliability (DIDSR), OSEL, CDRH, FDA, Silver Spring, Maryland, USA

13
Feng J, Gossmann A, Sahiner B, Pirracchio R. Bayesian logistic regression for online recalibration and revision of risk prediction models with performance guarantees. J Am Med Inform Assoc 2022; 29:841-852. [PMID: 35022756] [DOI: 10.1093/jamia/ocab280]
Abstract
OBJECTIVE After deploying a clinical prediction model, subsequently collected data can be used to fine-tune its predictions and adapt to temporal shifts. Because model updating carries risks of over-updating/fitting, we study online methods with performance guarantees. MATERIALS AND METHODS We introduce 2 procedures for continual recalibration or revision of an underlying prediction model: Bayesian logistic regression (BLR) and a Markov variant that explicitly models distribution shifts (MarBLR). We perform empirical evaluation via simulations and a real-world study predicting Chronic Obstructive Pulmonary Disease (COPD) risk. We derive "Type I and II" regret bounds, which guarantee the procedures are noninferior to a static model and competitive with an oracle logistic reviser in terms of the average loss. RESULTS Both procedures consistently outperformed the static model and other online logistic revision methods. In simulations, the average estimated calibration index (aECI) of the original model was 0.828 (95%CI, 0.818-0.938). Online recalibration using BLR and MarBLR improved the aECI towards the ideal value of zero, attaining 0.265 (95%CI, 0.230-0.300) and 0.241 (95%CI, 0.216-0.266), respectively. When performing more extensive logistic model revisions, BLR and MarBLR increased the average area under the receiver-operating characteristic curve (aAUC) from 0.767 (95%CI, 0.765-0.769) to 0.800 (95%CI, 0.798-0.802) and 0.799 (95%CI, 0.797-0.801), respectively, in stationary settings and protected against substantial model decay. In the COPD study, BLR and MarBLR dynamically combined the original model with a continually refitted gradient boosted tree to achieve aAUCs of 0.924 (95%CI, 0.913-0.935) and 0.925 (95%CI, 0.914-0.935), compared to the static model's aAUC of 0.904 (95%CI, 0.892-0.916). DISCUSSION Despite its simplicity, BLR is highly competitive with MarBLR. MarBLR outperforms BLR when its prior better reflects the data. 
CONCLUSIONS BLR and MarBLR can improve the transportability of clinical prediction models and maintain their performance over time.
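BLR maintains a Bayesian posterior over the recalibration parameters; the sketch below shows only the simpler logistic recalibration layer it builds on, updated by plain online gradient descent (the paper's prior and regret-bound machinery are omitted, and the data stream is an illustrative assumption):

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class OnlineRecalibrator:
    """Recalibrates a model's probability p via sigmoid(a*logit(p) + b),
    updating (a, b) by online gradient descent on the log loss as labeled
    outcomes arrive. (BLR, as in the paper, additionally places a prior
    on a and b; that is omitted here for brevity.)"""

    def __init__(self, lr=0.1):
        self.a, self.b, self.lr = 1.0, 0.0, lr

    def predict(self, p):
        logit = math.log(p / (1.0 - p))
        return sigmoid(self.a * logit + self.b)

    def update(self, p, y):
        err = self.predict(p) - y  # gradient of log loss w.r.t. the logit
        logit = math.log(p / (1.0 - p))
        self.a -= self.lr * err * logit
        self.b -= self.lr * err


# An overconfident model: it outputs 0.9 but the event occurs half the time.
recal = OnlineRecalibrator()
for p, y in [(0.9, 1), (0.9, 0)] * 100:
    recal.update(p, y)
print(recal.predict(0.9))  # pulled down toward the observed event rate
```

Starting the recalibrator at the identity (a=1, b=0) mirrors the paper's noninferiority goal: before any updates it reproduces the static model exactly.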
Affiliation(s)
- Jean Feng: Department of Epidemiology and Biostatistics, University of California, San Francisco, San Francisco, California, USA
- Alexej Gossmann: Center for Devices and Radiological Health (CDRH), Food and Drug Administration, Silver Spring, Maryland, USA
- Berkman Sahiner: Center for Devices and Radiological Health (CDRH), Food and Drug Administration, Silver Spring, Maryland, USA
- Romain Pirracchio: Department of Anesthesia and Perioperative Care, University of California, San Francisco, San Francisco, California, USA

14
El Naqa I, Boone JM, Benedict SH, Goodsitt MM, Chan HP, Drukker K, Hadjiiski L, Ruan D, Sahiner B. AI in medical physics: guidelines for publication. Med Phys 2021; 48:4711-4714. [PMID: 34545957] [DOI: 10.1002/mp.15170]
Abstract
The Abstract is intended to provide a concise summary of the study and its scientific findings. For AI/ML applications in medical physics, a problem statement and rationale for utilizing these algorithms are necessary while highlighting the novelty of the approach. A brief numerical description of how the data are partitioned into subsets for training of the AI/ML algorithm, validation (including tuning of parameters), and independent testing of algorithm performance is required. This is to be followed by a summary of the results and statistical metrics that quantify the performance of the AI/ML algorithm.
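The numerical partitioning that the guideline asks authors to report can be sketched as a case-level split (the 60/20/20 fractions and fixed seed are an illustrative choice, not a recommendation from the editorial):

```python
import random


def partition(cases, train_frac=0.6, val_frac=0.2, seed=0):
    """Case-level split into training, validation (parameter tuning),
    and a sequestered test set for independent performance testing."""
    cases = list(cases)
    random.Random(seed).shuffle(cases)  # fixed seed for reproducibility
    n_train = int(train_frac * len(cases))
    n_val = int(val_frac * len(cases))
    return (cases[:n_train],
            cases[n_train:n_train + n_val],
            cases[n_train + n_val:])


train, val, test = partition(range(100))
print(len(train), len(val), len(test))  # 60 20 20
```

Splitting at the case (patient) level rather than the image level keeps correlated samples from the same patient out of both training and test sets, which is the kind of detail these guidelines ask authors to state explicitly.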
Affiliation(s)
- Issam El Naqa: Machine Learning & Radiation Oncology, Moffitt Cancer Center, 12902 Magnolia Drive, Tampa, FL 33612, USA
- John M Boone: Department of Radiology, University of California Davis Health, Sacramento, CA 95817, USA
- Stanley H Benedict: Radiation Oncology, University of California Davis Health, Sacramento, CA 95817, USA
- Mitchell M Goodsitt: Department of Radiology, University of Michigan, 1500 E Medical Center Dr, Ann Arbor, MI 48109, USA
- Heang-Ping Chan: Department of Radiology, University of Michigan, 1500 E Medical Center Dr, Ann Arbor, MI 48109, USA
- Karen Drukker: Department of Radiology, University of Chicago, 5841 S. Maryland Ave, Chicago, IL 60637, USA
- Lubomir Hadjiiski: Department of Radiology, University of Michigan, 1500 E Medical Center Dr, Ann Arbor, MI 48109, USA
- Dan Ruan: Radiation Oncology, University of California Los Angeles School of Medicine, 200 UCLA Medical Plaza, Los Angeles, CA 90095, USA
- Berkman Sahiner: Food and Drug Administration, 10903 New Hampshire Ave., Silver Spring, MD 20993, USA

15
Farhangi MM, Sahiner B, Petrick N, Pezeshk A. Automatic lung nodule detection in thoracic CT scans using dilated slice-wise convolutions. Med Phys 2021; 48:3741-3751. [PMID: 33932241] [DOI: 10.1002/mp.14915]
Abstract
PURPOSE Most state-of-the-art automated medical image analysis methods for volumetric data rely on adaptations of two-dimensional (2D) and three-dimensional (3D) convolutional neural networks (CNNs). In this paper, we develop a novel unified CNN-based model that combines the benefits of 2D and 3D networks for analyzing volumetric medical images. METHODS In our proposed framework, multiscale contextual information is first extracted from 2D slices inside a volume of interest (VOI). This is followed by dilated 1D convolutions across slices to aggregate in-plane features in a slice-wise manner and encode the information in the entire volume. Moreover, we formalize a curriculum learning strategy for a two-stage system (i.e., a system that consists of screening and false positive reduction), where the training samples are presented to the network in a meaningful order to further improve the performance. RESULTS We evaluated the proposed approach by developing a computer-aided detection (CADe) system for lung nodules. Our results on 888 CT exams demonstrate that the proposed approach can effectively analyze volumetric data by achieving a sensitivity of > 0.99 in the screening stage and a sensitivity of > 0.96 at eight false positives per case in the false positive reduction stage. CONCLUSION Our experimental results show that the proposed method provides competitive results compared to state-of-the-art 3D frameworks. In addition, we illustrate the benefits of curriculum learning strategies in two-stage systems that are of common use in medical imaging applications.
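The cross-slice aggregation step rests on dilated 1D convolutions, whose taps skip over intermediate slices to enlarge the receptive field without adding weights. A pure-Python sketch of a single valid-mode dilated convolution (the kernel weights and dilation factor are illustrative assumptions):

```python
def dilated_conv1d(signal, kernel, dilation=1):
    """Valid-mode 1D convolution whose taps are `dilation` samples apart,
    enlarging the receptive field across slices without extra weights."""
    span = (len(kernel) - 1) * dilation
    return [sum(kernel[k] * signal[i + k * dilation]
                for k in range(len(kernel)))
            for i in range(len(signal) - span)]


# One pooled feature per CT slice; each output mixes slices i, i+2, i+4.
slice_features = [1, 2, 3, 4, 5, 6, 7, 8]
print(dilated_conv1d(slice_features, [1, 1, 1], dilation=2))  # [9, 12, 15, 18]
```

Stacking such layers with increasing dilation lets in-plane features from a whole volume of interest be encoded with far fewer parameters than a full 3D convolution.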
Affiliation(s)
- M Mehdi Farhangi: Division of Imaging, Diagnostics, and Software Reliability, CDRH, U.S. Food and Drug Administration, Silver Spring, MD 20993, USA
- Berkman Sahiner: Division of Imaging, Diagnostics, and Software Reliability, CDRH, U.S. Food and Drug Administration, Silver Spring, MD 20993, USA
- Nicholas Petrick: Division of Imaging, Diagnostics, and Software Reliability, CDRH, U.S. Food and Drug Administration, Silver Spring, MD 20993, USA
- Aria Pezeshk: Division of Imaging, Diagnostics, and Software Reliability, CDRH, U.S. Food and Drug Administration, Silver Spring, MD 20993, USA

16
Petrick N, Akbar S, Cha KH, Nofech-Mozes S, Sahiner B, Gavrielides MA, Kalpathy-Cramer J, Drukker K, Martel AL. SPIE-AAPM-NCI BreastPathQ challenge: an image analysis challenge for quantitative tumor cellularity assessment in breast cancer histology images following neoadjuvant treatment. J Med Imaging (Bellingham) 2021; 8:034501. [PMID: 33987451] [PMCID: PMC8107263] [DOI: 10.1117/1.jmi.8.3.034501]
Abstract
Purpose: The breast pathology quantitative biomarkers (BreastPathQ) challenge was a grand challenge organized jointly by the International Society for Optics and Photonics (SPIE), the American Association of Physicists in Medicine (AAPM), the U.S. National Cancer Institute (NCI), and the U.S. Food and Drug Administration (FDA). The task of the BreastPathQ challenge was computerized estimation of tumor cellularity (TC) in breast cancer histology images following neoadjuvant treatment. Approach: A total of 39 teams developed, validated, and tested their TC estimation algorithms during the challenge. The training, validation, and testing sets consisted of 2394, 185, and 1119 image patches originating from 63, 6, and 27 scanned pathology slides from 33, 4, and 18 patients, respectively. The summary performance metric used for comparing and ranking algorithms was the average prediction probability concordance (PK) using scores from two pathologists as the TC reference standard. Results: Test PK performance ranged from 0.497 to 0.941 across the 100 submitted algorithms. The submitted algorithms generally performed well in estimating TC, with high-performing algorithms obtaining comparable results to the average interrater PK of 0.927 from the two pathologists providing the reference TC scores. Conclusions: The SPIE-AAPM-NCI BreastPathQ challenge was a success, indicating that artificial intelligence/machine learning algorithms may be able to approach human performance for cellularity assessment and may have some utility in clinical practice for improving efficiency and reducing reader variability. The BreastPathQ challenge can be accessed on the Grand Challenge website.
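The concordance used for ranking can be sketched as a pairwise agreement probability: for every pair of patches with different reference tumor cellularity (TC) scores, check whether the algorithm orders them the same way. A minimal version (tie handling in the challenge's exact PK definition is omitted, and the scores are illustrative assumptions):

```python
from itertools import combinations


def concordance(reference, prediction):
    """Fraction of patch pairs, distinct in reference TC, that the
    prediction ranks in the same order as the reference."""
    pairs = [(i, j) for i, j in combinations(range(len(reference)), 2)
             if reference[i] != reference[j]]
    agree = sum(1 for i, j in pairs
                if (prediction[i] - prediction[j]) * (reference[i] - reference[j]) > 0)
    return agree / len(pairs)


ref_tc = [0.1, 0.3, 0.5, 0.9]          # pathologist reference scores
algo_good = [0.2, 0.35, 0.55, 0.80]    # preserves the reference ordering
algo_poor = [0.9, 0.1, 0.7, 0.2]       # frequently inverts pairs

print(concordance(ref_tc, algo_good))  # 1.0
print(concordance(ref_tc, algo_poor))
```

The challenge's interrater PK of 0.927 was computed the same way between the two pathologists, which is why it serves as a human-performance ceiling for the submitted algorithms.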
Affiliation(s)
- Nicholas Petrick: U.S. Food and Drug Administration, Center for Devices and Radiological Health, Silver Spring, Maryland, United States
- Shazia Akbar: University of Toronto, Medical Biophysics, Toronto, Ontario, Canada; Sunnybrook Health Sciences Centre, Toronto, Ontario, Canada
- Kenny H. Cha: U.S. Food and Drug Administration, Center for Devices and Radiological Health, Silver Spring, Maryland, United States
- Sharon Nofech-Mozes: Sunnybrook Health Sciences Centre, Toronto, Ontario, Canada; University of Toronto, Department of Laboratory Medicine and Pathobiology, Toronto, Ontario, Canada
- Berkman Sahiner: U.S. Food and Drug Administration, Center for Devices and Radiological Health, Silver Spring, Maryland, United States
- Marios A. Gavrielides: U.S. Food and Drug Administration, Center for Devices and Radiological Health, Silver Spring, Maryland, United States
- Karen Drukker: University of Chicago, Department of Radiology, Chicago, Illinois, United States
- Anne L. Martel: University of Toronto, Medical Biophysics, Toronto, Ontario, Canada; Sunnybrook Health Sciences Centre, Toronto, Ontario, Canada

17
Pennello G, Sahiner B, Gossmann A, Petrick N. Discussion on "Approval policies for modifications to machine learning-based software as a medical device: A study of bio-creep" by Jean Feng, Scott Emerson, and Noah Simon. Biometrics 2020; 77:45-48. [PMID: 33040332] [DOI: 10.1111/biom.13381]
Affiliation(s)
- Gene Pennello: Division of Imaging, Diagnostics, and Software Reliability, Office of Science and Engineering Laboratories, Center for Devices and Radiological Health, Food and Drug Administration, Silver Spring, Maryland, USA
- Berkman Sahiner: Division of Imaging, Diagnostics, and Software Reliability, Office of Science and Engineering Laboratories, Center for Devices and Radiological Health, Food and Drug Administration, Silver Spring, Maryland, USA
- Alexej Gossmann: Division of Imaging, Diagnostics, and Software Reliability, Office of Science and Engineering Laboratories, Center for Devices and Radiological Health, Food and Drug Administration, Silver Spring, Maryland, USA
- Nicholas Petrick: Division of Imaging, Diagnostics, and Software Reliability, Office of Science and Engineering Laboratories, Center for Devices and Radiological Health, Food and Drug Administration, Silver Spring, Maryland, USA

18
Farhangi MM, Petrick N, Sahiner B, Frigui H, Amini AA, Pezeshk A. Recurrent attention network for false positive reduction in the detection of pulmonary nodules in thoracic CT scans. Med Phys 2020; 47:2150-2160. [DOI: 10.1002/mp.14076]
Affiliation(s)
- M. Mehdi Farhangi: Division of Imaging, Diagnostics, and Software Reliability (DIDSR), OSEL, CDRH, FDA, Silver Spring, MD 20993, USA
- Nicholas Petrick: Division of Imaging, Diagnostics, and Software Reliability (DIDSR), OSEL, CDRH, FDA, Silver Spring, MD 20993, USA
- Berkman Sahiner: Division of Imaging, Diagnostics, and Software Reliability (DIDSR), OSEL, CDRH, FDA, Silver Spring, MD 20993, USA
- Hichem Frigui: Multimedia Laboratory, University of Louisville, Louisville, KY 40292, USA
- Amir A. Amini: Medical Imaging Laboratory, University of Louisville, Louisville, KY 40292, USA
- Aria Pezeshk: Division of Imaging, Diagnostics, and Software Reliability (DIDSR), OSEL, CDRH, FDA, Silver Spring, MD 20993, USA

19
Schaffter T, Buist DSM, Lee CI, Nikulin Y, Ribli D, Guan Y, Lotter W, Jie Z, Du H, Wang S, Feng J, Feng M, Kim HE, Albiol F, Albiol A, Morrell S, Wojna Z, Ahsen ME, Asif U, Jimeno Yepes A, Yohanandan S, Rabinovici-Cohen S, Yi D, Hoff B, Yu T, Chaibub Neto E, Rubin DL, Lindholm P, Margolies LR, McBride RB, Rothstein JH, Sieh W, Ben-Ari R, Harrer S, Trister A, Friend S, Norman T, Sahiner B, Strand F, Guinney J, Stolovitzky G. Evaluation of Combined Artificial Intelligence and Radiologist Assessment to Interpret Screening Mammograms. JAMA Netw Open 2020; 3:e200265. [PMID: 32119094] [PMCID: PMC7052735] [DOI: 10.1001/jamanetworkopen.2020.0265]
Abstract
Importance: Mammography screening currently relies on subjective human interpretation. Artificial intelligence (AI) advances could be used to increase mammography screening accuracy by reducing missed cancers and false positives.
Objective: To evaluate whether AI can overcome human mammography interpretation limitations with a rigorous, unbiased evaluation of machine learning algorithms.
Design, Setting, and Participants: In this diagnostic accuracy study conducted between September 2016 and November 2017, an international, crowdsourced challenge was hosted to foster AI algorithm development focused on interpreting screening mammography. More than 1100 participants comprising 126 teams from 44 countries participated. Analysis began November 18, 2016.
Main Outcomes and Measures: Algorithms used images alone (challenge 1) or combined images, previous examinations (if available), and clinical and demographic risk factor data (challenge 2), and output a score that translated to a yes/no prediction of cancer within 12 months. Algorithm accuracy for breast cancer detection was evaluated using the area under the curve, and algorithm specificity was compared with radiologists' specificity at the radiologists' sensitivity, set at 85.9% (United States) and 83.9% (Sweden). An ensemble method aggregating top-performing AI algorithms and radiologists' recall assessments was developed and evaluated.
Results: Overall, 144 231 screening mammograms from 85 580 US women (952 cancer positive ≤12 months from screening) were used for algorithm training and validation. A second, independent validation cohort included 166 578 examinations from 68 008 Swedish women (780 cancer positive). The top-performing algorithm achieved an area under the curve of 0.858 (United States) and 0.903 (Sweden) and specificity of 66.2% (United States) and 81.2% (Sweden) at the radiologists' sensitivity, lower than community-practice radiologists' specificity of 90.5% (United States) and 98.5% (Sweden). Combining top-performing algorithms and US radiologist assessments resulted in a higher area under the curve of 0.942 and achieved a significantly improved specificity (92.0%) at the same sensitivity.
Conclusions and Relevance: While no single AI algorithm outperformed radiologists, an ensemble of AI algorithms combined with radiologist assessment in a single-reader screening environment improved overall accuracy. This study underscores the potential of using machine learning methods to enhance mammography screening interpretation.
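The evaluation above pins each algorithm to the radiologists' sensitivity and compares specificities at that operating point, then averages algorithm scores with reader recalls to form an ensemble. The following is a minimal illustrative sketch of both ideas, not the challenge's actual evaluation code; the 50/50 ensemble weighting and the toy scores are assumptions.

```python
# Sketch: compare an algorithm's specificity to radiologists' at a matched
# sensitivity, and ensemble-average an algorithm score with a reader recall.
# Illustrative only -- not the challenge's actual evaluation pipeline.

def specificity_at_sensitivity(scores_pos, scores_neg, target_sens):
    """Lower the threshold until sensitivity on positives reaches
    target_sens, then report specificity on negatives there."""
    for thr in sorted(set(scores_pos + scores_neg), reverse=True):
        sens = sum(s >= thr for s in scores_pos) / len(scores_pos)
        if sens >= target_sens:
            spec = sum(s < thr for s in scores_neg) / len(scores_neg)
            return thr, sens, spec
    return None

# Toy scores for cancer-positive and cancer-negative exams
pos = [0.9, 0.8, 0.75, 0.6, 0.3]
neg = [0.7, 0.4, 0.35, 0.2, 0.1, 0.05]
thr, sens, spec = specificity_at_sensitivity(pos, neg, target_sens=0.8)

def ensemble(alg_score, radiologist_recall, w=0.5):
    """Blend an algorithm score with a binary recall decision
    mapped onto the same 0-1 scale (weight w is a stand-in)."""
    return w * alg_score + (1 - w) * (1.0 if radiologist_recall else 0.0)
```

In this toy example the matched operating point lands at threshold 0.6 with sensitivity 0.8; the ensemble simply shifts a case's score toward 1 when the radiologist recalls it.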
Affiliation(s)
- Diana S. M. Buist, Kaiser Permanente Washington Health Research Institute, Seattle, Washington
- Dezső Ribli, Department of Physics of Complex Systems, ELTE Eötvös Loránd University, Budapest, Hungary
- Yuanfang Guan, Department of Computational Medicine and Bioinformatics, Michigan Medicine, University of Michigan, Ann Arbor
- Hao Du, National University of Singapore, Singapore
- Sijia Wang, Integrated Health Information Systems Pte Ltd, Singapore
- Jiashi Feng, Department of Electrical and Computer Engineering, National University of Singapore, Singapore
- Francisco Albiol, Instituto de Física Corpuscular (IFIC), CSIC–Universitat de València, Valencia, Spain
- Alberto Albiol, Universitat Politecnica de Valencia, Valencia, Valenciana, Spain
- Stephen Morrell, Centre for Medical Image Computing, University College London, Bloomsbury, London, United Kingdom
- Umar Asif, IBM Research Australia, Melbourne, Australia
- Darvin Yi, Stanford University, Stanford, California
- Bruce Hoff, Computational Oncology, Sage Bionetworks, Seattle, Washington
- Thomas Yu, Computational Oncology, Sage Bionetworks, Seattle, Washington
- Daniel L. Rubin, Department of Biomedical Data Science, Radiology, and Medicine (Biomedical Informatics), Stanford University, Stanford, California
- Peter Lindholm, Department of Physiology and Pharmacology, Karolinska Institutet, Stockholm, Sweden
- Laurie R. Margolies, Department of Diagnostic, Molecular and Interventional Radiology, Icahn School of Medicine at Mount Sinai, New York, New York
- Russell Bailey McBride, Department of Pathology, Molecular and Cell-Based Medicine, Icahn School of Medicine at Mount Sinai, New York, New York
- Joseph H. Rothstein, Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, New York
- Weiva Sieh, Department of Population Health Science and Policy, Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, New York
- Rami Ben-Ari, IBM Research Haifa, Haifa University Campus, Mount Carmel, Haifa, Israel
- Andrew Trister, Fred Hutchinson Cancer Research Center, Seattle, Washington
- Stephen Friend, Computational Oncology, Sage Bionetworks, Seattle, Washington
- Thea Norman, Bill and Melinda Gates Foundation, Seattle, Washington
- Berkman Sahiner, Center for Devices and Radiological Health, Food and Drug Administration, Silver Spring, Maryland
- Fredrik Strand, Department of Oncology-Pathology, Karolinska Institutet, and Breast Radiology, Karolinska University Hospital, Stockholm, Sweden
- Justin Guinney, Computational Oncology, Sage Bionetworks, Seattle, Washington
- Gustavo Stolovitzky, IBM Research, Translational Systems Biology and Nanobiotechnology, Thomas J. Watson Research Center, Yorktown Heights, New York
20
Cha KH, Petrick N, Pezeshk A, Graff CG, Sharma D, Badal A, Sahiner B. Evaluation of data augmentation via synthetic images for improved breast mass detection on mammograms using deep learning. J Med Imaging (Bellingham) 2020; 7:012703. [PMID: 31763356 PMCID: PMC6872953 DOI: 10.1117/1.jmi.7.1.012703] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Received: 06/26/2019] [Accepted: 09/04/2019] [Indexed: 01/18/2023]
Abstract
We evaluated whether using synthetic mammograms for training data augmentation may reduce the effects of overfitting and increase the performance of a deep learning algorithm for breast mass detection. Synthetic mammograms were generated using in silico procedural analytic breast and breast mass modeling algorithms, followed by simulated x-ray projections of the breast models into mammographic images. In silico breast phantoms containing masses were modeled across the four BI-RADS breast density categories, and the masses were modeled with different sizes, shapes, and margins. A Monte Carlo-based x-ray transport simulation code, MC-GPU, was used to project the three-dimensional phantoms into realistic synthetic mammograms. In total, 2000 synthetic mammograms containing 2522 masses were generated to augment the real data set during training. From the Curated Breast Imaging Subset of the Digital Database for Screening Mammography (CBIS-DDSM) data set, we used 1111 mammograms (1198 masses) for training, 120 mammograms (120 masses) for validation, and 361 mammograms (378 masses) for testing. We used a Faster R-CNN deep learning network with an ImageNet-pretrained ResNet-101 backbone. We compared the detection performance when the network was trained using different percentages of the real CBIS-DDSM training set (100%, 50%, and 25%), and when these subsets of the training set were augmented with 250, 500, 1000, and 2000 synthetic mammograms. Free-response receiver operating characteristic (FROC) analysis was performed to compare performance with and without the synthetic mammograms. We generally observed an improved test FROC curve when training with the synthetic images compared to training without them, and the amount of improvement depended on the number of real and synthetic images used in training. Our study shows that enlarging the training data with synthetic samples can increase the performance of deep learning systems.
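The experimental grid above crosses fractions of the real training set with increasing numbers of synthetic images. A minimal sketch of assembling those training mixtures follows; the image lists are string stand-ins for the actual CBIS-DDSM and synthetic mammograms, and all names are illustrative.

```python
# Sketch: build the real/synthetic training mixtures described above
# (100%/50%/25% of the real set, each augmented with 0-2000 synthetic
# images). Dataset entries are string stand-ins, not real images.
import random

def make_training_sets(real_images, synthetic_images,
                       real_fractions=(1.0, 0.5, 0.25),
                       synthetic_counts=(0, 250, 500, 1000, 2000),
                       seed=0):
    rng = random.Random(seed)
    mixtures = {}
    for frac in real_fractions:
        n_real = int(len(real_images) * frac)
        real_subset = rng.sample(real_images, n_real)
        for n_syn in synthetic_counts:
            # Prepend a fixed slice of the synthetic pool to the subset
            mixtures[(frac, n_syn)] = real_subset + synthetic_images[:n_syn]
    return mixtures

# Stand-ins for the 1111 real and 2000 synthetic training mammograms
real = [f"real_{i}" for i in range(1111)]
synthetic = [f"syn_{i}" for i in range(2000)]
mixes = make_training_sets(real, synthetic)
```

Each of the 15 resulting mixtures would then be fed to the same detector training recipe so that FROC differences can be attributed to the training-set composition.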
Affiliation(s)
- Kenny H. Cha, U.S. Food and Drug Administration, Silver Spring, Maryland, United States
- Nicholas Petrick, U.S. Food and Drug Administration, Silver Spring, Maryland, United States
- Aria Pezeshk, U.S. Food and Drug Administration, Silver Spring, Maryland, United States
- Christian G. Graff, U.S. Food and Drug Administration, Silver Spring, Maryland, United States
- Diksha Sharma, U.S. Food and Drug Administration, Silver Spring, Maryland, United States
- Andreu Badal, U.S. Food and Drug Administration, Silver Spring, Maryland, United States
- Berkman Sahiner, U.S. Food and Drug Administration, Silver Spring, Maryland, United States
21
Pezeshk A, Hamidian S, Petrick N, Sahiner B. 3-D Convolutional Neural Networks for Automatic Detection of Pulmonary Nodules in Chest CT. IEEE J Biomed Health Inform 2019; 23:2080-2090. [DOI: 10.1109/jbhi.2018.2879449] [Citation(s) in RCA: 54] [Impact Index Per Article: 10.8] [Indexed: 11/08/2022]
22
Gavrielides MA, Li Q, Zeng R, Berman BP, Sahiner B, Gong Q, Myers KJ, DeFilippo G, Petrick N. Discrimination of Pulmonary Nodule Volume Change for Low- and High-contrast Tasks in a Phantom CT Study with Low-dose Protocols. Acad Radiol 2019; 26:937-948. [PMID: 30292564 DOI: 10.1016/j.acra.2018.09.006] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 08/01/2018] [Revised: 08/30/2018] [Accepted: 09/09/2018] [Indexed: 12/20/2022]
Abstract
RATIONALE AND OBJECTIVES: The quantitative assessment of volumetric CT for discriminating small changes in nodule size has been underexamined. This phantom study examined the effect of imaging protocol, nodule size, and measurement method on volume-based change discrimination across low- and high-contrast tasks.
MATERIALS AND METHODS: Eight spherical objects ranging in diameter from 5.0 mm to 5.75 mm and 8.0 mm to 8.75 mm in 0.25 mm increments were scanned within an anthropomorphic phantom with either a foam background (high-contrast task, ∼1000 HU object-to-background difference) or a gelatin background (low-contrast task, ∼50 to 100 HU difference). Ten repeat acquisitions were collected for each protocol with varying exposures, reconstructed slice thicknesses, and reconstruction kernels. Volume measurements were obtained using a matched-filter approach (MF) and a publicly available 3D segmentation-based tool (SB). Discrimination of nodule sizes was assessed using the area under the ROC curve (AUC).
RESULTS: Using a low-dose (1.3 mGy), thin-slice (≤1.5 mm) protocol, changes of 0.25 mm in diameter were detected with AUC = 1.0 for all baseline sizes for the high-contrast task, regardless of measurement method. For the more challenging low-contrast task and the same protocol, MF detected changes of 0.25 mm from baseline sizes ≥5.25 mm and volume changes ≥9.4% with AUC ≥ 0.81, whereas corresponding results for SB were poor (AUC within 0.49-0.60). Performance for SB improved, but remained inconsistent, when exposure was increased to 4.4 mGy.
CONCLUSION: The reliable discrimination of small changes in pulmonary nodule size with low-dose, thin-slice CT protocols suitable for lung cancer screening depended on the interrelated effects of nodule-to-background contrast and measurement method.
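The AUC used above to score size-change discrimination equals the Mann-Whitney probability that a measurement of the larger nodule exceeds a measurement of the baseline nodule (ties counted as one half). A short sketch with toy repeat measurements, not the study's data:

```python
# Sketch: AUC for size-change discrimination from repeat volume
# measurements, via the Mann-Whitney / Wilcoxon rank statistic.
# AUC = P(measurement_of_larger > measurement_of_baseline),
# with ties counted as 1/2. Toy numbers only.

def auc_mann_whitney(baseline, larger):
    wins = 0.0
    for b in baseline:
        for l in larger:
            if l > b:
                wins += 1.0
            elif l == b:
                wins += 0.5
    return wins / (len(baseline) * len(larger))

# Ten repeat volume measurements (mm^3) for a baseline nodule and a
# slightly larger one -- perfectly separated, so AUC = 1.0
base = [65.1, 64.8, 65.6, 65.0, 64.9, 65.3, 65.2, 64.7, 65.4, 65.0]
bigger = [75.9, 76.3, 75.7, 76.1, 76.0, 75.8, 76.2, 76.4, 75.6, 76.0]
print(auc_mann_whitney(base, bigger))  # 1.0
```

When the repeat-measurement distributions overlap (as in the low-contrast SB results), the pairwise wins drop and the AUC falls toward 0.5.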
23
Robins M, Kalpathy-Cramer J, Obuchowski NA, Buckler A, Athelogou M, Jarecha R, Petrick N, Pezeshk A, Sahiner B, Samei E. Evaluation of Simulated Lesions as Surrogates to Clinical Lesions for Thoracic CT Volumetry: The Results of an International Challenge. Acad Radiol 2019; 26:e161-e173. [PMID: 30219290 DOI: 10.1016/j.acra.2018.07.022] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 05/03/2018] [Revised: 07/29/2018] [Accepted: 07/30/2018] [Indexed: 10/28/2022]
Abstract
RATIONALE AND OBJECTIVES: To evaluate a new approach to establishing compliance of segmentation tools with the computed tomography (CT) volumetry profile of the Quantitative Imaging Biomarker Alliance (QIBA), and to determine the statistical exchangeability between real and simulated lesions through an international challenge.
MATERIALS AND METHODS: The study used an anthropomorphic phantom with 16 embedded physical lesions and 30 patient cases from the Reference Image Database to Evaluate Therapy Response with pathologically confirmed malignancies. Hybrid datasets were generated by virtually inserting simulated lesions corresponding to the physical lesions into the phantom datasets using one projection-domain-based method (Method 1) and two image-domain insertion methods (Methods 2 and 3), and by inserting simulated lesions corresponding to real lesions into the Reference Image Database to Evaluate Therapy Response dataset (using Method 2). The volumes of the real and simulated lesions were compared based on bias (measured mean volume differences between physical and virtually inserted lesions in phantoms, as quantified by segmentation algorithms), repeatability, reproducibility, equivalence (phantom phase), and overall QIBA compliance (phantom and clinical phases).
RESULTS: For the phantom phase, three of eight groups were fully QIBA compliant, and one was marginally compliant. For the compliant groups, the estimated biases were -1.8 ± 1.4%, -2.5 ± 1.1%, -3 ± 1%, and -1.8 ± 1.5% (±95% confidence interval). No virtual insertion method showed statistical equivalence to physical insertion in bias equivalence testing using Schuirmann's two one-sided tests (±5% equivalence margin). Differences in repeatability and reproducibility across physical and simulated lesions were largely comparable (0.1%-16% and 7%-18% differences, respectively). For the clinical phase, 7 of 16 groups were QIBA compliant.
CONCLUSION: Hybrid datasets yielded conclusions similar to real CT datasets: groups that were QIBA compliant on the physical phantom data were also compliant on the hybrid datasets. Some groups were deemed compliant for the simulated insertion methods but not for the physical lesion measurements; the magnitude of this difference was small (<5.4%). While technical performance was not strictly equivalent, the two correlate well enough that volumetrically simulated lesions could potentially serve as practical proxies for clinical lesions.
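Schuirmann's two one-sided tests (TOST), used above for bias equivalence, declare equivalence only when both one-sided nulls (mean bias below -margin, mean bias above +margin) are rejected. The sketch below is a normal-approximation version using the standard library; the study's exact (t-based) test statistics and data differ, and the toy bias values are invented.

```python
# Sketch: Schuirmann's two one-sided tests (TOST) for equivalence of
# mean percent volume bias, with a +/-5% margin. Normal approximation
# via the stdlib; a t-based TOST would be used for small samples.
from math import sqrt
from statistics import NormalDist, mean, stdev

def tost_equivalent(diffs, margin=5.0, alpha=0.05):
    """diffs: per-lesion percent differences (virtual minus physical).
    Equivalent if we reject both H0: mu <= -margin and H0: mu >= +margin."""
    n = len(diffs)
    m = mean(diffs)
    se = stdev(diffs) / sqrt(n)
    z_lower = (m + margin) / se      # test against the lower bound
    z_upper = (margin - m) / se      # test against the upper bound
    p_lower = 1 - NormalDist().cdf(z_lower)
    p_upper = 1 - NormalDist().cdf(z_upper)
    return p_lower < alpha and p_upper < alpha

# Biases tightly clustered inside +/-5%: equivalence is declared
tight = [-1.2, -0.8, -1.5, -1.0, -0.9, -1.3, -1.1, -0.7, -1.4, -1.0]
# Biases centered near -6%: equivalence is not declared
wide = [-6.5, -5.8, -6.2, -7.0, -5.9, -6.4, -6.1, -6.6, -5.7, -6.3]
```

Note that TOST can fail to declare equivalence either because the mean bias is genuinely outside the margin or simply because the measurement spread is too large, which is why the abstract reports bias and precision separately.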
24
Keenan KE, Biller JR, Delfino JG, Boss MA, Does MD, Evelhoch JL, Griswold MA, Gunter JL, Hinks RS, Hoffman SW, Kim G, Lattanzi R, Li X, Marinelli L, Metzger GJ, Mukherjee P, Nordstrom RJ, Peskin AP, Perez E, Russek SE, Sahiner B, Serkova N, Shukla-Dave A, Steckner M, Stupic KF, Wilmes LJ, Wu HH, Zhang H, Jackson EF, Sullivan DC. Recommendations towards standards for quantitative MRI (qMRI) and outstanding needs. J Magn Reson Imaging 2019; 49:e26-e39. [PMID: 30680836 PMCID: PMC6663309 DOI: 10.1002/jmri.26598] [Citation(s) in RCA: 55] [Impact Index Per Article: 11.0] [Received: 09/14/2018] [Revised: 11/16/2018] [Accepted: 11/16/2018] [Indexed: 12/12/2022]
Abstract
Level of Evidence: 5. Technical Efficacy: Stage 5. J Magn Reson Imaging 2019.
Affiliation(s)
- Kathryn E Keenan, Physical Measurement Laboratory, National Institute of Standards and Technology, Boulder, Colorado, USA
- Joshua R Biller, Physical Measurement Laboratory, National Institute of Standards and Technology, Boulder, Colorado, USA
- Jana G Delfino, Center for Devices and Radiological Health, Food and Drug Administration, Silver Spring, Maryland, USA
- Michael A Boss, Physical Measurement Laboratory, National Institute of Standards and Technology, and Department of Physics, University of Colorado, Boulder, Colorado, USA
- Mark D Does, Vanderbilt University Institute of Imaging Science, Vanderbilt University, Nashville, Tennessee, USA
- Mark A Griswold, Department of Radiology, Case Western Reserve University, Cleveland, Ohio, USA
- Jeffrey L Gunter, Departments of Radiology and Information Technology, Mayo Clinic, Rochester, Minnesota, USA
- Stuart W Hoffman, Rehabilitation Research and Development Service, Department of Veterans Affairs, Washington, DC, USA
- Geena Kim, College of Computer & Information Sciences, Regis University, Denver, Colorado, USA
- Riccardo Lattanzi, Department of Radiology, New York University School of Medicine, New York, New York, USA
- Xiaojuan Li, Program of Advanced Musculoskeletal Imaging (PAMI), Cleveland Clinic, Cleveland, Ohio, USA
- Gregory J Metzger, Department of Radiology, University of Minnesota, Minneapolis, Minnesota, USA
- Pratik Mukherjee, Department of Radiology, University of California San Francisco, San Francisco, California, USA
- Adele P Peskin, Information Technology Laboratory, National Institute of Standards and Technology, Boulder, Colorado, USA
- Stephen E Russek, Physical Measurement Laboratory, National Institute of Standards and Technology, Boulder, Colorado, USA
- Berkman Sahiner, Center for Devices and Radiological Health, Food and Drug Administration, Silver Spring, Maryland, USA
- Natalie Serkova, Department of Radiology, Anschutz Medical Center, Aurora, Colorado, USA
- Amita Shukla-Dave, Departments of Medical Physics and Radiology, Memorial Sloan Kettering Cancer Center, New York, New York, USA
- Karl F Stupic, Physical Measurement Laboratory, National Institute of Standards and Technology, Boulder, Colorado, USA
- Lisa J Wilmes, Department of Radiology and Biomedical Imaging, University of California San Francisco, San Francisco, California, USA
- Holden H Wu, Department of Radiological Sciences, David Geffen School of Medicine, University of California Los Angeles, Los Angeles, California, USA
- Edward F Jackson, Department of Medical Physics, University of Wisconsin School of Medicine and Public Health, Madison, Wisconsin, USA
- Daniel C Sullivan, Department of Radiology, Duke University Medical Center, Durham, North Carolina, USA
25
Gallas BD, Chen W, Cole E, Ochs R, Petrick N, Pisano ED, Sahiner B, Samuelson FW, Myers KJ. Impact of prevalence and case distribution in lab-based diagnostic imaging studies. J Med Imaging (Bellingham) 2019; 6:015501. [PMID: 30713851 PMCID: PMC6340399 DOI: 10.1117/1.jmi.6.1.015501] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 09/10/2018] [Accepted: 12/17/2018] [Indexed: 11/14/2022]
Abstract
We investigated the effects of prevalence and case distribution on radiologist diagnostic performance, as measured by the area under the receiver operating characteristic curve (AUC) and by sensitivity and specificity, in lab-based reader studies evaluating imaging devices. Our retrospective reader studies compared full-field digital mammography (FFDM) to screen-film mammography (SFM) for women with dense breasts. Mammograms were acquired from the prospective Digital Mammographic Imaging Screening Trial. We performed five reader studies that differed in cancer prevalence and in the distribution of noncancers. Twenty radiologists participated in each reader study. Using split-plot study designs, we collected recall decisions and multilevel scores from the radiologists for calculating sensitivity, specificity, and AUC. Differences in reader-averaged AUCs slightly favored SFM over FFDM (biggest AUC difference: 0.047, SE = 0.023, p = 0.047), where the standard error accounts for reader and case variability. The differences were not significant at a level of 0.01 (0.05/5 reader studies). The differences in sensitivities and specificities were also indeterminate. Prevalence had little effect on AUC (largest difference: 0.02), whereas sensitivity increased and specificity decreased as prevalence increased. We found that AUC is robust to changes in prevalence, while radiologists were more aggressive with recall decisions as prevalence increased.
Affiliation(s)
- Brandon D. Gallas, FDA/CDRH/OSEL/Division of Imaging, Diagnostics, and Software Reliability, Silver Spring, Maryland, United States
- Weijie Chen, FDA/CDRH/OSEL/Division of Imaging, Diagnostics, and Software Reliability, Silver Spring, Maryland, United States
- Elodia Cole, Medical University of South Carolina, Charleston, South Carolina, United States
- Robert Ochs, FDA/CDRH/OIR/Division of Radiological Health, Silver Spring, Maryland, United States
- Nicholas Petrick, FDA/CDRH/OSEL/Division of Imaging, Diagnostics, and Software Reliability, Silver Spring, Maryland, United States
- Etta D. Pisano, Medical University of South Carolina, Charleston, South Carolina, United States
- Berkman Sahiner, FDA/CDRH/OSEL/Division of Imaging, Diagnostics, and Software Reliability, Silver Spring, Maryland, United States
- Frank W. Samuelson, FDA/CDRH/OSEL/Division of Imaging, Diagnostics, and Software Reliability, Silver Spring, Maryland, United States
- Kyle J. Myers, FDA/CDRH/OSEL/Division of Imaging, Diagnostics, and Software Reliability, Silver Spring, Maryland, United States
26
Sahiner B, Pezeshk A, Hadjiiski LM, Wang X, Drukker K, Cha KH, Summers RM, Giger ML. Deep learning in medical imaging and radiation therapy. Med Phys 2018; 46:e1-e36. [PMID: 30367497 DOI: 10.1002/mp.13264] [Citation(s) in RCA: 354] [Impact Index Per Article: 59.0] [Received: 01/04/2018] [Revised: 09/18/2018] [Accepted: 10/09/2018] [Indexed: 12/15/2022]
Abstract
The goals of this review paper on deep learning (DL) in medical imaging and radiation therapy are to (a) summarize what has been achieved to date; (b) identify common and unique challenges, and strategies that researchers have taken to address these challenges; and (c) identify some of the promising avenues for the future both in terms of applications as well as technical innovations. We introduce the general principles of DL and convolutional neural networks, survey five major areas of application of DL in medical imaging and radiation therapy, identify common themes, discuss methods for dataset expansion, and conclude by summarizing lessons learned, remaining challenges, and future directions.
Affiliation(s)
- Berkman Sahiner, DIDSR/OSEL/CDRH, U.S. Food and Drug Administration, Silver Spring, MD, 20993, USA
- Aria Pezeshk, DIDSR/OSEL/CDRH, U.S. Food and Drug Administration, Silver Spring, MD, 20993, USA
- Xiaosong Wang, Imaging Biomarkers and Computer-aided Diagnosis Lab, Radiology and Imaging Sciences, NIH Clinical Center, Bethesda, MD, 20892-1182, USA
- Karen Drukker, Department of Radiology, University of Chicago, Chicago, IL, 60637, USA
- Kenny H Cha, DIDSR/OSEL/CDRH, U.S. Food and Drug Administration, Silver Spring, MD, 20993, USA
- Ronald M Summers, Imaging Biomarkers and Computer-aided Diagnosis Lab, Radiology and Imaging Sciences, NIH Clinical Center, Bethesda, MD, 20892-1182, USA
- Maryellen L Giger, Department of Radiology, University of Chicago, Chicago, IL, 60637, USA
27
Ghanian Z, Pezeshk A, Petrick N, Sahiner B. Computational insertion of microcalcification clusters on mammograms: reader differentiation from native clusters and computer-aided detection comparison. J Med Imaging (Bellingham) 2018; 5:044502. [PMID: 30840741 DOI: 10.1117/1.jmi.5.4.044502] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 05/25/2018] [Accepted: 10/10/2018] [Indexed: 11/14/2022]
Abstract
Mammographic computer-aided detection (CADe) devices are typically first developed and assessed for a specific "original" acquisition system. When developers are ready to apply their CADe device to a new mammographic acquisition system, they typically assess the device with images acquired using that system. Collecting large repositories of clinical images containing verified lesion locations acquired by a system is costly and time consuming. We previously developed an image blending technique that allows users to seamlessly insert regions of interest (ROIs) from one medical image into another image. Our goal is to assess the performance of this technique for inserting microcalcification clusters from one mammogram into another, with the idea that when fully developed, our technique may be useful for reducing the clinical data burden in the assessment of a CADe device for use with an image acquisition system. We first perform a reader study to assess whether experienced observers can distinguish between computationally inserted and native clusters. For this purpose, we apply our insertion technique to 55 clinical cases. ROIs containing microcalcification clusters from one breast of a patient are inserted into the contralateral breast of the same patient. The analysis of the reader ratings using receiver operating characteristic (ROC) methodology indicates that inserted clusters cannot be reliably distinguished from native clusters (area under the ROC curve = 0.58 ± 0.04). Furthermore, CADe sensitivity is evaluated on mammograms of 68 clinical cases with native and inserted microcalcification clusters using a commercial CADe system. The average by-case sensitivities for native and inserted clusters are equal, at 85.3% (58/68). The average by-image sensitivities for native and inserted clusters are 72.3% and 67.6%, respectively, with a difference of 4.7% and a 95% confidence interval of [-2.1%, 11.6%]. These results demonstrate the potential for using inserted microcalcification clusters to assess mammographic CADe devices.
Affiliation(s)
- Zahra Ghanian, U.S. Food and Drug Administration, Center for Devices and Radiological Health, Silver Spring, Maryland, United States
- Aria Pezeshk, U.S. Food and Drug Administration, Center for Devices and Radiological Health, Silver Spring, Maryland, United States
- Nicholas Petrick, U.S. Food and Drug Administration, Center for Devices and Radiological Health, Silver Spring, Maryland, United States
- Berkman Sahiner, U.S. Food and Drug Administration, Center for Devices and Radiological Health, Silver Spring, Maryland, United States
28
Li Q, Berman BP, Hagio T, Gavrielides MA, Zeng R, Sahiner B, Gong Q, Fang Y, Liu S, Petrick N. Coronary artery calcium quantification using contrast-enhanced dual-energy computed tomography scans in comparison with unenhanced single-energy scans. Phys Med Biol 2018; 63:175006. [PMID: 30101756 PMCID: PMC6183065 DOI: 10.1088/1361-6560/aad9be] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 11/12/2022]
Abstract
Extracting coronary artery calcium (CAC) scores from contrast-enhanced computed tomography (CT) images using dual-energy (DE) based material decomposition has been shown to be feasible, mainly through patient studies. However, the quantitative performance of such DE-based CAC scores, particularly per stenosis, is underexamined due to the lack of a reference standard and repeated scans. In this work we conducted a comprehensive quantitative analysis of CAC scores obtained with DE and compared them to conventional unenhanced single-energy (SE) CT scans through phantom studies. Synthetic vessels filled with iodinated blood-mimicking material and containing calcium stenoses of different sizes and densities were scanned with a third-generation dual-source CT scanner in a chest phantom using a DE coronary CT angiography protocol at three exposure/CTDIvol settings: auto-mAs/8 mGy (automatic exposure control), 160 mAs/20 mGy, and 260 mAs/34 mGy, with 10 repeats each. As a control, a set of vessel phantoms without iodine was scanned using a standard SE CAC score protocol (3 mGy). Calcium volume, mass, and Agatston scores were estimated for each stenosis. For the DE dataset, image-based three-material decomposition was applied to remove iodine before scoring. Performance of the DE-based calcium scores was analyzed on a per-stenosis level and compared to the SE-based scores. There was excellent correlation between the DE- and SE-based scores (correlation coefficient r: 0.92-0.98). Percent bias for the calcium volume and mass scores varied as a function of stenosis size and density for both modalities. Precision (coefficient of variation) improved with larger and denser stenoses for both DE- and SE-based calcium scores. DE-based scores (20 mGy and 34 mGy) provided per-stenosis precision comparable to SE-based scores (3 mGy). Our findings suggest that on a per-stenosis level, DE-based CAC scores from contrast-enhanced CT images can achieve quantification performance comparable to conventional SE-based scores. However, DE-based CAC scoring required a higher dose than SE-based scoring to achieve high per-stenosis precision, so some caution is warranted with clinical DE-based CAC scoring.
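The Agatston score referenced above follows a standard recipe: for each axial slice of a lesion, the calcified area (voxels at or above 130 HU) is multiplied by a weight set by that slice's peak attenuation (130-199 HU → 1, 200-299 → 2, 300-399 → 3, ≥400 → 4), and the products are summed. A minimal sketch with toy numbers, not the study's scoring implementation:

```python
# Sketch: the standard Agatston score for one calcified lesion.
# Per slice: lesion area (mm^2, for voxels >= 130 HU) times a weight
# determined by the slice's peak attenuation, summed over slices.

def density_weight(peak_hu):
    """Standard Agatston density weighting by peak HU in the slice."""
    if peak_hu >= 400:
        return 4
    if peak_hu >= 300:
        return 3
    if peak_hu >= 200:
        return 2
    if peak_hu >= 130:
        return 1
    return 0  # below the 130 HU calcium threshold: no contribution

def agatston_score(slices):
    """slices: list of (area_mm2, peak_hu) per axial slice of the lesion."""
    return sum(area * density_weight(peak) for area, peak in slices)

# A small stenosis spanning three slices
lesion = [(4.0, 220), (6.5, 410), (3.0, 180)]
print(agatston_score(lesion))  # 4*2 + 6.5*4 + 3*1 = 37.0
```

In the DE workflow described above, the same scoring would be applied only after iodine has been removed by material decomposition, since enhanced blood can otherwise cross the 130 HU threshold and mimic calcium.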
Affiliation(s)
- Qin Li, US Food and Drug Administration, CDRH/OSEL/DIDSR, Silver Spring, MD, United States of America
29
Senaras C, Niazi MKK, Sahiner B, Pennell MP, Tozbikian G, Lozanski G, Gurcan MN. Optimized generation of high-resolution phantom images using cGAN: Application to quantification of Ki67 breast cancer images. PLoS One 2018; 13:e0196846. [PMID: 29742125 PMCID: PMC5942823 DOI: 10.1371/journal.pone.0196846] [Citation(s) in RCA: 31] [Impact Index Per Article: 5.2] [Received: 11/16/2017] [Accepted: 04/20/2018] [Indexed: 11/29/2022]
Abstract
In pathology, immunohistochemical (IHC) staining of tissue sections is regularly used to diagnose and grade malignant tumors. Typically, IHC stain interpretation is rendered by a trained pathologist using a manual method that consists of counting each positively and negatively stained cell under a microscope. Manual enumeration suffers from poor reproducibility even in the hands of expert pathologists. To facilitate this process, we propose a novel method to create artificial datasets with known ground truth, which allows us to analyze recall, precision, accuracy, and intra- and inter-observer variability in a systematic manner, enabling comparison of different computer analysis approaches. Our method employs a conditional generative adversarial network that uses a database of Ki67-stained tissues of breast cancer patients to generate synthetic digital slides. Our experiments show that the synthetic images are indistinguishable from real images. Six readers (three pathologists and three image analysts) attempted to differentiate 15 real from 15 synthetic images; the probability that the average reader could correctly classify an image as synthetic or real more than 50% of the time was only 44.7%.
Affiliation(s)
- Caglar Senaras, Center for Biomedical Informatics, Wake Forest School of Medicine, Winston-Salem, North Carolina, United States of America
- Muhammad Khalid Khan Niazi, Center for Biomedical Informatics, Wake Forest School of Medicine, Winston-Salem, North Carolina, United States of America
- Berkman Sahiner, Office of Science and Engineering Laboratories, Center for Devices and Radiological Health, Food and Drug Administration, Silver Spring, Maryland, United States of America
- Michael P. Pennell, Division of Biostatistics, College of Public Health, The Ohio State University, Columbus, Ohio, United States of America
- Gary Tozbikian, Department of Pathology, The Ohio State University, Columbus, Ohio, United States of America
- Gerard Lozanski, Department of Pathology, The Ohio State University, Columbus, Ohio, United States of America
- Metin N. Gurcan, Center for Biomedical Informatics, Wake Forest School of Medicine, Winston-Salem, North Carolina, United States of America
30
Abstract
Scores produced by statistical classifiers in many clinical decision support systems and other medical diagnostic devices are generally on an arbitrary scale, so the clinical meaning of these scores is unclear. Calibration of classifier scores to a meaningful scale, such as the probability of disease, is potentially useful when such scores are used by a physician. In this work, we investigated three methods (parametric, semi-parametric, and non-parametric) for calibrating classifier scores to the probability-of-disease scale and developed uncertainty estimation techniques for these methods. We showed that classifier scores on arbitrary scales can be calibrated to the probability-of-disease scale without affecting their discrimination performance. With a finite dataset to train the calibration function, it is important to accompany the probability estimate with its confidence interval. Our simulations indicate that, when the dataset used for finding the calibration transformation is also used for estimating calibration performance, resubstitution bias exists for performance metrics that involve the truth states. However, the bias is small for the parametric and semi-parametric methods when the sample size is moderate to large (>100 per class).
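A key point in the abstract is that calibration can change the scale of a score without changing its discrimination: any strictly monotone map preserves the ranking of cases, and hence the AUC. One standard parametric choice is a Platt-style logistic map; the sketch below fits such a map by plain gradient descent on toy data and is illustrative only, not necessarily the paper's exact parametric model.

```python
# Sketch: parametric (Platt-style) calibration of classifier scores to
# probability of disease via a logistic map p = sigmoid(a*score + b),
# fit by gradient descent on mean log-loss. Because the fitted map is
# monotone (a > 0 here), case ranking -- and hence AUC -- is unchanged.
from math import exp

def fit_logistic(scores, labels, lr=0.1, iters=5000):
    a, b = 0.0, 0.0
    n = len(scores)
    for _ in range(iters):
        ga = gb = 0.0
        for s, y in zip(scores, labels):
            p = 1 / (1 + exp(-(a * s + b)))
            ga += (p - y) * s / n  # gradient of mean log-loss w.r.t. a
            gb += (p - y) / n      # gradient of mean log-loss w.r.t. b
        a -= lr * ga
        b -= lr * gb
    return a, b

def calibrate(score, a, b):
    return 1 / (1 + exp(-(a * score + b)))

# Toy arbitrary-scale scores and truth states (1 = diseased)
scores = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
labels = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]
a, b = fit_logistic(scores, labels)
probs = [calibrate(s, a, b) for s in scores]
# Monotone map: the ordering of cases is preserved
assert probs == sorted(probs)
```

With a finite calibration set, the fitted (a, b) are themselves uncertain, which is why the paper pairs each calibrated probability with a confidence interval.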
Affiliation(s)
- Weijie Chen
- Office of Science and Engineering Laboratories, Center for Devices and Radiological Health, Food and Drug Administration, Silver Spring, USA
- Berkman Sahiner
- Office of Science and Engineering Laboratories, Center for Devices and Radiological Health, Food and Drug Administration, Silver Spring, USA
- Frank Samuelson
- Office of Science and Engineering Laboratories, Center for Devices and Radiological Health, Food and Drug Administration, Silver Spring, USA
- Aria Pezeshk
- Office of Science and Engineering Laboratories, Center for Devices and Radiological Health, Food and Drug Administration, Silver Spring, USA
- Nicholas Petrick
- Office of Science and Engineering Laboratories, Center for Devices and Radiological Health, Food and Drug Administration, Silver Spring, USA
31
Schoener B, Baird P, Dorn L, Giuliano KK, Ho M, Jump M, Sahiner B, Zink R. Using Data-Based Decisions to Transform Health Technology and Improve Patient Care. Biomed Instrum Technol 2018; 52:7-16. PMID: 29775385. DOI: 10.2345/0899-8205-52.s2.7.
32
Robins M, Solomon J, Sahbaee P, Sedlmair M, Roy Choudhury K, Pezeshk A, Sahiner B, Samei E. Techniques for virtual lung nodule insertion: volumetric and morphometric comparison of projection-based and image-based methods for quantitative CT. Phys Med Biol 2017; 62:7280-7299. PMID: 28786399. DOI: 10.1088/1361-6560/aa83f8.
Abstract
Virtual nodule insertion paves the way towards the development of standardized databases of hybrid CT images with known lesions. The purpose of this study was to assess three methods (one established and two newly developed techniques) for inserting virtual lung nodules into CT images. Assessment was done by comparing virtual nodule volume and shape to the CT-derived volume and shape of synthetic nodules. 24 synthetic nodules (three sizes, four morphologies, two repeats) were physically inserted into the lung cavity of an anthropomorphic chest phantom (KYOTO KAGAKU). The phantom was imaged with and without nodules on a commercial CT scanner (SOMATOM Definition Flash, Siemens) using a standard thoracic CT protocol at two dose levels (1.4 and 22 mGy CTDIvol). Raw projection data were saved and reconstructed with filtered back-projection and sinogram affirmed iterative reconstruction (SAFIRE, strength 5) at 0.6 mm slice thickness. Corresponding 3D idealized, virtual nodule models were co-registered with the CT images to determine each nodule's location and orientation. Virtual nodules were voxelized, partial-volume corrected, and inserted into nodule-free CT data (accounting for system imaging physics) using two methods: a projection-based Technique A and an image-based Technique B. A third method, Technique C, based on cropping a region of interest from the acquired image of the real nodule and blending it into the nodule-free image, was also tested. Nodule volumes were measured using a commercial segmentation tool (iNtuition, TeraRecon, Inc.) and deformation was assessed using the Hausdorff distance. Nodule volumes and deformations were compared between the idealized, CT-derived, and virtual nodules using a linear mixed effects regression model that utilized the mean, standard deviation, and coefficient of variation of the regional Hausdorff distance.
Overall, there was close concordance between the volumes of the CT-derived and virtual nodules. Percent differences between them were less than 3% for all insertion techniques and were not statistically significant in most cases. Correlation coefficient values were greater than 0.97. The deformation according to the Hausdorff distance was also similar between the CT-derived and virtual nodules, with minimal statistical significance for Techniques A, B, and C. This study shows that both projection-based and image-based nodule insertion techniques yield realistic nodule renderings with statistical similarity to the synthetic nodules with respect to nodule volume and deformation. These techniques could be used to create a database of hybrid CT images containing nodules of known size, location, and morphology.
Affiliation(s)
- Marthony Robins
- Carl E. Ravin Advanced Imaging Laboratories, Department of Radiology, Medical Physics Graduate Program, Duke University Medical Center, Durham, NC 27705, United States of America
33
Liu J, Wang D, Lu L, Wei Z, Kim L, Turkbey EB, Sahiner B, Petrick NA, Summers RM. Detection and diagnosis of colitis on computed tomography using deep convolutional neural networks. Med Phys 2017; 44:4630-4642. PMID: 28594460. DOI: 10.1002/mp.12399.
Abstract
PURPOSE: Colitis refers to inflammation of the inner lining of the colon and is frequently associated with infection and allergic reactions. In this paper, we propose deep convolutional neural network methods for lesion-level colitis detection and a support vector machine (SVM) classifier for patient-level colitis diagnosis on routine abdominal CT scans. METHODS: The recently developed Faster Region-based Convolutional Neural Network (Faster RCNN) is utilized for lesion-level colitis detection. For each 2D slice, rectangular region proposals are generated by region proposal networks (RPN). Then, each region proposal is jointly classified and refined by a softmax classifier and a bounding-box regressor. Two convolutional neural networks, the eight-layer ZF net and the 16-layer VGG net, are compared for colitis detection. Finally, for each patient, the detections on all 2D slices are collected and an SVM classifier is applied to develop a patient-level diagnosis. We trained and evaluated our method with 80 colitis patients and 80 normal cases using 4 × 4-fold cross-validation. RESULTS: For lesion-level colitis detection with ZF net, the mean of average precisions (mAP) was 48.7% and 50.9% for RCNN and Faster RCNN, respectively. The detection system achieved sensitivities of 51.4% and 54.0% at two false positives per patient for RCNN and Faster RCNN, respectively. With VGG net, Faster RCNN increased the mAP to 56.9% and the sensitivity to 58.4% at two false positives per patient. For patient-level colitis diagnosis with ZF net, the average areas under the ROC curve (AUC) were 0.978 ± 0.009 and 0.984 ± 0.008 for the RCNN and Faster RCNN methods, respectively. The difference was not statistically significant (P = 0.18). At the optimal operating point, the RCNN method correctly identified 90.4% (72.3/80) of the colitis patients and 94.0% (75.2/80) of the normal cases.
The sensitivity improved to 91.6% (73.3/80) and the specificity to 95.0% (76.0/80) for the Faster RCNN method. With VGG net, Faster RCNN increased the AUC to 0.986 ± 0.007 and the diagnosis sensitivity to 93.7% (75.0/80), while specificity was unchanged at 95.0% (76.0/80). CONCLUSION: Colitis detection and diagnosis by deep convolutional neural networks is accurate and promising for future clinical application.
Affiliation(s)
- Jiamin Liu
- Imaging Biomarkers and Computer-aided Diagnosis Laboratory and Clinical Image Processing Service, Radiology and Imaging Sciences, National Institutes of Health Clinical Center, Bethesda, MD, 20892-1182, USA
- David Wang
- Imaging Biomarkers and Computer-aided Diagnosis Laboratory and Clinical Image Processing Service, Radiology and Imaging Sciences, National Institutes of Health Clinical Center, Bethesda, MD, 20892-1182, USA
- Le Lu
- Imaging Biomarkers and Computer-aided Diagnosis Laboratory and Clinical Image Processing Service, Radiology and Imaging Sciences, National Institutes of Health Clinical Center, Bethesda, MD, 20892-1182, USA
- Zhuoshi Wei
- Imaging Biomarkers and Computer-aided Diagnosis Laboratory and Clinical Image Processing Service, Radiology and Imaging Sciences, National Institutes of Health Clinical Center, Bethesda, MD, 20892-1182, USA
- Lauren Kim
- Imaging Biomarkers and Computer-aided Diagnosis Laboratory and Clinical Image Processing Service, Radiology and Imaging Sciences, National Institutes of Health Clinical Center, Bethesda, MD, 20892-1182, USA
- Evrim B Turkbey
- Imaging Biomarkers and Computer-aided Diagnosis Laboratory and Clinical Image Processing Service, Radiology and Imaging Sciences, National Institutes of Health Clinical Center, Bethesda, MD, 20892-1182, USA
- Ronald M Summers
- Imaging Biomarkers and Computer-aided Diagnosis Laboratory and Clinical Image Processing Service, Radiology and Imaging Sciences, National Institutes of Health Clinical Center, Bethesda, MD, 20892-1182, USA
34
Abas FS, Shana’ah A, Christian B, Hasserjian R, Louissaint A, Pennell M, Sahiner B, Chen W, Niazi MKK, Lozanski G, Gurcan M. Computer-assisted quantification of CD3+ T cells in follicular lymphoma. Cytometry A 2017; 91:609-621. PMID: 28110507. PMCID: PMC10680104. DOI: 10.1002/cyto.a.23049.
Abstract
The advance of high-resolution digital scans of pathology slides has allowed the development of computer-based image analysis algorithms that may help pathologists quantify IHC stains. While very promising, these methods require further refinement before they are implemented in routine clinical settings. It is particularly critical to evaluate algorithm performance in a setting similar to current clinical practice. In this article, we present a pilot study that evaluates the use of a computerized cell quantification method in the clinical estimation of CD3-positive (CD3+) T cells in follicular lymphoma (FL). Our goal is to demonstrate the degree to which computerized quantification is comparable to the practice of estimation by a panel of expert pathologists. The computerized quantification method uses entropy-based histogram thresholding to separate brown (CD3+) and blue (CD3-) regions after a color space transformation. A panel of four board-certified hematopathologists evaluated a database of 20 FL images using two different reading methods: visual estimation and manual marking of each CD3+ cell in the images. These image data and the readings provided a reference standard and the range of variability among readers. Sensitivity and specificity of the computer's segmentation of CD3+ and CD3- cells were recorded. Across the four pathologists, mean sensitivity and specificity were 90.97% and 88.38%, respectively. The computerized quantification method agrees more with the manual cell marking than with the visual estimations. Statistical comparison between the computerized quantification method and the pathologist readings demonstrated good agreement, with correlation coefficient values of 0.81 and 0.96 in terms of Lin's concordance correlation and Spearman's correlation coefficient, respectively. These values are higher than most of those calculated among the pathologists.
In the future, the computerized quantification method may be used to investigate the relationship between the overall architectural pattern (i.e., interfollicular vs. follicular) and outcome measures (e.g., overall survival, and time to treatment). © 2017 International Society for Advancement of Cytometry.
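The entropy-based histogram thresholding step can be sketched with Kapur's classic maximum-entropy criterion applied to a single stain channel. The channel values below are synthetic, and the paper's exact color-space transformation and threshold variant may differ; this is only an illustration of the family of methods named in the abstract.

```python
import numpy as np

def kapur_threshold(channel, nbins=256):
    """Pick the threshold maximizing the summed entropies of the two classes
    (Kapur's maximum-entropy histogram thresholding)."""
    hist, edges = np.histogram(channel, bins=nbins)
    p = hist / hist.sum()
    best_t, best_h = 0, -np.inf
    for t in range(1, nbins):
        w0, w1 = p[:t].sum(), p[t:].sum()
        if w0 == 0 or w1 == 0:
            continue
        p0, p1 = p[:t] / w0, p[t:] / w1          # class-conditional histograms
        h0 = -np.sum(p0[p0 > 0] * np.log(p0[p0 > 0]))
        h1 = -np.sum(p1[p1 > 0] * np.log(p1[p1 > 0]))
        if h0 + h1 > best_h:
            best_h, best_t = h0 + h1, t
    return edges[best_t]

# Toy bimodal "stain" channel: two intensity populations (e.g. blue vs brown).
rng = np.random.default_rng(1)
channel = np.concatenate([rng.normal(60, 10, 5000), rng.normal(180, 10, 5000)])
t = kapur_threshold(channel)
assert 80 < t < 160  # the threshold falls in the valley between the two modes
```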
Affiliation(s)
- Fazly S. Abas
- Center for e-Health, Faculty of Engineering and Technology, Multimedia University, Melaka, Malaysia
- Arwa Shana’ah
- Department of Pathology, The Ohio State University, Columbus, Ohio
- Beth Christian
- Department of Internal Medicine, The Ohio State University, Columbus, Ohio
- Robert Hasserjian
- Department of Pathology, Massachusetts General Hospital, Boston, Massachusetts
- Abner Louissaint
- Department of Pathology, Massachusetts General Hospital, Boston, Massachusetts
- Michael Pennell
- Division of Biostatistics, College of Public Health, The Ohio State University, Columbus, Ohio
- Berkman Sahiner
- Center for Devices and Radiological Health, Food and Drug Administration, Silver Spring, Maryland
- Weijie Chen
- Center for Devices and Radiological Health, Food and Drug Administration, Silver Spring, Maryland
- Gerard Lozanski
- Department of Pathology, The Ohio State University, Columbus, Ohio
- Metin Gurcan
- Department of Biomedical Informatics, The Ohio State University, Columbus, Ohio
35
Abstract
The performance of a classifier is largely dependent on the size and representativeness of data used for its training. In circumstances where accumulation and/or labeling of training samples is difficult or expensive, such as medical applications, data augmentation can potentially be used to alleviate the limitations of small datasets. We have previously developed an image blending tool that allows users to modify or supplement an existing CT or mammography dataset by seamlessly inserting a lesion extracted from a source image into a target image. This tool also provides the option to apply various types of transformations to different properties of the lesion prior to its insertion into a new location. In this study, we used this tool to create synthetic samples that appear realistic in chest CT. We then augmented different size training sets with these artificial samples, and investigated the effect of the augmentation on training various classifiers for the detection of lung nodules. Our results indicate that the proposed lesion insertion method can improve classifier performance for small training datasets, and thereby help reduce the need to acquire and label actual patient data.
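As a rough illustration of lesion-insertion augmentation, the sketch below blends a lesion patch into a target image with a smooth radial alpha mask. This is a deliberately simplified stand-in for the seamless blending the abstract's tool performs; all arrays, sizes, and HU values here are invented for illustration.

```python
import numpy as np

def insert_lesion(target, lesion, center, sigma=3.0):
    """Blend a lesion patch into a target image with a smooth radial alpha
    mask (a simplified stand-in for seamless insertion)."""
    h, w = lesion.shape
    yy, xx = np.mgrid[:h, :w]
    r = np.hypot(yy - (h - 1) / 2, xx - (w - 1) / 2)
    alpha = np.exp(-(r / sigma) ** 2)          # 1 at the centre, -> 0 at edges
    out = target.copy()
    y0, x0 = center[0] - h // 2, center[1] - w // 2
    roi = out[y0:y0 + h, x0:x0 + w]
    out[y0:y0 + h, x0:x0 + w] = alpha * lesion + (1 - alpha) * roi
    return out

# Toy example: a bright "lesion" patch inserted into a noisy lung-like background.
rng = np.random.default_rng(2)
background = rng.normal(-800, 20, (64, 64))    # lung-like HU values
lesion = np.full((9, 9), 50.0)                 # soft-tissue-like HU patch
augmented = insert_lesion(background, lesion, center=(32, 32))
assert augmented[32, 32] > background[32, 32]  # centre got brighter
```

Augmentation then consists of inserting such patches at many locations, with transformed sizes and contrasts, to enlarge a small training set.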
36
Hamidian S, Sahiner B, Petrick N, Pezeshk A. 3D Convolutional Neural Network for Automatic Detection of Lung Nodules in Chest CT. Proc SPIE Int Soc Opt Eng 2017; 10134:1013409. PMID: 28845077. PMCID: PMC5568782. DOI: 10.1117/12.2255795.
Abstract
Deep convolutional neural networks (CNNs) form the backbone of many state-of-the-art computer vision systems for classification and segmentation of 2D images. The same principles and architectures can be extended to three dimensions to obtain 3D CNNs that are suitable for volumetric data such as CT scans. In this work, we train a 3D CNN for automatic detection of pulmonary nodules in chest CT images using volumes of interest extracted from the LIDC dataset. We then convert the 3D CNN which has a fixed field of view to a 3D fully convolutional network (FCN) which can generate the score map for the entire volume efficiently in a single pass. Compared to the sliding window approach for applying a CNN across the entire input volume, the FCN leads to a nearly 800-fold speed-up, and thereby fast generation of output scores for a single case. This screening FCN is used to generate difficult negative examples that are used to train a new discriminant CNN. The overall system consists of the screening FCN for fast generation of candidate regions of interest, followed by the discrimination CNN.
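The CNN-to-FCN conversion described above works because a dense layer slid across an input is mathematically a convolution, so the whole volume can be scored in a single pass instead of window by window. A minimal 1-D numpy illustration of that equivalence (not the paper's 3D network):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=100)        # input "volume" (1-D for simplicity)
w = rng.normal(size=7)          # weights of a dense layer over a 7-sample window

# Sliding-window application of the dense layer: one dot product per position.
sliding = np.array([x[i:i + 7] @ w for i in range(len(x) - 6)])

# The same result in a single pass, as a correlation
# (convolution with the flipped kernel).
fcn = np.convolve(x, w[::-1], mode="valid")

assert np.allclose(sliding, fcn)
```

In 3D the single-pass version additionally reuses overlapping intermediate feature maps across windows, which is where the reported near-800-fold speed-up comes from.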
Affiliation(s)
- Sardar Hamidian
- Department of Computer Science, George Washington University, Washington, DC
- Berkman Sahiner
- Center for Devices and Radiological Health, U.S. Food and Drug Administration, Silver Spring, MD
- Nicholas Petrick
- Center for Devices and Radiological Health, U.S. Food and Drug Administration, Silver Spring, MD
- Aria Pezeshk
- Center for Devices and Radiological Health, U.S. Food and Drug Administration, Silver Spring, MD
37
Senaras C, Pennell M, Chen W, Sahiner B, Shana'ah A, Louissaint A, Hasserjian RP, Lozanski G, Gurcan MN. FOXP3-stained image analysis for follicular lymphoma: Optimal adaptive thresholding with maximal nucleus coverage. Proc SPIE Int Soc Opt Eng 2017; 10140. PMID: 28579665. DOI: 10.1117/12.2255671.
Abstract
Immunohistochemical detection of the FOXP3 antigen is a usable marker for detection of regulatory T lymphocytes (TR) in formalin-fixed, paraffin-embedded sections of different types of tumor tissue. TR play a major role in the homeostasis of normal immune systems, where they prevent autoreactivity of the immune system towards the host. This beneficial effect of TR is frequently "hijacked" by malignant cells, which recruit tumor-infiltrating regulatory T cells to inhibit the beneficial immune response of the host against the tumor cells. In the majority of human solid tumors, an increased number of tumor-infiltrating FOXP3-positive TR is associated with worse outcome. However, in follicular lymphoma (FL) the impact of the number and distribution of TR on the outcome remains controversial. In this study, we present a novel method to detect and enumerate nuclei in FOXP3-stained images of FL biopsies. The proposed method defines a new adaptive thresholding procedure, the optimal adaptive thresholding (OAT) method, which aims to minimize under-segmented and over-segmented nuclei during coarse segmentation. Next, we integrate a parameter-free elliptical arc and line segment detector (ELSD) as additional information to refine the segmentation results and to split most of the merged nuclei. Finally, we utilize a state-of-the-art superpixel method, Simple Linear Iterative Clustering (SLIC), to split the remaining merged nuclei. Our dataset consists of 13 region-of-interest images containing 769 negative and 88 positive nuclei. Three expert pathologists evaluated the method and reported sensitivity values in detecting negative and positive nuclei ranging from 83-100% and 90-95%, and precision values of 98-100% and 99-100%, respectively. The proposed solution can be used to investigate the impact of FOXP3-positive nuclei on the outcome and prognosis in FL.
Affiliation(s)
- C Senaras
- Department of Biomedical Informatics, The Ohio State University
- M Pennell
- Division of Biostatistics, College of Public Health, The Ohio State University
- W Chen
- Office of Science and Engineering Laboratories, Center for Devices and Radiological Health, Food and Drug Administration, S
- B Sahiner
- Office of Science and Engineering Laboratories, Center for Devices and Radiological Health, Food and Drug Administration, S
- A Shana'ah
- Department of Pathology, The Ohio State University
- A Louissaint
- Department of Pathology, Massachusetts General Hospital
- G Lozanski
- Department of Pathology, The Ohio State University
- M N Gurcan
- Department of Biomedical Informatics, The Ohio State University
38
Li Q, Liu S, Myers KJ, Gavrielides MA, Zeng R, Sahiner B, Petrick N. Impact of Reconstruction Algorithms and Gender-Associated Anatomy on Coronary Calcium Scoring with CT: An Anthropomorphic Phantom Study. Acad Radiol 2016; 23:1470-1479. PMID: 27665673. PMCID: PMC5567798. DOI: 10.1016/j.acra.2016.08.014.
Abstract
RATIONALE AND OBJECTIVES: Different computed tomography imaging protocols and patient characteristics can impact the accuracy and precision of the calcium score and may lead to inconsistent patient treatment recommendations. The aim of this work was to determine the impact of reconstruction algorithm and gender characteristics on coronary artery calcium scoring based on a phantom study using computed tomography. MATERIALS AND METHODS: Four synthetic heart vessels with diameters corresponding to female and male left main and left circumflex arteries, containing calcification-mimicking materials (200-1000 HU), were inserted into a thorax phantom and scanned with and without female breast plates (female and male phantoms, respectively). Ten scans were acquired and reconstructed at 3-mm slices using filtered back-projection (FBP) and iterative reconstruction with medium and strong denoising (IR3 and IR5). Agatston and calcium volume scores were estimated for each vessel. Calcium scores for each vessel and the total calcium score (summation over all four vessels) were compared between the two phantoms to quantify the impact of the breast plates and reconstruction parameters. Calcium scores were also compared among vessels of different diameters to investigate the impact of vessel size. RESULTS: The calcium scores were significantly larger for FBP reconstruction (FBP > IR3 > IR5). Agatston scores (calcium volume scores) for vessels in the male phantom scans were on average 4.8% (2.9%), 8.2% (7.1%), and 10.5% (9.4%) higher than those in the female phantom with FBP, IR3, and IR5, respectively, when exposure was conserved across phantoms. The total calcium scores from the male phantom were significantly larger than those from the female phantom (P < 0.05). In general, calcium volume scores were underestimated (by up to about 50%) for smaller vessels, especially when scanned in the female phantom.
CONCLUSIONS: Calcium scores significantly decreased with iterative reconstruction and tended to be underestimated for female anatomy (smaller vessels and presence of breast plates).
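For reference, the Agatston score discussed above weights each calcified lesion's area by a factor set by its peak attenuation (standard published rules: 130 HU detection threshold; weights 1-4 for peaks of 130-199, 200-299, 300-399, and ≥400 HU). A per-slice sketch with hypothetical lesion data:

```python
def agatston_weight(peak_hu):
    """Density weighting factor from the plaque's maximum attenuation (HU)."""
    if peak_hu < 130:
        return 0            # below the calcification threshold
    if peak_hu < 200:
        return 1
    if peak_hu < 300:
        return 2
    if peak_hu < 400:
        return 3
    return 4

def agatston_score(lesions, pixel_area_mm2):
    """Per-slice score: sum over lesions of (area in mm^2) * density weight.
    Each lesion is (n_pixels_over_130HU, peak_HU)."""
    return sum(n * pixel_area_mm2 * agatston_weight(peak)
               for n, peak in lesions)

# Hypothetical plaques on one slice: (pixel count, peak HU).
lesions = [(10, 250), (4, 450)]
score = agatston_score(lesions, pixel_area_mm2=0.5)
# 10*0.5*2 + 4*0.5*4 = 18
assert score == 18.0
```

Because the weight jumps at hard HU cutoffs, reconstruction choices that shift peak attenuation (such as the iterative denoising compared in this study) can move a plaque across a weight boundary and change the score.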
Affiliation(s)
- Qin Li
- U.S. Food and Drug Administration, CDRH/OSEL/DIDSR, 10903 New Hampshire Ave., Bldg. 62 Rm. 4110, Silver Spring, MD 20993
- Songtao Liu
- U.S. Food and Drug Administration, CDRH/OIR/DRH, Silver Spring, Maryland
- Kyle J Myers
- U.S. Food and Drug Administration, CDRH/OSEL/DIDSR, 10903 New Hampshire Ave., Bldg. 62 Rm. 4110, Silver Spring, MD 20993
- Marios A Gavrielides
- U.S. Food and Drug Administration, CDRH/OSEL/DIDSR, 10903 New Hampshire Ave., Bldg. 62 Rm. 4110, Silver Spring, MD 20993
- Rongping Zeng
- U.S. Food and Drug Administration, CDRH/OSEL/DIDSR, 10903 New Hampshire Ave., Bldg. 62 Rm. 4110, Silver Spring, MD 20993
- Berkman Sahiner
- U.S. Food and Drug Administration, CDRH/OSEL/DIDSR, 10903 New Hampshire Ave., Bldg. 62 Rm. 4110, Silver Spring, MD 20993
- Nicholas Petrick
- U.S. Food and Drug Administration, CDRH/OSEL/DIDSR, 10903 New Hampshire Ave., Bldg. 62 Rm. 4110, Silver Spring, MD 20993
39
Li Q, Gavrielides MA, Sahiner B, Myers KJ, Zeng R, Petrick N. Statistical analysis of lung nodule volume measurements with CT in a large-scale phantom study. Med Phys 2016; 42:3932-47. PMID: 26133594. DOI: 10.1118/1.4921734.
Abstract
PURPOSE: To determine the inter-related factors that contribute substantially to the measurement error of pulmonary nodule measurements with CT by assessing a large-scale dataset of phantom scans, and to quantitatively validate the repeatability and reproducibility of a subset containing nodules and CT acquisitions consistent with the Quantitative Imaging Biomarker Alliance (QIBA) metrology recommendations. METHODS: The dataset has about 40 000 volume measurements of 48 nodules (5-20 mm; four shapes; three radiodensities) estimated by a matched-filter estimator from CT images involving 72 imaging protocols. Technical assessment was performed under a framework suggested by QIBA, which aimed to minimize the inconsistency of terminologies and techniques used in the literature. Accuracy and precision of lung nodule volume measurements were examined by analyzing the linearity, bias, variance, root mean square error (RMSE), repeatability, reproducibility, and the significant and substantial factors that contribute to the measurement error. Statistical methodologies including linear regression, analysis of variance, and restricted maximum likelihood were applied to estimate the aforementioned metrics. The analysis was performed on both the whole dataset and a subset meeting the criteria proposed in the QIBA Profile document. RESULTS: Strong linearity was observed for all data. Size, slice thickness × collimation, and randomness in attachment to vessels or the chest wall were the main sources of measurement error. Grouping the data by nodule size and slice thickness × collimation, the standard deviation (3.9%-28%) and RMSE (4.4%-68%) tended to increase with smaller nodule size and larger slice thickness. For 5, 8, 10, and 20 mm nodules with reconstruction slice thicknesses ≤0.8, 3, 3, and 5 mm, respectively, the measurements were almost unbiased (-3.0% to 3.0%). Repeatability coefficients (RCs) ranged from 6.2% to 40%.
A pitch of 0.9, a detail kernel, and smaller slice thicknesses yielded better (smaller) RCs than a pitch of 1.2, a medium kernel, and larger slice thicknesses. Exposure showed no impact on the RC. The overall reproducibility coefficient (RDC) was 45%, and it reduced to about 20%-30% when the slice thickness and collimation were fixed. For nodules and CT imaging complying with the QIBA Profile (the QIBA Profile subset), the measurements were highly repeatable and reproducible in spite of variations in nodule characteristics and imaging protocols. The overall measurement error was small and mostly due to the randomness in attachment. The bias, standard deviation, and RMSE grouped by nodule size and slice thickness × collimation in the QIBA Profile subset were within ±3%, 4%, and 5%, respectively. RCs were within 11%, and the overall RDC was equal to 11%. CONCLUSIONS: The authors have performed a comprehensive technical assessment of lung nodule volumetry with a matched-filter estimator from CT scans of synthetic nodules and identified the main sources of measurement error among various nodule characteristics and imaging parameters. The results confirm that the QIBA Profile subset is highly repeatable and reproducible. These phantom study results can serve as a bound on the clinical performance achievable with volumetric CT measurements of pulmonary nodules.
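The repeatability coefficient (RC) reported in such QIBA-style assessments is conventionally RC = 1.96 · √2 · wSD ≈ 2.77 × the within-subject standard deviation: the smallest difference between two repeat measurements unlikely to be measurement noise at the 95% level. A sketch on invented repeat measurements (not this study's data):

```python
import numpy as np

def repeatability_coefficient(measurements):
    """RC = 1.96 * sqrt(2) * within-subject SD, where the within-subject
    variance is the mean of the per-subject sample variances.
    `measurements` is shaped (n_subjects, n_replicates)."""
    wsv = np.mean(np.var(measurements, axis=1, ddof=1))
    return 1.96 * np.sqrt(2.0) * np.sqrt(wsv)

# Hypothetical repeat volume measurements for 5 nodules, 3 replicates each.
m = np.array([
    [100.0, 102.0, 101.0],
    [ 98.0,  99.0,  97.0],
    [105.0, 104.0, 106.0],
    [ 95.0,  96.0,  94.0],
    [101.0, 100.0, 102.0],
])
rc = repeatability_coefficient(m)
# Every row has a sample variance of 1, so wSD = 1 and RC = 2.77...
assert 2.7 < rc < 2.8
```

A reproducibility coefficient (RDC) is computed analogously but with the variance taken across different measurement conditions (scanners, protocols) rather than identical repeats.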
Affiliation(s)
- Qin Li
- Division of Imaging, Diagnostics and Software Reliability, Office of Science and Engineering Laboratories, Center for Devices and Radiological Health, US Food and Drug Administration, Silver Spring, Maryland 20993
- Marios A Gavrielides
- Division of Imaging, Diagnostics and Software Reliability, Office of Science and Engineering Laboratories, Center for Devices and Radiological Health, US Food and Drug Administration, Silver Spring, Maryland 20993
- Berkman Sahiner
- Division of Imaging, Diagnostics and Software Reliability, Office of Science and Engineering Laboratories, Center for Devices and Radiological Health, US Food and Drug Administration, Silver Spring, Maryland 20993
- Kyle J Myers
- Division of Imaging, Diagnostics and Software Reliability, Office of Science and Engineering Laboratories, Center for Devices and Radiological Health, US Food and Drug Administration, Silver Spring, Maryland 20993
- Rongping Zeng
- Division of Imaging, Diagnostics and Software Reliability, Office of Science and Engineering Laboratories, Center for Devices and Radiological Health, US Food and Drug Administration, Silver Spring, Maryland 20993
- Nicholas Petrick
- Division of Imaging, Diagnostics and Software Reliability, Office of Science and Engineering Laboratories, Center for Devices and Radiological Health, US Food and Drug Administration, Silver Spring, Maryland 20993
40
Gavrielides MA, Li Q, Zeng R, Myers KJ, Sahiner B, Petrick N. Volume estimation of multidensity nodules with thoracic computed tomography. J Med Imaging (Bellingham) 2016; 3:013504. PMID: 26844235. DOI: 10.1117/1.jmi.3.1.013504.
Abstract
This work focuses on volume estimation of "multidensity" lung nodules in a phantom computed tomography study. Eight objects were manufactured by enclosing spherical cores within larger spheres of double the diameter but with a different density. Different combinations of outer-shell/inner-core diameters and densities were created. The nodules were placed within an anthropomorphic phantom and scanned with various acquisition and reconstruction parameters. The volumes of the entire multidensity object as well as the inner core of the object were estimated using a model-based volume estimator. Results showed percent volume bias across all nodules and imaging protocols with slice thicknesses [Formula: see text] ranging from [Formula: see text] to 6.6% for the entire object (standard deviation ranged from 1.5% to 7.6%), and within [Formula: see text] to 5.7% for the inner-core measurement (standard deviation ranged from 2.0% to 17.7%). Overall, the estimation error was larger for the inner-core measurements, which was expected due to the smaller size of the core. Reconstructed slice thickness was found to substantially affect volumetric error for both tasks; exposure and reconstruction kernel were not. These findings provide information for understanding uncertainty in volumetry of nodules that include multiple densities such as ground glass opacities with a solid component.
Affiliation(s)
- Marios A Gavrielides
- U.S. Food and Drug Administration, Division of Imaging, Diagnostics, and Software Reliability (DIDSR), Office of Science and Engineering Laboratories, Center for Devices and Radiological Health, 10903 New Hampshire Avenue, Building 62, Room 4126, Silver Spring, Maryland 20993, United States
- Qin Li
- U.S. Food and Drug Administration, Division of Imaging, Diagnostics, and Software Reliability (DIDSR), Office of Science and Engineering Laboratories, Center for Devices and Radiological Health, 10903 New Hampshire Avenue, Building 62, Room 4126, Silver Spring, Maryland 20993, United States
- Rongping Zeng
- U.S. Food and Drug Administration, Division of Imaging, Diagnostics, and Software Reliability (DIDSR), Office of Science and Engineering Laboratories, Center for Devices and Radiological Health, 10903 New Hampshire Avenue, Building 62, Room 4126, Silver Spring, Maryland 20993, United States
- Kyle J Myers
- U.S. Food and Drug Administration, Division of Imaging, Diagnostics, and Software Reliability (DIDSR), Office of Science and Engineering Laboratories, Center for Devices and Radiological Health, 10903 New Hampshire Avenue, Building 62, Room 4126, Silver Spring, Maryland 20993, United States
- Berkman Sahiner
- U.S. Food and Drug Administration, Division of Imaging, Diagnostics, and Software Reliability (DIDSR), Office of Science and Engineering Laboratories, Center for Devices and Radiological Health, 10903 New Hampshire Avenue, Building 62, Room 4126, Silver Spring, Maryland 20993, United States
- Nicholas Petrick
- U.S. Food and Drug Administration, Division of Imaging, Diagnostics, and Software Reliability (DIDSR), Office of Science and Engineering Laboratories, Center for Devices and Radiological Health, 10903 New Hampshire Avenue, Building 62, Room 4126, Silver Spring, Maryland 20993, United States
41
Zeng R, Gavrielides MA, Petrick N, Sahiner B, Li Q, Myers KJ. Estimating local noise power spectrum from a few FBP-reconstructed CT scans. Med Phys 2016; 43:568. [DOI: 10.1118/1.4939061] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [What about the content of this article? (0)] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/07/2023] Open
42
Fauzi MFA, Pennell M, Sahiner B, Chen W, Shana'ah A, Hemminger J, Gru A, Kurt H, Losos M, Joehlin-Price A, Kavran C, Smith SM, Nowacki N, Mansor S, Lozanski G, Gurcan MN. Classification of follicular lymphoma: the effect of computer aid on pathologists grading. BMC Med Inform Decis Mak 2015; 15:115. [PMID: 26715518 PMCID: PMC4696238 DOI: 10.1186/s12911-015-0235-6] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.6] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/31/2015] [Accepted: 12/15/2015] [Indexed: 11/28/2022] Open
Abstract
Background: Follicular lymphoma (FL) is one of the most common lymphoid malignancies in the western world. FL cases are stratified into three histological grades based on the average centroblast count per high power field (HPF). The centroblast count is performed manually by the pathologist using an optical microscope and a hematoxylin and eosin (H&E) stained tissue section. Although this is the current clinical practice, it suffers from high inter- and intra-observer variability and is vulnerable to sampling bias.
Methods: In this paper, we present a system, called the Follicular Lymphoma Grading System (FLAGS), to assist the pathologist in grading FL cases. We also assess the effect of FLAGS on the accuracy of expert and inexperienced readers. FLAGS automatically identifies possible HPFs for examination by analyzing H&E and CD20 stains, before classifying them into low- or high-risk categories. The pathologist is first asked to review the slides according to the current routine clinical practice, before being presented with the FLAGS classification via a color-coded map. The accuracy of the readers with and without FLAGS assistance is measured.
Results: FLAGS was used by four experts (board-certified hematopathologists) and seven pathology residents on 20 FL slides. Access to FLAGS improved overall reader accuracy, with the biggest improvement seen among residents. An average AUC value of 0.75 was observed, which generally indicates "acceptable" diagnostic performance.
Conclusions: The results of this study show that FLAGS can be useful in increasing pathologists' accuracy in grading the tissue. To the best of our knowledge, this study measures, for the first time, the effect of computerized image analysis on pathologists' grading of follicular lymphoma. When fully developed, such systems have the potential to reduce sampling bias by examining an increased proportion of HPFs within follicle regions, as well as to reduce inter- and intra-reader variability.
Electronic supplementary material The online version of this article (doi:10.1186/s12911-015-0235-6) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Michael Pennell: Division of Biostatistics, College of Public Health, The Ohio State University, Columbus, OH, USA
- Berkman Sahiner, Weijie Chen: Center for Devices and Radiological Health, US Food and Drug Administration, Silver Spring, MD, USA
- Arwa Shana'ah, Jessica Hemminger, Alejandro Gru, Habibe Kurt, Michael Losos, Amy Joehlin-Price, Christina Kavran, Stephen M Smith, Nicholas Nowacki, Sharmeen Mansor, Gerard Lozanski: Department of Pathology, The Ohio State University, Columbus, OH, USA
- Metin N Gurcan: Department of Biomedical Informatics, The Ohio State University, 250 Lincoln Tower, 1800 Cannon Drive, Columbus, OH, 43210, USA
43
Abstract
The availability of large medical image datasets is critical in many applications, such as training and testing of computer-aided diagnosis systems, evaluation of segmentation algorithms, and conducting perceptual studies. However, collection of data and establishment of ground truth for medical images are both costly and difficult. To address this problem, we are developing an image blending tool that allows users to modify or supplement existing datasets by seamlessly inserting a lesion extracted from a source image into a target image. In this study, we focus on the application of this tool to pulmonary nodules in chest CT exams. We minimize the impact of user skill on the perceived quality of the composite image by limiting user involvement to two simple steps: the user first draws a casual boundary around a nodule in the source, and then selects the center of the desired insertion area in the target. We demonstrate the performance of our system on clinical samples, and report the results of a reader study evaluating the realism of inserted nodules compared to clinical nodules. We further evaluate our image blending techniques using phantoms simulated under different noise levels and reconstruction filters. Specifically, we compute the area under the ROC curve of the Hotelling observer (HO) and the noise power spectrum of regions of interest enclosing native and inserted nodules, and compare the detectability, noise texture, and noise magnitude of inserted and native nodules. Our results indicate the viability of our approach for insertion of pulmonary nodules in clinical CT images.
Affiliation(s)
- Aria Pezeshk: Division of Imaging, Diagnostics, and Software Reliability, Office of Science and Engineering Laboratories, Center for Devices and Radiological Health, U.S. Food and Drug Administration, Silver Spring, MD, USA
44
Li Q, Gavrielides MA, Zeng R, Myers KJ, Sahiner B, Petrick N. Volume estimation of low-contrast lesions with CT: a comparison of performances from a phantom study, simulations and theoretical analysis. Phys Med Biol 2015; 60:671-88. [PMID: 25555240 DOI: 10.1088/0031-9155/60/2/671] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022]
Abstract
Measurements of lung nodule volume with multi-detector computed tomography (MDCT) have been shown to be more accurate and precise compared to conventional lower-dimensional measurements. Quantifying the size of lesions is potentially more difficult when the object-to-background contrast is low, as with lesions in the liver. Physical phantom and simulation studies are often utilized to analyze the bias and variance of lesion size estimates because a ground truth or reference standard can be established. In addition, it may also be useful to derive theoretical bounds as another way of characterizing lesion sizing methods. The goal of this work was to study the performance of a MDCT system for a lesion volume estimation task with object-to-background contrast less than 50 HU, and to understand the relation among performances obtained from a phantom study, simulations, and theoretical analysis. We performed both phantom and simulation studies, and analyzed the bias and variance of volume measurements estimated by a matched-filter-based estimator. We further corroborated results with a theoretical analysis to estimate the achievable performance bound, namely the Cramér-Rao lower bound (CRLB) on the minimum variance of the size estimates. Results showed that estimates of non-attached solid small lesion volumes with object-to-background contrast of 31-46 HU can be accurate and precise, with less than 10.8% in percent bias and 4.8% in standard deviation of percent error (SPE), in standard dose scans. These results are consistent with the theoretical (CRLB), computational (simulation), and empirical phantom bounds. The difference between the bounds is rather small (less than 1.9% for SPE), indicating that the theoretical- and simulation-based performance bounds can be good surrogates for physical phantom studies.
Affiliation(s)
- Qin Li: Division of Imaging, Diagnostics, and Software Reliability, Office of Science and Engineering Laboratories, Center for Devices and Radiological Health, US Food and Drug Administration, Silver Spring, MD 20993, USA
45
Abstract
Receiver operating characteristic (ROC) analysis is a standard methodology to evaluate the performance of a binary classification system. The area under the ROC curve (AUC) is a performance metric that summarizes how well a classifier separates two classes. Traditional AUC optimization techniques are supervised learning methods that utilize only labeled data (i.e., the true class is known for all data) to train the classifiers. In this work, inspired by semi-supervised and transductive learning, we propose two new AUC optimization algorithms, hereafter referred to as semi-supervised learning receiver operating characteristic (SSLROC) algorithms, which utilize unlabeled test samples in classifier training to maximize AUC. Unlabeled samples are incorporated into the AUC optimization process, and their ranking relationships to labeled positive and negative training samples are considered as optimization constraints. The introduced test samples will cause the learned decision boundary in a multidimensional feature space to adapt not only to the distribution of labeled training data, but also to the distribution of unlabeled test data. We formulate the semi-supervised AUC optimization problem as a semi-definite programming problem based on the margin maximization theory. The proposed methods SSLROC1 (1-norm) and SSLROC2 (2-norm) were evaluated using 34 randomly selected datasets (the number determined by power analysis) from the University of California, Irvine machine learning repository. Wilcoxon signed rank tests showed that the proposed methods achieved significant improvement compared with state-of-the-art methods. The proposed methods were also applied to a CT colonography dataset for colonic polyp classification and showed promising results.
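As background for the pairwise-ranking view used here, the empirical AUC equals the Wilcoxon-Mann-Whitney statistic: the fraction of (positive, negative) score pairs ranked correctly, with ties counted as half. The sketch below is illustrative only (not the SSLROC implementation, which optimizes a margin-based surrogate of this same objective via semi-definite programming).

```python
def empirical_auc(pos_scores, neg_scores):
    """Empirical AUC as the fraction of correctly ordered
    (positive, negative) score pairs; ties count as half."""
    pairs = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                pairs += 1.0
            elif p == n:
                pairs += 0.5
    return pairs / (len(pos_scores) * len(neg_scores))

# Toy scores: one positive (0.4) is outranked by one negative (0.7).
auc = empirical_auc([0.9, 0.8, 0.4], [0.7, 0.3, 0.1])
```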
Affiliation(s)
- Shijun Wang, Diana Li: Imaging Biomarkers and Computer-Aided Diagnosis Lab, Radiology and Imaging Sciences, National Institutes of Health Clinical Center, Bethesda, MD 20892-1182, United States
- Nicholas Petrick, Berkman Sahiner: Center for Devices and Radiological Health, U.S. Food and Drug Administration, Silver Spring, MD 20993, United States
- Marius George Linguraru: Sheikh Zayed Institute for Pediatric Surgical Innovation, Children’s National Health System, Washington, DC 20010, United States; School of Medicine and Health Sciences, George Washington University, Washington, DC 20037, United States
- Ronald M. Summers: Imaging Biomarkers and Computer-Aided Diagnosis Lab, Radiology and Imaging Sciences, National Institutes of Health Clinical Center, Bethesda, MD 20892-1182, United States
46
He X, Samuelson F, Zeng R, Sahiner B. Discovering intrinsic properties of human observers' visual search and mathematical observers' scanning. J Opt Soc Am A Opt Image Sci Vis 2014; 31:2495-2510. [PMID: 25401363 DOI: 10.1364/josaa.31.002495] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [What about the content of this article? (0)] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/04/2023]
Abstract
There is a lack of consensus in measuring observer performance in search tasks. To pursue a consensus, we set our goal to obtain metrics that are practical, meaningful, and predictive. We consider a metric practical if it can be implemented to measure human and computer observers' performance. To be meaningful, we propose to discover intrinsic properties of search observers and formulate the metrics to characterize these properties. If the discovered properties allow verifiable predictions, we consider them predictive. We propose a theory and a conjecture toward two intrinsic properties of search observers: rationality in classification as measured by the location-known-exactly (LKE) receiver operating characteristic (ROC) curve and location uncertainty as measured by the effective set size (M*). These two properties are used to develop search models in both single-response and free-response search tasks. To confirm whether these properties are "intrinsic," we investigate their ability in predicting search performance of both human and scanning channelized Hotelling observers. In particular, for each observer, we designed experiments to measure the LKE-ROC curve and M*, which were then used to predict the same observer's performance in other search tasks. The predictions were then compared to the experimentally measured observer performance. Our results indicate that modeling the search performance using the LKE-ROC curve and M* leads to successful predictions in most cases.
47
Zeng R, Gavrielides M, Li Q, Petrick N, Sahiner B, Myers K. WE-D-18A-06: Estimating Local Noise Power Spectrum From a Few CT Scans. Med Phys 2014. [DOI: 10.1118/1.4889415] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/07/2022] Open
48
Abbey CK, Gallas BD, Boone JM, Niklason LT, Hadjiiski LM, Sahiner B, Samuelson FW. Comparative statistical properties of expected utility and area under the ROC curve for laboratory studies of observer performance in screening mammography. Acad Radiol 2014; 21:481-90. [PMID: 24594418 DOI: 10.1016/j.acra.2013.12.011] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [What about the content of this article? (0)] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/07/2013] [Revised: 12/11/2013] [Accepted: 12/11/2013] [Indexed: 11/25/2022]
Abstract
RATIONALE AND OBJECTIVES: Our objective is to determine whether expected utility (EU) and the area under the receiver operating characteristic curve (AUC) are consistent with one another as endpoints of observer performance studies in mammography. These two measures characterize receiver operating characteristic performance somewhat differently. We compare these two study endpoints at the level of individual reader effects, statistical inference, and components of variance across readers and cases.
MATERIALS AND METHODS: We reanalyze three previously published laboratory observer performance studies that investigate various x-ray breast imaging modalities using EU and AUC. The EU measure is based on recent estimates of relative utility for screening mammography.
RESULTS: The AUC and EU measures are correlated across readers for individual modalities (r = 0.93) and differences in modalities (r = 0.94 to 0.98). Statistical inference for modality effects based on multi-reader multi-case analysis is very similar, with significant results (P < .05) in exactly the same conditions. Power analyses show mixed results across studies, with a small increase in power on average for EU that corresponds to approximately a 7% reduction in the number of readers. Despite a large number of crossing receiver operating characteristic curves (59% of readers), modality effects only rarely have opposite signs for EU and AUC (6%).
CONCLUSIONS: We do not find any evidence of systematic differences between EU and AUC in screening mammography observer studies. Thus, when utility approaches are viable (i.e., an appropriate value of relative utility exists), practical effects such as statistical efficiency may be used to choose study endpoints.
49
Petrick N, Sahiner B, Armato SG, Bert A, Correale L, Delsanto S, Freedman MT, Fryd D, Gur D, Hadjiiski L, Huo Z, Jiang Y, Morra L, Paquerault S, Raykar V, Samuelson F, Summers RM, Tourassi G, Yoshida H, Zheng B, Zhou C, Chan HP. Evaluation of computer-aided detection and diagnosis systems. Med Phys 2014; 40:087001. [PMID: 23927365 DOI: 10.1118/1.4816310] [Citation(s) in RCA: 60] [Impact Index Per Article: 6.0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/09/2023] Open
Abstract
Computer-aided detection and diagnosis (CAD) systems are increasingly being used as an aid by clinicians for detection and interpretation of diseases. Computer-aided detection systems mark regions of an image that may reveal specific abnormalities and are used to alert clinicians to these regions during image interpretation. Computer-aided diagnosis systems provide an assessment of a disease using image-based information alone or in combination with other relevant diagnostic data and are used by clinicians as a decision support in developing their diagnoses. While CAD systems are commercially available, standardized approaches for evaluating and reporting their performance have not yet been fully formalized in the literature or in a standardization effort. This deficiency has led to difficulty in the comparison of CAD devices and in understanding how the reported performance might translate into clinical practice. To address these important issues, the American Association of Physicists in Medicine (AAPM) formed the Computer Aided Detection in Diagnostic Imaging Subcommittee (CADSC), in part, to develop recommendations on approaches for assessing CAD system performance. The purpose of this paper is to convey the opinions of the AAPM CADSC members and to stimulate the development of consensus approaches and "best practices" for evaluating CAD systems. Both the assessment of a standalone CAD system and the evaluation of the impact of CAD on end-users are discussed. It is hoped that awareness of these important evaluation elements and the CADSC recommendations will lead to further development of structured guidelines for CAD performance assessment. 
Proper assessment of CAD system performance is expected to increase the understanding of a CAD system's effectiveness and limitations, which is expected to stimulate further research and development efforts on CAD technologies, reduce problems due to improper use, and eventually improve the utility and efficacy of CAD in clinical practice.
Affiliation(s)
- Nicholas Petrick: Center for Devices and Radiological Health, U.S. Food and Drug Administration, 10903 New Hampshire Avenue, Silver Spring, Maryland 20993, USA
50
Abstract
The goal of this work is to design computerized image analysis techniques for automatically characterizing lung nodule subtlety in CT images. Automated subtlety estimation methods may help in computer-aided detection (CAD) assessment by quantifying dataset difficulty and facilitating comparisons among different CAD algorithms. A dataset containing 813 nodules from 499 patients was obtained from the Lung Image Database Consortium. Each nodule was evaluated by four radiologists regarding nodule subtlety using a 5-point rating scale (1: most subtle). We developed a 3D technique for segmenting lung nodules using a prespecified initial ROI. Texture and morphological features were automatically extracted from the segmented nodules and their margins. The dataset was partitioned into training and test sets using a 1:1 ratio. An artificial neural network (ANN) was trained with average reader subtlety scores as the reference. Effective features for characterizing nodule subtlety were selected based on the training set using the ANN and a stepwise feature selection method. The performance of the classifier was evaluated using prediction probability (PK) as an agreement measure, which is considered a generalization of the area under the receiver operating characteristic curve when the reference standard is multi-level. Using an ANN classifier trained with a set of 2 features (selected from a total of 30), namely compactness and average gray value, the test concordance between computer scores and the average reader scores was 0.789 ± 0.014. Our results show that the proposed method had strong agreement with the average of subtlety scores provided by radiologists.
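The abstract does not define the prediction probability statistic. The sketch below uses an assumed, standard pairwise-concordance formulation (not necessarily the exact statistic used in the study): among case pairs with different reference levels, count the fraction whose scores are ordered the same way as the reference, with score ties counted as half. With a binary reference this reduces to the empirical AUC.

```python
def prediction_probability(scores, reference):
    """Concordance between continuous scores and an ordinal reference.
    Only pairs with different reference levels are informative;
    ties in score count as half a concordant pair."""
    concordant, pairs = 0.0, 0
    n = len(scores)
    for i in range(n):
        for j in range(i + 1, n):
            if reference[i] == reference[j]:
                continue  # same reference level: pair carries no ranking info
            pairs += 1
            # positive product => scores ordered like the reference levels
            s = (scores[i] - scores[j]) * (reference[i] - reference[j])
            if s > 0:
                concordant += 1.0
            elif s == 0:
                concordant += 0.5
    return concordant / pairs

# Toy data: 4 nodules, subtlety reference levels 1-3 (hypothetical values).
pk = prediction_probability([0.2, 0.7, 0.6, 0.5], [1, 2, 3, 2])
```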
Affiliation(s)
- Xin He: US Food and Drug Administration, Center for Devices and Radiological Health, Office of Science and Engineering Laboratories, Division of Imaging and Applied Mathematics, 10903 New Hampshire Avenue, Silver Spring, MD 20993, USA