1
Tafavvoghi M, Bongo LA, Shvetsov N, Busund LTR, Møllersen K. Publicly available datasets of breast histopathology H&E whole-slide images: A scoping review. J Pathol Inform 2024; 15:100363. [PMID: 38405160 PMCID: PMC10884505 DOI: 10.1016/j.jpi.2024.100363] [Received: 09/14/2023] [Revised: 11/24/2023] [Accepted: 01/23/2024] [Indexed: 02/27/2024] Open
Abstract
Advancements in digital pathology and computing resources have made a significant impact in the field of computational pathology for breast cancer diagnosis and treatment. However, access to high-quality labeled histopathological images of breast cancer is a big challenge that limits the development of accurate and robust deep learning models. In this scoping review, we identified the publicly available datasets of breast H&E-stained whole-slide images (WSIs) that can be used to develop deep learning algorithms. We systematically searched 9 scientific literature databases and 9 research data repositories and found 17 publicly available datasets containing 10 385 H&E WSIs of breast cancer. Moreover, we reported image metadata and characteristics for each dataset to assist researchers in selecting proper datasets for specific tasks in breast cancer computational pathology. In addition, we compiled 2 lists of breast H&E patches and private datasets as supplementary resources for researchers. Notably, only 28% of the included articles utilized multiple datasets, and only 14% used an external validation set, suggesting that the performance of other developed models may be susceptible to overestimation. The TCGA-BRCA was used in 52% of the selected studies. This dataset has a considerable selection bias that can impact the robustness and generalizability of the trained algorithms. There is also a lack of consistent metadata reporting of breast WSI datasets that can be an issue in developing accurate deep learning models, indicating the necessity of establishing explicit guidelines for documenting breast WSI dataset characteristics and metadata.
Affiliation(s)
- Masoud Tafavvoghi
  - Department of Community Medicine, UiT The Arctic University of Norway, Tromsø, Norway
- Lars Ailo Bongo
  - Department of Computer Science, UiT The Arctic University of Norway, Tromsø, Norway
- Nikita Shvetsov
  - Department of Computer Science, UiT The Arctic University of Norway, Tromsø, Norway
- Kajsa Møllersen
  - Department of Community Medicine, UiT The Arctic University of Norway, Tromsø, Norway
2
Irmakci I, Nateghi R, Zhou R, Vescovo M, Saft M, Ross AE, Yang XJ, Cooper LAD, Goldstein JA. Tissue Contamination Challenges the Credibility of Machine Learning Models in Real World Digital Pathology. Mod Pathol 2024; 37:100422. [PMID: 38185250 PMCID: PMC10960671 DOI: 10.1016/j.modpat.2024.100422] [Received: 04/28/2023] [Revised: 11/13/2023] [Accepted: 12/15/2023] [Indexed: 01/09/2024]
Abstract
Machine learning (ML) models are poised to transform surgical pathology practice. The most successful models use attention mechanisms to examine whole slides, identify which areas of tissue are diagnostic, and use them to guide diagnosis. Tissue contaminants, such as floaters, represent unexpected tissue. Although human pathologists are extensively trained to consider and detect tissue contaminants, we examined their impact on ML models. We trained 4 whole-slide models. Three operate in placenta for the following functions: (1) detection of decidual arteriopathy, (2) estimation of gestational age, and (3) classification of macroscopic placental lesions. We also developed a model to detect prostate cancer in needle biopsies. We designed experiments wherein patches of contaminant tissue are randomly sampled from known slides and digitally added to patient slides, and measured model performance. We measured the proportion of attention given to contaminants and examined the impact of contaminants in the t-distributed stochastic neighbor embedding feature space. Every model showed performance degradation in response to one or more tissue contaminants. Decidual arteriopathy detection balanced accuracy decreased from 0.74 to 0.69 ± 0.01 with addition of 1 patch of prostate tissue for every 100 patches of placenta (1% contaminant). Bladder, added at 10% contaminant, raised the mean absolute error in estimating gestational age from 1.626 weeks to 2.371 ± 0.003 weeks. Blood, incorporated into placental sections, induced false-negative diagnoses of intervillous thrombi. Addition of bladder to prostate cancer needle biopsies induced false positives; a selection of high-attention patches, representing 0.033 mm², resulted in a 97% false-positive rate when added to needle biopsies. Contaminant patches received attention at or above the rate of the average patch of patient tissue. Tissue contaminants induce errors in modern ML models. The high level of attention given to contaminants indicates a failure to encode biological phenomena. Practitioners should move to quantify and ameliorate this problem.
Affiliation(s)
- Ismail Irmakci
  - Department of Pathology, Northwestern University, Feinberg School of Medicine, Chicago, Illinois
- Ramin Nateghi
  - Department of Pathology, Northwestern University, Feinberg School of Medicine, Chicago, Illinois
- Rujoi Zhou
  - Department of Pathology, Northwestern University, Feinberg School of Medicine, Chicago, Illinois
- Mariavittoria Vescovo
  - Department of Pathology, Northwestern University, Feinberg School of Medicine, Chicago, Illinois
- Madeline Saft
  - Department of Pathology, Northwestern University, Feinberg School of Medicine, Chicago, Illinois
- Ashley E Ross
  - Department of Pathology, Northwestern University, Feinberg School of Medicine, Chicago, Illinois
- Ximing J Yang
  - Department of Pathology, Northwestern University, Feinberg School of Medicine, Chicago, Illinois
- Lee A D Cooper
  - Department of Pathology, Northwestern University, Feinberg School of Medicine, Chicago, Illinois
- Jeffery A Goldstein
  - Department of Pathology, Northwestern University, Feinberg School of Medicine, Chicago, Illinois
3
Irmakci I, Nateghi R, Zhou R, Ross AE, Yang XJ, Cooper LAD, Goldstein JA. Tissue contamination challenges the credibility of machine learning models in real world digital pathology. medRxiv 2023:2023.04.28.23289287. [PMID: 37205404 PMCID: PMC10187357 DOI: 10.1101/2023.04.28.23289287] [Indexed: 05/21/2023]
Abstract
Machine learning (ML) models are poised to transform surgical pathology practice. The most successful models use attention mechanisms to examine whole slides, identify which areas of tissue are diagnostic, and use them to guide diagnosis. Tissue contaminants, such as floaters, represent unexpected tissue. While human pathologists are extensively trained to consider and detect tissue contaminants, we examined their impact on ML models. We trained 4 whole slide models. Three operate in placenta for 1) detection of decidual arteriopathy (DA), 2) estimation of gestational age (GA), and 3) classification of macroscopic placental lesions. We also developed a model to detect prostate cancer in needle biopsies. We designed experiments wherein patches of contaminant tissue are randomly sampled from known slides and digitally added to patient slides, and measured model performance. We measured the proportion of attention given to contaminants and examined the impact of contaminants in t-distributed stochastic neighbor embedding (tSNE) feature space. Every model showed performance degradation in response to one or more tissue contaminants. DA detection balanced accuracy decreased from 0.74 to 0.69 ± 0.01 with addition of 1 patch of prostate tissue for every 100 patches of placenta (1% contaminant). Bladder, added at 10% contaminant, raised the mean absolute error in estimating gestational age from 1.626 weeks to 2.371 ± 0.003 weeks. Blood, incorporated into placental sections, induced false negative diagnoses of intervillous thrombi. Addition of bladder to prostate cancer needle biopsies induced false positives; a selection of high-attention patches, representing 0.033 mm², resulted in a 97% false positive rate when added to needle biopsies. Contaminant patches received attention at or above the rate of the average patch of patient tissue. Tissue contaminants induce errors in modern ML models. The high level of attention given to contaminants indicates a failure to encode biological phenomena. Practitioners should move to quantify and ameliorate this problem.
Affiliation(s)
- Jeffery A. Goldstein
  - To whom correspondence should be addressed: Olson 2-455, 710 N. Fairbanks Ave, Chicago IL, 60611
4
Roberson J, Cuda JM, Davis Floyd AD, McGrath CM, Russell DK, Wendel-Spiczka A, VandenBussche CJ, Reynolds JP. Cross-contamination in cytology processing: a review of current practice. J Am Soc Cytopathol 2022; 11:194-200. [PMID: 35610099 DOI: 10.1016/j.jasc.2022.03.002] [Received: 04/20/2021] [Revised: 02/16/2022] [Accepted: 03/07/2022] [Indexed: 06/15/2023]
Abstract
INTRODUCTION: New cytopreparatory technologies decrease the need for direct smears in favor of an increased use of liquid-based cytology methods. Despite these practice changes, the Clinical Laboratory Improvement Amendments continue to require that cytopathology laboratories have procedures to prevent cross-contamination (CC). While the incidence of CC is not well documented, specific cytologic preparations and specimens with a high potential for CC have not been generally defined by professional guidelines or consensus. The American Society of Cytopathology Clinical Practice Committee surveyed cytology practitioners to better understand current practice related to CC in cytology. MATERIALS AND METHODS: The survey focused on four topics: (1) practice settings and demographic data; (2) current practice for meeting CC requirements; (3) practice for rapid on-site evaluation; and (4) preparation types considered high risk for CC. The survey was sent to all American Society of Cytopathology and American Society for Cytotechnology members from July 1 to August 14, 2020. RESULTS: Ninety-eight percent of laboratories had a written CC policy, with 66.18% of the policies addressing rapid on-site evaluation CC procedures. Documented cases of CC were rare. Alcohol-fixed, direct smears of Pap-stained fluids were deemed the most likely to be impacted by CC. Cell block contamination during histologic processing was reported by 56.20% of respondents. CONCLUSIONS: Changes in practice have resulted in a decrease in preparation types associated with a high potential for CC. Laboratories should follow a risk-based approach to define these cases. Knowledge of practice patterns among laboratories can guide the development and refinement of policy and procedures.
Affiliation(s)
- Janie Roberson
  - Department of Pathology, University of Alabama-Birmingham, Birmingham, Alabama
- Jacqueline M Cuda
  - Department of Pathology, University of Pittsburgh Medical Center, Pittsburgh, Pennsylvania
- Cindy M McGrath
  - Department of Pathology, University of Pennsylvania, Philadelphia, Pennsylvania
- Donna K Russell
  - Department of Pathology, University of Rochester Medical Center, Rochester, New York
- Jordan P Reynolds
  - Department of Laboratory Medicine and Pathology, Mayo Clinic, Jacksonville, Florida
5
Foran DJ, Durbin EB, Chen W, Sadimin E, Sharma A, Banerjee I, Kurc T, Li N, Stroup AM, Harris G, Gu A, Schymura M, Gupta R, Bremer E, Balsamo J, DiPrima T, Wang F, Abousamra S, Samaras D, Hands I, Ward K, Saltz JH. An Expandable Informatics Framework for Enhancing Central Cancer Registries with Digital Pathology Specimens, Computational Imaging Tools, and Advanced Mining Capabilities. J Pathol Inform 2022; 13:5. [PMID: 35136672 PMCID: PMC8794027 DOI: 10.4103/jpi.jpi_31_21] [Received: 04/30/2021] [Accepted: 04/30/2021] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: Population-based state cancer registries are an authoritative source for cancer statistics in the United States. They routinely collect a variety of data, including patient demographics, primary tumor site, stage at diagnosis, first course of treatment, and survival, on every cancer case that is reported across all U.S. states and territories. The goal of our project is to enrich NCI's Surveillance, Epidemiology, and End Results (SEER) registry data with high-quality population-based biospecimen data in the form of digital pathology, machine-learning-based classifications, and quantitative histopathology imaging feature sets (referred to here as Pathomics features). MATERIALS AND METHODS: As part of the project, the underlying informatics infrastructure was designed, tested, and implemented through close collaboration with several participating SEER registries to ensure consistency with registry processes, computational scalability, and ability to support creation of population cohorts that span multiple sites. Utilizing computational imaging algorithms and methods to both generate indices and search for matches makes it possible to reduce inter- and intra-observer inconsistencies and to improve the objectivity with which large image repositories are interrogated. RESULTS: Our team has created and continues to expand a well-curated repository of high-quality digitized pathology images corresponding to subjects whose data are routinely collected by the collaborating registries. Our team has systematically deployed and tested key visual analytic methods to facilitate automated creation of population cohorts for epidemiological studies and tools to support visualization of feature clusters and evaluation of whole-slide images. As part of these efforts, we are developing and optimizing advanced search and matching algorithms to facilitate automated, content-based retrieval of digitized specimens based on their underlying image features and staining characteristics. CONCLUSION: To meet the challenges of this project, we established the analytic pipelines, methods, and workflows to support the expansion and management of a growing repository of high-quality digitized pathology and information-rich population cohorts containing objective imaging and clinical attributes to facilitate studies that seek to discriminate among different subtypes of disease, stratify patient populations, and perform comparisons of tumor characteristics within and across patient cohorts. We have also successfully developed a suite of tools based on a deep-learning method to perform quantitative characterizations of tumor regions, assess infiltrating lymphocyte distributions, and generate objective nuclear feature measurements. As part of these efforts, our team has implemented reliable methods that enable investigators to systematically search through large repositories to automatically retrieve digitized pathology specimens and correlated clinical data based on their computational signatures.
Affiliation(s)
- David J. Foran
  - Center for Biomedical Informatics, Rutgers Cancer Institute of New Jersey, New Brunswick, NJ, USA
  - Department of Pathology and Laboratory Medicine, Rutgers-Robert Wood Johnson Medical School, Piscataway, NJ, USA
- Eric B. Durbin
  - Kentucky Cancer Registry, Markey Cancer Center, University of Kentucky, Lexington, KY, USA
  - Division of Biomedical Informatics, Department of Internal Medicine, College of Medicine, Lexington, KY, USA
- Wenjin Chen
  - Center for Biomedical Informatics, Rutgers Cancer Institute of New Jersey, New Brunswick, NJ, USA
- Evita Sadimin
  - Center for Biomedical Informatics, Rutgers Cancer Institute of New Jersey, New Brunswick, NJ, USA
  - Department of Pathology and Laboratory Medicine, Rutgers-Robert Wood Johnson Medical School, Piscataway, NJ, USA
- Ashish Sharma
  - Department of Biomedical Informatics, Emory University School of Medicine, Atlanta, GA, USA
- Imon Banerjee
  - Department of Biomedical Informatics, Emory University School of Medicine, Atlanta, GA, USA
- Tahsin Kurc
  - Department of Biomedical Informatics, Stony Brook University, Stony Brook, NY, USA
- Nan Li
  - Department of Biomedical Informatics, Emory University School of Medicine, Atlanta, GA, USA
- Antoinette M. Stroup
  - New Jersey State Cancer Registry, Rutgers Cancer Institute of New Jersey, New Brunswick, NJ, USA
- Gerald Harris
  - New Jersey State Cancer Registry, Rutgers Cancer Institute of New Jersey, New Brunswick, NJ, USA
- Annie Gu
  - Department of Biomedical Informatics, Emory University School of Medicine, Atlanta, GA, USA
- Maria Schymura
  - New York State Cancer Registry, New York State Department of Health, Albany, NY, USA
- Rajarsi Gupta
  - Department of Biomedical Informatics, Stony Brook University, Stony Brook, NY, USA
- Erich Bremer
  - Department of Biomedical Informatics, Stony Brook University, Stony Brook, NY, USA
- Joseph Balsamo
  - Department of Biomedical Informatics, Stony Brook University, Stony Brook, NY, USA
- Tammy DiPrima
  - Department of Biomedical Informatics, Stony Brook University, Stony Brook, NY, USA
- Feiqiao Wang
  - Department of Biomedical Informatics, Stony Brook University, Stony Brook, NY, USA
- Shahira Abousamra
  - Department of Computer Science, Stony Brook University, Stony Brook, NY, USA
- Dimitris Samaras
  - Department of Computer Science, Stony Brook University, Stony Brook, NY, USA
- Isaac Hands
  - Division of Biomedical Informatics, Department of Internal Medicine, College of Medicine, Lexington, KY, USA
- Kevin Ward
  - Georgia State Cancer Registry, Georgia Department of Public Health, Atlanta, GA, USA
- Joel H. Saltz
  - Department of Biomedical Informatics, Stony Brook University, Stony Brook, NY, USA
6
Egevad L, Delahunt B, Samaratunga H, Tsuzuki T, Yamamoto Y, Yaxley J, Ruusuvuori P, Kartasalo K, Eklund M. The emerging role of artificial intelligence in the reporting of prostate pathology. Pathology 2021; 53:565-567. [PMID: 34108086 DOI: 10.1016/j.pathol.2021.04.002] [Received: 04/07/2021] [Accepted: 04/28/2021] [Indexed: 11/18/2022]
Affiliation(s)
- Lars Egevad
  - Department of Oncology and Pathology, Karolinska Institutet, Stockholm, Sweden
- Brett Delahunt
  - Department of Pathology and Molecular Medicine, Wellington School of Medicine and Health Sciences, University of Otago, Wellington, New Zealand
- Toyonori Tsuzuki
  - Department of Surgical Pathology, Aichi Medical University, School of Medicine, Nagoya, Japan
- Yoichiro Yamamoto
  - Pathology Informatics Team, RIKEN Center for Advanced Intelligence Project, Tokyo, Japan
- John Yaxley
  - Wesley Urology Clinic, Brisbane, Qld, Australia
- Pekka Ruusuvuori
  - Faculty of Medicine and Health Technology, Tampere University, Tampere, Finland; Institute of Biomedicine, University of Turku, Turku, Finland
- Kimmo Kartasalo
  - Faculty of Medicine and Health Technology, Tampere University, Tampere, Finland; Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, Stockholm, Sweden
- Martin Eklund
  - Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, Stockholm, Sweden