1
Johansen MN, Parner ET, Kragh MF, Kato K, Ueno S, Palm S, Kernbach M, Balaban B, Keleş İ, Gabrielsen AV, Iversen LH, Berntsen J. Comparing performance between clinics of an embryo evaluation algorithm based on time-lapse images and machine learning. J Assist Reprod Genet 2023; 40:2129-2137. [PMID: 37423932 PMCID: PMC10440335 DOI: 10.1007/s10815-023-02871-3] [Received: 05/01/2023] [Accepted: 06/20/2023] [Indexed: 07/11/2023]
Abstract
PURPOSE: This article aims to assess how differences in maternal age distributions between IVF clinics affect the performance of an artificial intelligence model for embryo viability prediction and proposes a method to account for such differences.
METHODS: Using retrospectively collected data from 4805 fresh and frozen single blastocyst transfers of embryos incubated for 5 to 6 days, the discriminative performance was assessed based on fetal heartbeat outcomes. The data was collected from 4 clinics, and the discrimination was measured in terms of the area under ROC curves (AUC) for each clinic. To account for the different age distributions between clinics, a method for age-standardizing the AUCs was developed in which the clinic-specific AUCs were standardized using weights for each embryo according to the relative frequency of the maternal age in the relevant clinic compared to the age distribution in a common reference population.
RESULTS: There was substantial variation in the clinic-specific AUCs with estimates ranging from 0.58 to 0.69 before standardization. The age-standardization of the AUCs reduced the between-clinic variance by 16%. Most notably, three of the clinics had quite similar AUCs after standardization, while the last clinic had a markedly lower AUC both with and without standardization.
CONCLUSION: The method of using age-standardization of the AUCs that is proposed in this article mitigates some of the variability between clinics. This enables a comparison of clinic-specific AUCs where the difference in age distributions is accounted for.
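The weighting idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the use of discrete age groups, and the specific weight (reference frequency divided by clinic frequency, applied to every positive/negative pair as a product) are assumptions made for illustration only.

```python
import numpy as np

def age_standardization_weights(age_groups, reference_freq):
    """Weight each embryo by reference frequency / clinic frequency of its age group.

    `reference_freq` is a hypothetical dict mapping age group -> relative
    frequency in the common reference population (an assumed input format).
    """
    age_groups = np.asarray(age_groups)
    groups, counts = np.unique(age_groups, return_counts=True)
    clinic_freq = dict(zip(groups.tolist(), (counts / counts.sum()).tolist()))
    return np.array([reference_freq[g] / clinic_freq[g] for g in age_groups.tolist()])

def weighted_auc(scores, labels, weights):
    """AUC as a weighted fraction of correctly ordered (positive, negative) pairs."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    weights = np.asarray(weights, dtype=float)
    pos_s, neg_s = scores[labels], scores[~labels]
    pos_w, neg_w = weights[labels], weights[~labels]
    # 1 for a correctly ordered pair, 0.5 for a tie, 0 otherwise.
    wins = (pos_s[:, None] > neg_s[None, :]) + 0.5 * (pos_s[:, None] == neg_s[None, :])
    # Assumption: a pair's weight is the product of the two embryo weights.
    pair_w = pos_w[:, None] * neg_w[None, :]
    return float((wins * pair_w).sum() / pair_w.sum())
```

With all weights equal to 1, `weighted_auc` reduces to the ordinary AUC, so the standardized and unstandardized values can be computed with the same function and compared per clinic.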
Affiliation(s)
- Erik T Parner
  - Section for Biostatistics, Department of Public Health, Aarhus University, Aarhus, Denmark
- Mikkel F Kragh
  - Vitrolife A/S, Jens Juuls Vej 18-20, 8260, Viby J, Denmark
  - The AI Lab Aps, Aarhus, Denmark
- İpek Keleş
  - Koc University Hospital, Istanbul, Turkey
- Lea H Iversen
  - Fertility Clinic, Horsens Regional Hospital, Horsens, Denmark
2
Kragh MF, Rimestad J, Lassen JT, Berntsen J, Karstoft H. Predicting Embryo Viability Based on Self-Supervised Alignment of Time-Lapse Videos. IEEE Trans Med Imaging 2022; 41:465-475. [PMID: 34596537 DOI: 10.1109/tmi.2021.3116986] [Indexed: 06/13/2023]
Abstract
With self-supervised learning, both labeled and unlabeled data can be used for representation learning and model pretraining. This is particularly relevant when automating the selection of a patient's fertilized eggs (embryos) during a fertility treatment, in which only the embryos that were transferred to the female uterus may have labels of pregnancy. In this paper, we apply a self-supervised video alignment method known as temporal cycle-consistency (TCC) on 38176 time-lapse videos of developing embryos, of which 14550 were labeled. We show how TCC can be used to extract temporal similarities between embryo videos and use these for predicting pregnancy likelihood. Our temporal similarity method outperforms the time alignment measurement (TAM) with an area under the receiver operating characteristic curve (AUC) of 0.64 vs. 0.56. Compared to existing embryo evaluation models, it places in between a pure temporal and a spatio-temporal model that both require manual annotations. Furthermore, we use TCC for transfer learning in a semi-supervised fashion and show significant performance improvements compared to standard supervised learning when only a small subset of the dataset is labeled. Specifically, two variants of transfer learning both achieve an AUC of 0.66 compared to 0.63 for supervised learning when 16% of the dataset is labeled.
3
Kragh MF, Lassen JT, Rimestad J, Berntsen J. O-123 Calibration of artificial intelligence (AI) models is necessary to reflect actual implantation probabilities with image-based embryo selection. Hum Reprod 2021. [DOI: 10.1093/humrep/deab126.048] [Indexed: 11/14/2022]
Abstract
Study question
Do AI models for embryo selection provide actual implantation probabilities that generalise across clinics and patient demographics?
Summary answer
AI models need to be calibrated on representative data before providing reasonable agreements between predicted scores and actual implantation probabilities.
What is known already
AI models have been shown to perform well at discriminating embryos according to implantation likelihood, measured by area under the curve (AUC). However, discrimination performance does not indicate how well models predict actual implantation likelihood, especially across clinics and patient demographics. In general, prediction models must be calibrated on representative data to provide meaningful probabilities. Calibration can be evaluated and summarised by “expected calibration error” (ECE) on score deciles and tested for significant lack of calibration using the Hosmer-Lemeshow goodness-of-fit test. ECE describes the average deviation between predicted probabilities and observed implantation rates and is 0 for perfect calibration.
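As an illustration of ECE on score deciles, the following is a minimal sketch. It assumes equal-count bins formed by sorting the predicted probabilities and a bin-size-weighted average of the absolute gaps; the abstract does not specify these details, and the function name is hypothetical.

```python
import numpy as np

def expected_calibration_error(probs, outcomes, n_bins=10):
    """Average |mean predicted probability - observed rate| over score-quantile bins."""
    probs = np.asarray(probs, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    # Assumption: deciles are equal-count bins of the sorted scores.
    order = np.argsort(probs, kind="stable")
    ece = 0.0
    for idx in np.array_split(order, n_bins):
        if idx.size == 0:
            continue
        gap = abs(probs[idx].mean() - outcomes[idx].mean())
        # Each bin contributes in proportion to the number of embryos it holds.
        ece += idx.size / probs.size * gap
    return float(ece)
```

A model that predicts 0.8 for every embryo while all of them implant would give an ECE of 0.2 under this definition, and a perfectly calibrated model gives 0.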
Study design, size, duration
Time-lapse embryo videos from 18 clinics were used to develop AI models for prediction of fetal heartbeat (FHB). Model generalisation was evaluated on clinic hold-out models for the three largest clinics. Calibration curves were used to evaluate the agreement between AI-predicted scores and observed FHB outcome and summarised by ECE. Models were evaluated 1) without calibration, 2) with calibration (Platt scaling) on other clinics’ data, and 3) with calibration on the clinic’s own data (30%/70% for calibration/evaluation).
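Platt scaling maps a raw model score to a probability through a fitted sigmoid. The sketch below assumes a plain maximum-likelihood fit of p = sigmoid(a * score + b) using Newton's method, without the target smoothing from Platt's original formulation; the function names are hypothetical and this is not the study's code.

```python
import numpy as np

def fit_platt(scores, outcomes, n_iter=25):
    """Fit p = sigmoid(a * score + b) by maximum likelihood with Newton's method."""
    s = np.asarray(scores, dtype=float)
    y = np.asarray(outcomes, dtype=float)
    X = np.column_stack([s, np.ones_like(s)])  # columns: score, intercept
    w = np.zeros(2)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        grad = X.T @ (p - y)
        # Tiny ridge term on the Hessian keeps the linear solve stable.
        hess = X.T @ (X * (p * (1.0 - p))[:, None]) + 1e-9 * np.eye(2)
        w -= np.linalg.solve(hess, grad)
    return float(w[0]), float(w[1])

def apply_platt(scores, a, b):
    """Convert raw scores to calibrated probabilities with fitted (a, b)."""
    return 1.0 / (1.0 + np.exp(-(a * np.asarray(scores, dtype=float) + b)))
```

In a hold-out setting such as the one described, `fit_platt` would be run on the calibration split (other clinics' data or 30% of the clinic's own data) and `apply_platt` on the evaluation split. Because the mapping is monotone when a > 0, it leaves the ranking of embryos, and hence the AUC, unchanged.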
Participants/materials, setting, methods
A previously described AI algorithm, iDAScore, based on 115,842 time-lapse sequences of embryos, including 14,644 transferred embryos with known implantation data (KID), was used as the foundation for training hold-out AI models for the three largest clinics (n = 2,829; 2,673; 1,327 KID embryos), such that their data were not included during model training. ECEs across the three clinics (mean±SD) were compared for models with/without calibration using KID embryos only, both overall and within subgroups of patient age (<36, 36-40, >40 years).
Main results and the role of chance
The AUC across the three clinics was 0.675±0.041 (mean±SD) and unaffected by calibration. Without calibration, overall ECE was 0.223±0.057, indicating weak agreements between scores and actual implantation rates. With calibration on other clinics’ data, overall ECE was 0.040±0.013, indicating considerable improvements with moderate clinical variation.
As implantation probabilities are affected by both clinical practice and patient demographics, subgroup analysis was conducted on patient age (<36, 36-40, >40 years). With calibration on other clinics’ data, age-group ECEs were (0.129±0.055 vs. 0.078±0.033 vs. 0.072±0.015). These calibration errors were thus larger than the overall average ECE of 0.040, indicating poor generalisation across age. Including age as input to the calibration, age-group ECEs were (0.088±0.042 vs. 0.075±0.046 vs. 0.051±0.025), indicating improved agreements between scores and implantation rates across both clinics and age groups. With calibration including age on the clinic’s own data, however, the best calibrations were obtained with ECEs (0.060±0.017 vs. 0.040±0.010 vs. 0.039±0.009). The results indicate that both clinical practice and patient demographics influence calibration and thus ideally should be adjusted for. Testing lack of calibration using Hosmer-Lemeshow goodness-of-fit, only one age-group from one clinic appeared miscalibrated (P = 0.02), whereas all other age-groups from the three clinics were appropriately calibrated (P > 0.10).
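The Hosmer-Lemeshow test referred to here can be sketched as follows. The sketch assumes ten equal-count groups of sorted predictions and the usual chi-square reference distribution with (groups - 2) degrees of freedom; the grouping used in the study is not stated, and the function name is hypothetical.

```python
import numpy as np
from scipy.stats import chi2

def hosmer_lemeshow(probs, outcomes, n_groups=10):
    """Return (statistic, p_value) of the Hosmer-Lemeshow goodness-of-fit test."""
    probs = np.asarray(probs, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    order = np.argsort(probs, kind="stable")
    stat = 0.0
    for idx in np.array_split(order, n_groups):
        n = idx.size
        observed = outcomes[idx].sum()   # observed events in the group
        expected = probs[idx].sum()      # events expected from the predictions
        mean_p = expected / n
        # Assumes 0 < mean_p < 1 in every group; otherwise the term is undefined.
        stat += (observed - expected) ** 2 / (n * mean_p * (1.0 - mean_p))
    p_value = chi2.sf(stat, df=n_groups - 2)
    return float(stat), float(p_value)
```

A small p-value (such as the P = 0.02 reported for one age group) suggests that predicted and observed rates differ by more than chance would explain, while a large p-value gives no evidence of miscalibration.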
Limitations, reasons for caution
In this study, AI model calibration was conducted based on clinic and age. Other patient metadata such as BMI and patient diagnosis may be relevant to calibrate for as well. However, for both calibration and evaluation on the clinic’s own data, a substantial amount of data for each subgroup is needed.
Wider implications of the findings
With calibrated scores, AI models can predict actual implantation likelihood for each embryo. Probability estimates are a strong tool for patient communication and clinical decisions such as deciding when to discard/freeze embryos. Model calibration may thus be the next step in improving clinical outcome and shortening time to live birth.
Trial registration number
This work is partly funded by the Innovation Fund Denmark (IFD) under File No. 7039-00068B and partly funded by Vitrolife A/S
Affiliation(s)
- M F Kragh
  - Vitrolife A/S & Aarhus University, Product Development, Viby J, Denmark
- J T Lassen
  - Vitrolife A/S, Product Development, Viby J, Denmark
- J Rimestad
  - Vitrolife A/S, Product Development, Viby J, Denmark
- J Berntsen
  - Vitrolife A/S, Product Development, Viby J, Denmark