1
Theilgaard Lassen J, Fly Kragh M, Rimestad J, Nygård Johansen M, Berntsen J. Development and validation of deep learning based embryo selection across multiple days of transfer. Sci Rep 2023; 13:4235. [PMID: 36918648 PMCID: PMC10015019 DOI: 10.1038/s41598-023-31136-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/22/2022] [Accepted: 03/07/2023] [Indexed: 03/16/2023]
Abstract
This work describes the development and validation of a fully automated deep learning model, iDAScore v2.0, for the evaluation of human embryos incubated for 2, 3, and 5 or more days. We trained and evaluated the model on an extensive and diverse dataset including 181,428 embryos from 22 IVF clinics across the world. To discriminate the transferred embryos with known outcome, we show areas under the receiver operating characteristic curve (AUC) ranging from 0.621 to 0.707 depending on the day of transfer. Predictive performance increased over time and showed a strong correlation with morphokinetic parameters. The model's performance is equivalent to the KIDScore D3 model on day 3 embryos, while it significantly surpasses the performance of KIDScore D5 v3 on day 5+ embryos. This model analyzes time-lapse sequences without the need for user input and provides a reliable method for ranking embryos by their likelihood of implantation, at both cleavage and blastocyst stages. This greatly improves embryo grading consistency and saves time compared to traditional embryo evaluation methods.
2
Kato K, Ueno S, Berntsen J, Kragh MF, Okimura T, Kuroda T. Does embryo categorization by existing artificial intelligence, morphokinetic or morphological embryo selection models correlate with blastocyst euploidy rates? Reprod Biomed Online 2023; 46:274-281. [PMID: 36470714 DOI: 10.1016/j.rbmo.2022.09.010] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/26/2022] [Revised: 08/28/2022] [Accepted: 09/12/2022] [Indexed: 02/07/2023]
Abstract
RESEARCH QUESTION
Does embryo categorization by existing artificial intelligence (AI), morphokinetic or morphological embryo selection models correlate with blastocyst euploidy?
DESIGN
A total of 834 patients (mean maternal age 40.5 ± 3.4 years) who underwent preimplantation genetic testing for aneuploidies (PGT-A) on a total of 3573 tested blastocysts were included in this retrospective study. The cycles were stratified into five maternal age groups according to the Society for Assisted Reproductive Technology age groups (<35, 35-37, 38-40, 41-42 and >42 years). The main outcome of this study was the correlation of euploidy rates in stratified maternal age groups with an automated AI model (iDAScore® v1.0), a morphokinetic embryo selection model (KIDScore Day 5 ver 3, KS-D5) and a traditional morphological grading model (Gardner criteria), respectively.
RESULTS
Euploidy rates were significantly correlated with iDAScore (P = 0.0035 to <0.001) in all age groups and, except for the youngest age group, with KS-D5 and Gardner criteria (all P < 0.0001). Additionally, multivariate logistic regression analysis showed that for all models, higher scores were significantly correlated with euploidy (all P < 0.0001).
CONCLUSION
These results show that existing blastocyst scoring models correlate with ploidy status. However, as these models were developed to indicate implantation potential, they cannot accurately diagnose whether an embryo is euploid or aneuploid. Instead, they may be used to support the decision of how many and which blastocysts to biopsy, thus potentially reducing patient costs.
Affiliation(s)
- Keiichi Kato
  - Kato Ladies Clinic, 7-20-3, Nishishinjuku, Shinjuku Tokyo 160-0023, Japan
- Satoshi Ueno
  - Kato Ladies Clinic, 7-20-3, Nishishinjuku, Shinjuku Tokyo 160-0023, Japan
- Tadashi Okimura
  - Kato Ladies Clinic, 7-20-3, Nishishinjuku, Shinjuku Tokyo 160-0023, Japan
- Tomoko Kuroda
  - Kato Ladies Clinic, 7-20-3, Nishishinjuku, Shinjuku Tokyo 160-0023, Japan
3
Berntsen J, Rimestad J, Lassen JT, Tran D, Kragh MF. Robust and generalizable embryo selection based on artificial intelligence and time-lapse image sequences. PLoS One 2022; 17:e0262661. [PMID: 35108306 PMCID: PMC8809568 DOI: 10.1371/journal.pone.0262661] [Citation(s) in RCA: 34] [Impact Index Per Article: 17.0] [Received: 02/25/2021] [Accepted: 01/03/2022] [Indexed: 01/31/2023]
Abstract
Assessing and selecting the most viable embryos for transfer is an essential part of in vitro fertilization (IVF). In recent years, several attempts have been made to improve and automate the procedure using artificial intelligence (AI) and deep learning. Based on images of embryos with known implantation data (KID), AI models have been trained to automatically score embryos according to their chance of achieving a successful implantation. However, as of now, only limited research has been conducted to evaluate how embryo selection models generalize to new clinics and how they perform in subgroup analyses across various conditions. In this paper, we investigate how a deep learning-based embryo selection model using only time-lapse image sequences performs across different patient ages and clinical conditions, and how it correlates with traditional morphokinetic parameters. The model was trained and evaluated based on a large dataset from 18 IVF centers consisting of 115,832 embryos, of which 14,644 embryos were transferred KID embryos. In an independent test set, the AI model sorted KID embryos with an area under the curve (AUC) of a receiver operating characteristic curve of 0.67 and all embryos with an AUC of 0.95. A clinic hold-out test showed that the model generalized to new clinics with an AUC range of 0.60–0.75 for KID embryos. Across different subgroups of age, insemination method, incubation time, and transfer protocol, the AUC ranged between 0.63 and 0.69. Furthermore, model predictions correlated positively with blastocyst grading and negatively with direct cleavages. The fully automated iDAScore v1.0 model was shown to perform at least as well as a state-of-the-art manual embryo selection model. Moreover, full automation of embryo scoring implies fewer manual evaluations and eliminates biases due to inter- and intraobserver variation.
Affiliation(s)
- Dang Tran
  - Harrison AI, Sydney, New South Wales, Australia
- Mikkel Fly Kragh
  - Vitrolife A/S, Aarhus, Denmark
  - Department of Electrical and Computer Engineering, Aarhus University, Aarhus, Denmark
4
Berntsen J, Lassen JT, Kragh MF, Rimestad J. FULL AUTOMATION OF EMBRYO EVALUATION MODELS BENEFITS FROM TRAINING ON BOTH TRANSFERRED AND DISCARDED EMBRYOS. Fertil Steril 2021. [DOI: 10.1016/j.fertnstert.2021.07.239] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/20/2022]
5
Kragh MF, Lassen JT, Rimestad J, Berntsen J. O-123 Calibration of artificial intelligence (AI) models is necessary to reflect actual implantation probabilities with image-based embryo selection. Hum Reprod 2021. [DOI: 10.1093/humrep/deab126.048] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/14/2022]
Abstract
Study question
Do AI models for embryo selection provide actual implantation probabilities that generalise across clinics and patient demographics?
Summary answer
AI models need to be calibrated on representative data before providing reasonable agreements between predicted scores and actual implantation probabilities.
What is known already
AI models have been shown to perform well at discriminating embryos according to implantation likelihood, measured by area under curve (AUC). However, discrimination performance does not indicate how well models predict actual implantation likelihood, especially across clinics and patient demographics. In general, prediction models must be calibrated on representative data to provide meaningful probabilities. Calibration can be evaluated and summarised by “expected calibration error” (ECE) on score deciles and tested for significant lack of calibration using the Hosmer-Lemeshow goodness-of-fit test. ECE describes the average deviation between predicted probabilities and observed implantation rates and is 0 for perfect calibration.
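The ECE described above can be illustrated with a minimal Python sketch. It assumes equal-count bins sorted by score (deciles when `n_bins=10`) and a size-weighted average of absolute deviations; the abstract does not state the exact binning or weighting used, and the function name and toy data are hypothetical.

```python
def expected_calibration_error(scores, outcomes, n_bins=10):
    """Size-weighted mean of |mean predicted score - observed rate| over equal-count bins."""
    if len(scores) != len(outcomes) or len(scores) == 0:
        raise ValueError("scores and outcomes must be non-empty and of equal length")
    # Sort by score only, so that the bins are score quantiles (deciles for n_bins=10).
    pairs = sorted(zip(scores, outcomes), key=lambda pair: pair[0])
    n = len(pairs)
    ece = 0.0
    for b in range(n_bins):
        chunk = pairs[b * n // n_bins:(b + 1) * n // n_bins]
        if not chunk:
            continue  # can happen when there are fewer samples than bins
        mean_score = sum(s for s, _ in chunk) / len(chunk)
        observed_rate = sum(y for _, y in chunk) / len(chunk)
        ece += (len(chunk) / n) * abs(mean_score - observed_rate)
    return ece


# Hypothetical toy data (assumption, not values from the study).
toy_scores = [0.9, 0.9, 0.9, 0.9, 0.1, 0.1, 0.1, 0.1]
toy_outcomes = [1, 1, 0, 0, 0, 0, 0, 0]
toy_ece = expected_calibration_error(toy_scores, toy_outcomes, n_bins=2)
```

In the toy data the low-score bin predicts 0.1 against an observed rate of 0, and the high-score bin predicts 0.9 against an observed rate of 0.5, so the two deviations of 0.1 and 0.4 average to about 0.25.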
Study design, size, duration
Time-lapse embryo videos from 18 clinics were used to develop AI models for prediction of fetal heartbeat (FHB). Model generalisation was evaluated on clinic hold-out models for the three largest clinics. Calibration curves were used to evaluate the agreement between AI-predicted scores and observed FHB outcome and summarised by ECE. Models were evaluated 1) without calibration, 2) with calibration (Platt scaling) on other clinics’ data, and 3) with calibration on the clinic’s own data (30%/70% split for calibration/evaluation).
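Platt scaling, as named in the study design, fits a logistic function that maps raw scores to probabilities. The sketch below is a generic illustration, not the authors' implementation: it assumes a two-parameter model `sigmoid(a * score + b)` fitted by plain gradient descent, and the function names, learning rate, iteration count and toy data are all hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_platt(scores, outcomes, lr=0.5, n_iter=5000):
    """Fit p = sigmoid(a * score + b) by gradient descent on the mean log-loss."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(n_iter):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, outcomes):
            err = sigmoid(a * s + b) - y  # derivative of log-loss w.r.t. the logit
            grad_a += err * s
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


def apply_platt(scores, a, b):
    """Map raw scores to calibrated probabilities with fitted parameters."""
    return [sigmoid(a * s + b) for s in scores]


# Hypothetical calibration set (assumption, not data from the study).
cal_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95]
cal_outcomes = [0, 0, 0, 1, 0, 0, 1, 0, 1, 1]
a, b = fit_platt(cal_scores, cal_outcomes)
calibrated = apply_platt(cal_scores, a, b)
```

Because the fitted model includes an intercept, the mean calibrated probability on the calibration set matches the observed outcome rate once the fit has converged. In a setup like the one described, the parameters would be fitted on the calibration portion and applied to the held-out evaluation portion.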
Participants/materials, setting, methods
A previously described AI algorithm, iDAScore, based on 115,842 time-lapse sequences of embryos, including 14,644 transferred embryos with known implantation data (KID), was used as the foundation for training hold-out AI models for the three largest clinics (n = 2,829; 2,673; 1,327 KID embryos), such that their data were not included during model training. ECEs across the three clinics (mean±SD) were compared for models with/without calibration using KID embryos only, both overall and within subgroups of patient age (<36, 36-40, >40 years).
Main results and the role of chance
The AUC across the three clinics was 0.675±0.041 (mean±SD) and unaffected by calibration. Without calibration, overall ECE was 0.223±0.057, indicating weak agreements between scores and actual implantation rates. With calibration on other clinics’ data, overall ECE was 0.040±0.013, indicating considerable improvements with moderate clinical variation.
As implantation probabilities are affected by both clinical practice and patient demographics, subgroup analysis was conducted on patient age (<36, 36-40, >40 years). With calibration on other clinics’ data, age-group ECEs were 0.129±0.055 vs. 0.078±0.033 vs. 0.072±0.015. These calibration errors were thus larger than the overall average ECE of 0.040, indicating poor generalisation across age. When age was included as an input to the calibration, age-group ECEs were 0.088±0.042 vs. 0.075±0.046 vs. 0.051±0.025, indicating improved agreements between scores and implantation rates across both clinics and age groups. The best calibrations, however, were obtained with calibration including age on the clinic’s own data, with ECEs of 0.060±0.017 vs. 0.040±0.010 vs. 0.039±0.009. The results indicate that both clinical practice and patient demographics influence calibration and thus ideally should be adjusted for. Testing lack of calibration using Hosmer-Lemeshow goodness-of-fit, only one age-group from one clinic appeared miscalibrated (P = 0.02), whereas all other age-groups from the three clinics were appropriately calibrated (P > 0.10).
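The Hosmer-Lemeshow test mentioned above can be sketched as follows. This is a textbook form of the statistic under stated assumptions, not the study's code: groups are equal-count score quantiles, the p-value helper only supports an even number of degrees of freedom (closed-form chi-square tail), and the choice of `n_groups - 2` degrees of freedom is the conventional one for a development sample; the abstract does not say which was used. All names and data are hypothetical.

```python
import math


def hosmer_lemeshow_statistic(probs, outcomes, n_groups=10):
    """Sum over score-quantile groups of (observed - expected)^2 / binomial variance."""
    # Stable sort by predicted probability only.
    pairs = sorted(zip(probs, outcomes), key=lambda pair: pair[0])
    n = len(pairs)
    statistic = 0.0
    for g in range(n_groups):
        chunk = pairs[g * n // n_groups:(g + 1) * n // n_groups]
        if not chunk:
            continue
        n_g = len(chunk)
        expected = sum(p for p, _ in chunk)
        observed = sum(y for _, y in chunk)
        mean_p = expected / n_g
        variance = n_g * mean_p * (1.0 - mean_p)
        if variance == 0.0:
            raise ValueError("group mean probability of exactly 0 or 1 leaves the statistic undefined")
        statistic += (observed - expected) ** 2 / variance
    return statistic


def chi2_sf_even_df(x, df):
    """Upper tail of the chi-square distribution, closed form valid for even df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this helper only supports positive even degrees of freedom")
    half = x / 2.0
    term = total = 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


# Hypothetical toy data (assumption): every group has 2 events among 4 embryos scored 0.5.
toy_probs = [0.5] * 16
toy_outcomes = [0, 1] * 8
hl_stat = hosmer_lemeshow_statistic(toy_probs, toy_outcomes, n_groups=4)
hl_p = chi2_sf_even_df(hl_stat, df=4 - 2)
```

A large p-value, as in this toy case where observed and expected counts agree in every group, gives no evidence of miscalibration; a small one (such as the P = 0.02 reported for one age group) suggests a lack of fit.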
Limitations, reasons for caution
In this study, AI model calibration was conducted based on clinic and age. Other patient metadata such as BMI and patient diagnosis may also be relevant to calibrate for. However, for both calibration and evaluation on the clinic’s own data, a substantial amount of data for each subgroup is needed.
Wider implications of the findings
With calibrated scores, AI models can predict actual implantation likelihood for each embryo. Probability estimates are a strong tool for patient communication and clinical decisions such as deciding when to discard/freeze embryos. Model calibration may thus be the next step in improving clinical outcome and shortening time to live birth.
Trial registration number
This work is partly funded by the Innovation Fund Denmark (IFD) under File No. 7039-00068B and partly funded by Vitrolife A/S
Affiliation(s)
- M F Kragh
  - Vitrolife A/S & Aarhus University, Product Development, Viby J, Denmark
- J T Lassen
  - Vitrolife A/S, Product Development, Viby J, Denmark
- J Rimestad
  - Vitrolife A/S, Product Development, Viby J, Denmark
- J Berntsen
  - Vitrolife A/S, Product Development, Viby J, Denmark
6
Abstract
Embryo selection within in vitro fertilization (IVF) is the process of evaluating qualities of fertilized oocytes (embryos) and selecting the best embryo(s) available within a patient cohort for subsequent transfer or cryopreservation. In recent years, artificial intelligence (AI) has been used extensively to improve and automate the embryo ranking and selection procedure by extracting relevant information from embryo microscopy images. The AI models are evaluated based on their ability to identify the embryo(s) with the highest chance(s) of achieving a successful pregnancy. Whether such evaluations should be based on ranking performance or pregnancy prediction, however, seems to divide studies. As such, a variety of performance metrics are reported, and comparisons between studies are often made on different outcomes and data foundations. Moreover, superiority of AI methods over manual human evaluation is often claimed based on retrospective data, without any mention of potential bias. In this paper, we provide a technical view on some of the major topics that divide how current AI models are trained, evaluated and compared. We explain and discuss the most common evaluation metrics and relate them to the two separate evaluation objectives, ranking and prediction. We also discuss when and how to compare AI models across studies and explain in detail how a selection bias is inevitable when comparing AI models against current embryo selection practice in retrospective cohort studies.
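The distinction between the ranking and prediction objectives can be made concrete with a small Python sketch. It is an illustration only: AUC is computed from pairwise comparisons and the Brier score stands in as one possible prediction metric; the paper's own choice of metrics is not reproduced here, and the function names and toy data are hypothetical.

```python
def auc(scores, outcomes):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, outcomes) if y == 1]
    neg = [s for s, y in zip(scores, outcomes) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative outcome")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def brier_score(probs, outcomes):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


# Hypothetical toy data (assumption, not from any study).
scores = [0.1, 0.4, 0.35, 0.8]
outcomes = [0, 0, 1, 1]
# A strictly increasing transform keeps the ranking but changes the probabilities.
squashed = [s ** 3 for s in scores]

auc_original, auc_squashed = auc(scores, outcomes), auc(squashed, outcomes)
brier_original, brier_squashed = brier_score(scores, outcomes), brier_score(squashed, outcomes)
```

Here the AUC is identical for both score sets because only the order of embryos matters for ranking, while the Brier score differs because it depends on the probability values themselves.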
Affiliation(s)
- Mikkel Fly Kragh
  - Department of Electrical and Computer Engineering, Aarhus University, Aarhus N, Denmark
  - Vitrolife A/S, Viby J, Denmark
- Henrik Karstoft
  - Department of Electrical and Computer Engineering, Aarhus University, Aarhus N, Denmark
7
Berntsen J, Rimestad J, Lassen JT, Kragh MF. OPENING THE BLACK BOX: RELATION BETWEEN AI PREDICTED EMBRYO IMPLANTATION AND TRADITIONAL MORPHOKINETIC AND MORPHOLOGICAL ANNOTATIONS. Fertil Steril 2020. [DOI: 10.1016/j.fertnstert.2020.08.231] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/16/2022]