1
|
Yao T, Lu Y, Long J, Jha A, Zhu Z, Asad Z, Yang H, Fogo AB, Huo Y. Glo-In-One: holistic glomerular detection, segmentation, and lesion characterization with large-scale web image mining. J Med Imaging (Bellingham) 2022; 9:052408. [PMID: 35747553 PMCID: PMC9207519 DOI: 10.1117/1.jmi.9.5.052408] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 10/31/2021] [Accepted: 05/31/2022] [Indexed: 11/14/2022]
Abstract
Purpose: The quantitative detection, segmentation, and characterization of glomeruli from high-resolution whole slide imaging (WSI) play essential roles in the computer-assisted diagnosis and scientific research in digital renal pathology. Historically, such comprehensive quantification requires extensive programming skills to be able to handle heterogeneous and customized computational tools. To bridge the gap of performing glomerular quantification for non-technical users, we develop the Glo-In-One toolkit to achieve holistic glomerular detection, segmentation, and characterization via a single line of command. Additionally, we release a large-scale collection of 30,000 unlabeled glomerular images to further facilitate the algorithmic development of self-supervised deep learning. Approach: The inputs of the Glo-In-One toolkit are WSIs, while the outputs are (1) WSI-level multi-class circle glomerular detection results (which can be directly manipulated with ImageScope), (2) glomerular image patches with segmentation masks, and (3) different lesion types. In the current version, the fine-grained global glomerulosclerosis (GGS) characterization is provided, including assessed-solidified-GSS (associated with hypertension-related injury), disappearing-GSS (a further end result of the SGGS becoming contiguous with fibrotic interstitium), and obsolescent-GSS (nonspecific GGS increasing with aging) glomeruli. To leverage the performance of the Glo-In-One toolkit, we introduce self-supervised deep learning to glomerular quantification via large-scale web image mining. Results: The GGS fine-grained classification model achieved a decent performance compared with baseline supervised methods while only using 10% of the annotated data. The glomerular detection achieved an average precision of 0.627 with circle representations, while the glomerular segmentation achieved a 0.955 patch-wise Dice similarity coefficient.
Conclusion: We develop and release an open-source Glo-In-One toolkit, a software with holistic glomerular detection, segmentation, and lesion characterization. This toolkit is user-friendly to non-technical users via a single line of command. The toolbox and the 30,000 web mined glomerular images have been made publicly available at https://github.com/hrlblab/Glo-In-One.
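For readers unfamiliar with the patch-wise Dice similarity coefficient reported in the Results, the following is a minimal sketch of how such a score can be computed for one pair of binary masks. It is not taken from the Glo-In-One code base; the function name `dice_coefficient`, the nested-list mask format, and the toy masks are assumptions made for illustration.

```python
def dice_coefficient(pred, truth):
    """Dice = 2|A ∩ B| / (|A| + |B|) for two binary masks given as nested lists."""
    # Flatten both masks and coerce every pixel to a boolean.
    flat_pred = [bool(v) for row in pred for v in row]
    flat_truth = [bool(v) for row in truth for v in row]
    if len(flat_pred) != len(flat_truth):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(p and t for p, t in zip(flat_pred, flat_truth))
    total = sum(flat_pred) + sum(flat_truth)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Hypothetical 2x3 masks: 3 foreground pixels each, 2 of them shared.
pred_mask = [[0, 1, 1],
             [0, 1, 0]]
true_mask = [[0, 1, 0],
             [0, 1, 1]]
score = dice_coefficient(pred_mask, true_mask)
print(f"Dice = {score:.3f}")
```

A patch-wise figure such as the one in the abstract would typically be an average of this score over many patches, though the exact averaging used by the authors is not stated here.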
Affiliation(s)
- Tianyuan Yao
  - Vanderbilt University, Department of Computer Science, Nashville, Tennessee, United States
- Yuzhe Lu
  - Vanderbilt University, Department of Computer Science, Nashville, Tennessee, United States
- Jun Long
  - Central South University, Big Data Institute, Changsha, China
- Aadarsh Jha
  - Vanderbilt University, Department of Computer Science, Nashville, Tennessee, United States
- Zheyu Zhu
  - Vanderbilt University, Department of Computer Science, Nashville, Tennessee, United States
- Zuhayr Asad
  - Vanderbilt University, Department of Computer Science, Nashville, Tennessee, United States
- Haichun Yang
  - Vanderbilt University Medical Center, Department of Pathology, Microbiology and Immunology, Nashville, Tennessee, United States
- Agnes B. Fogo
  - Vanderbilt University Medical Center, Department of Pathology, Microbiology and Immunology, Nashville, Tennessee, United States
- Yuankai Huo
  - Vanderbilt University, Department of Computer Science, Nashville, Tennessee, United States
|
2
|
Galli G, Sabadin F, Yassue RM, Galves C, Carvalho HF, Crossa J, Montesinos-López OA, Fritsche-Neto R. Automated Machine Learning: A Case Study of Genomic "Image-Based" Prediction in Maize Hybrids. FRONTIERS IN PLANT SCIENCE 2022; 13:845524. [PMID: 35321444 PMCID: PMC8936805 DOI: 10.3389/fpls.2022.845524] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 12/29/2021] [Accepted: 02/03/2022] [Indexed: 06/14/2023]
Abstract
Machine learning methods such as multilayer perceptrons (MLP) and Convolutional Neural Networks (CNN) have emerged as promising methods for genomic prediction (GP). In this context, we assess the performance of MLP and CNN on regression and classification tasks in a case study with maize hybrids. The genomic information was provided to the MLP as a relationship matrix and to the CNN as "genomic images." In the regression task, the machine learning models were compared along with GBLUP. Under the classification task, MLP and CNN were compared. In this case, the traits (plant height and grain yield) were discretized in such a way to create balanced (moderate selection intensity) and unbalanced (extreme selection intensity) datasets for further evaluations. An automatic hyperparameter search for MLP and CNN was performed, and the best models were reported. For both task types, several metrics were calculated under a validation scheme to assess the effect of the prediction method and other variables. Overall, MLP and CNN presented competitive results to GBLUP. Also, we bring new insights on automated machine learning for genomic prediction and its implications to plant breeding.
Affiliation(s)
- Giovanni Galli
  - Department of Genetics, Luiz de Queiroz College of Agriculture, University of São Paulo, Piracicaba, Brazil
- Felipe Sabadin
  - School of Plant and Environmental Sciences, Virginia Tech, Blacksburg, VA, United States
- Rafael Massahiro Yassue
  - Department of Genetics, Luiz de Queiroz College of Agriculture, University of São Paulo, Piracicaba, Brazil
- Cassia Galves
  - Department of Food Engineering, University of Saskatchewan, Saskatoon, SK, Canada
- Jose Crossa
  - International Maize and Wheat Improvement Center (CIMMYT), Texcoco, Mexico
- Roberto Fritsche-Neto
  - Department of Genetics, Luiz de Queiroz College of Agriculture, University of São Paulo, Piracicaba, Brazil
  - International Rice Research Institute (IRRI), Los Baños, Philippines
|
3
|
Abstract
Background and Objective:
Colorectal cancer (CRC) is a common malignant tumor of the digestive system and is associated with high morbidity and mortality. Early prediction of colorectal adenoma (CRA), a precancerous disease in most CRC patients, provides an opportunity to plan appropriate prevention, early diagnosis, and treatment. This study aimed to develop a machine learning model to predict CRA that could assist physicians in classifying high-risk patients, making informed choices, and preventing CRC.
Methods:
Patients who had undergone a colonoscopy at the Sixth People Hospital of Shanghai, China, from July 2018 to November 2018 were asked to fill out a questionnaire. A classification model based on the gradient boosting decision tree (GBDT) was developed to predict CRA. This model was compared with three other models: random forest (RF), support vector machine (SVM), and logistic regression (LR). The area under the receiver operating characteristic curve (AUC) was used to evaluate the performance of the models.
Results:
Among the 245 included patients, 65 had CRA. The AUCs of GBDT, RF, SVM, and LR under 10-fold cross-validation were 0.8131, 0.74, 0.769, and 0.763, respectively. An online prediction service, the CRA Inference System, was also built to make the proposed solution available for patients with CRA.
Conclusion:
Four classification models for CRA prediction were developed and compared, and the GBDT model showed the highest performance. Implementing a GBDT model for screening can reduce the cost in time and money and help physicians identify high-risk groups for primary prevention.
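As a reading aid for the AUC values reported above, the sketch below computes the AUC directly from its rank interpretation: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. It is an illustration only and not the authors' pipeline; the function name `roc_auc` and the toy labels and scores are hypothetical.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))


# Hypothetical data: 1 = adenoma present, 0 = absent; scores are model outputs.
toy_labels = [1, 0, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.3, 0.2]
auc = roc_auc(toy_labels, toy_scores)
print(f"AUC = {auc:.3f}")
```

Under k-fold cross-validation, a score like this would be computed on each held-out fold and then summarized; how the folds were aggregated in the study is not stated in the abstract.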
Affiliation(s)
- Junbo Gao
  - Information Engineering College, Shanghai Maritime University, Shanghai, China
- Lifeng Zhang
  - Information Engineering College, Shanghai Maritime University, Shanghai, China
- Gaiqing Yu
  - Information Engineering College, Shanghai Maritime University, Shanghai, China
- Guoqiang Qu
  - Department of Gastroenterology, Eastern Hospital, Shanghai Sixth People Hospital, Shanghai, China
- Yanfeng Li
  - Information Engineering College, Shanghai Maritime University, Shanghai, China
|
4
|
|
5
|
Lopez-Garcia P, Masegosa AD, Osaba E, Onieva E, Perallos A. Ensemble classification for imbalanced data based on feature space partitioning and hybrid metaheuristics. APPL INTELL 2019. [DOI: 10.1007/s10489-019-01423-6] [Citation(s) in RCA: 27] [Impact Index Per Article: 4.5] [Indexed: 11/30/2022]
|
6
|
Park E, Chang HJ, Nam HS. A Bayesian Network Model for Predicting Post-stroke Outcomes With Available Risk Factors. Front Neurol 2018; 9:699. [PMID: 30245663 PMCID: PMC6137617 DOI: 10.3389/fneur.2018.00699] [Citation(s) in RCA: 32] [Impact Index Per Article: 4.6] [Received: 05/30/2018] [Accepted: 08/02/2018] [Indexed: 11/13/2022]
Abstract
Bayesian network is an increasingly popular method in modeling uncertain and complex problems, because its interpretability is often more useful than plain prediction. To satisfy the core requirement in medical research to obtain interpretable prediction with high accuracy, we constructed an inference engine for post-stroke outcomes based on Bayesian network classifiers. The prediction system that was trained on data of 3,605 patients with acute stroke forecasts the functional independence at 3 months and the mortality 1 year after stroke. Feature selection methods were applied to eliminate less relevant and redundant features from 76 risk variables. The Bayesian network classifiers were trained with a hill-climbing searching for the qualified network structure and parameters measured by maximum description length. We evaluated and optimized the proposed system to increase the area under the receiver operating characteristic curve (AUC) while ensuring acceptable sensitivity for the class-imbalanced data. The performance evaluation demonstrated that the Bayesian network with selected features by wrapper-type feature selection can predict 3-month functional independence with an AUC of 0.889 using only 19 risk variables and 1-year mortality with an AUC of 0.893 using 24 variables. The Bayesian network with 50 features filtered by information gain can predict 3-month functional independence with an AUC of 0.875 and 1-year mortality with an AUC of 0.895. We also built an online prediction service, Yonsei Stroke Outcome Inference System, to substantialize the proposed solution for patients with stroke.
Affiliation(s)
- Eunjeong Park
  - Cardiovascular Research Institute, College of Medicine, Yonsei University, Seoul, South Korea
- Hyuk-Jae Chang
  - Department of Cardiology, College of Medicine, Yonsei University, Seoul, South Korea
- Hyo Suk Nam
  - Department of Neurology, College of Medicine, Yonsei University, Seoul, South Korea
|
7
|
Multi-class and feature selection extensions of Roughly Balanced Bagging for imbalanced data. J Intell Inf Syst 2017. [DOI: 10.1007/s10844-017-0446-7] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.9] [Indexed: 10/20/2022]
|
8
|
Valero-Mas JJ, Calvo-Zaragoza J, Rico-Juan JR, Iñesta JM. A Study of Prototype Selection Algorithms for Nearest Neighbour in Class-Imbalanced Problems. PATTERN RECOGNITION AND IMAGE ANALYSIS 2017. [DOI: 10.1007/978-3-319-58838-4_37] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/28/2022]
|
9
|
Komori O, Eguchi S, Ikeda S, Okamura H, Ichinokawa M, Nakayama S. An asymmetric logistic regression model for ecological data. Methods Ecol Evol 2015. [DOI: 10.1111/2041-210x.12473] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.3] [Indexed: 11/29/2022]
Affiliation(s)
- Osamu Komori
  - Department of Electrical and Electronics Engineering, University of Fukui, Bunkyo 3‐9‐1, Fukui 910‐8507, Japan
- Shinto Eguchi
  - The Institute of Statistical Mathematics, Midori‐cho 10‐3, Tachikawa, Tokyo 190‐8562, Japan
- Shiro Ikeda
  - The Institute of Statistical Mathematics, Midori‐cho 10‐3, Tachikawa, Tokyo 190‐8562, Japan
- Hiroshi Okamura
  - National Research Institute of Fisheries Science, Fisheries Research Agency, 2‐12‐4 Fukuura, Kanazawa, Yokohama, Kanagawa 236‐8648, Japan
- Momoko Ichinokawa
  - National Research Institute of Fisheries Science, Fisheries Research Agency, 2‐12‐4 Fukuura, Kanazawa, Yokohama, Kanagawa 236‐8648, Japan
- Shinichiro Nakayama
  - National Research Institute of Fisheries Science, Fisheries Research Agency, 2‐12‐4 Fukuura, Kanazawa, Yokohama, Kanagawa 236‐8648, Japan
|
10
|
Ornella L, Pérez P, Tapia E, González-Camacho JM, Burgueño J, Zhang X, Singh S, Vicente FS, Bonnett D, Dreisigacker S, Singh R, Long N, Crossa J. Genomic-enabled prediction with classification algorithms. Heredity (Edinb) 2014; 112:616-26. [PMID: 24424163 PMCID: PMC4023444 DOI: 10.1038/hdy.2013.144] [Citation(s) in RCA: 32] [Impact Index Per Article: 2.9] [Received: 07/08/2013] [Revised: 12/05/2013] [Accepted: 12/09/2013] [Indexed: 11/09/2022]
Abstract
Pearson's correlation coefficient (ρ) is the most commonly reported metric of the success of prediction in genomic selection (GS). However, in real breeding ρ may not be very useful for assessing the quality of the regression in the tails of the distribution, where individuals are chosen for selection. This research used 14 maize and 16 wheat data sets with different trait–environment combinations. Six different models were evaluated by means of a cross-validation scheme (50 random partitions each, with 90% of the individuals in the training set and 10% in the testing set). The predictive accuracy of these algorithms for selecting individuals belonging to the best α=10, 15, 20, 25, 30, 35, 40% of the distribution was estimated using Cohen's kappa coefficient (κ) and an ad hoc measure, which we call relative efficiency (RE), which indicates the expected genetic gain due to selection when individuals are selected based on GS exclusively. We put special emphasis on the analysis for α=15%, because it is a percentile commonly used in plant breeding programmes (for example, at CIMMYT). We also used ρ as a criterion for overall success. The algorithms used were: Bayesian LASSO (BL), Ridge Regression (RR), Reproducing Kernel Hilbert Spaces (RHKS), Random Forest Regression (RFR), and Support Vector Regression (SVR) with linear (lin) and Gaussian kernels (rbf). The performance of regression methods for selecting the best individuals was compared with that of three supervised classification algorithms: Random Forest Classification (RFC) and Support Vector Classification (SVC) with linear (lin) and Gaussian (rbf) kernels. Classification methods were evaluated using the same cross-validation scheme but with the response vector of the original training sets dichotomised using a given threshold. 
For α=15%, SVC-lin presented the highest κ coefficients in 13 of the 14 maize data sets, with best values ranging from 0.131 to 0.722 (statistically significant in 9 data sets) and the best RE in the same 13 data sets, with values ranging from 0.393 to 0.948 (statistically significant in 12 data sets). RR produced the best mean for both κ and RE in one data set (0.148 and 0.381, respectively). Regarding the wheat data sets, SVC-lin presented the best κ in 12 of the 16 data sets, with outcomes ranging from 0.280 to 0.580 (statistically significant in 4 data sets) and the best RE in 9 data sets ranging from 0.484 to 0.821 (statistically significant in 5 data sets). SVC-rbf (0.235), RR (0.265) and RHKS (0.422) gave the best κ in one data set each, while RHKS and BL tied for the last one (0.234). Finally, BL presented the best RE in two data sets (0.738 and 0.750), RFR (0.636) and SVC-rbf (0.617) in one and RHKS in the remaining three (0.502, 0.458 and 0.586). The difference between the performance of SVC-lin and that of the rest of the models was not so pronounced at higher percentiles of the distribution. The behaviour of regression and classification algorithms varied markedly when selection was done at different thresholds, that is, κ and RE for each algorithm depended strongly on the selection percentile. Based on the results, we propose classification method as a promising alternative for GS in plant breeding.
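The use of Cohen's kappa (κ) for top-α selection described in this abstract can be sketched as follows: observed and predicted trait values are each dichotomised into "selected" (top α fraction) and "not selected", and κ measures agreement between the two label vectors beyond chance. This is a minimal illustration under stated assumptions, not the authors' code; the helper names, the rounding rule for the number of selected individuals, and the toy values are hypothetical, and α = 0.2 is used only to keep the example small.

```python
def top_fraction_labels(values, alpha):
    """Label the top `alpha` fraction of individuals 1 and the rest 0."""
    # Assumed rule: select round(alpha * n) individuals, at least one.
    n_selected = max(1, round(alpha * len(values)))
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    selected = set(order[:n_selected])
    return [1 if i in selected else 0 for i in range(len(values))]


def cohen_kappa(a, b):
    """Cohen's kappa for two binary label vectors of equal length."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    pa1, pb1 = sum(a) / n, sum(b) / n
    # Chance agreement from the marginal frequencies of each class.
    expected = pa1 * pb1 + (1 - pa1) * (1 - pb1)
    if expected == 1.0:
        return 1.0  # both vectors are constant and identical
    return (observed - expected) / (1 - expected)


# Hypothetical observed and predicted trait values for 10 individuals.
observed_values = [5.1, 6.3, 4.8, 7.9, 5.5, 6.0, 8.2, 4.1, 5.9, 6.7]
predicted_values = [5.0, 6.9, 5.2, 7.1, 5.4, 6.1, 6.5, 4.5, 5.8, 6.6]
true_labels = top_fraction_labels(observed_values, alpha=0.2)
pred_labels = top_fraction_labels(predicted_values, alpha=0.2)
kappa = cohen_kappa(true_labels, pred_labels)
print(f"kappa = {kappa:.3f}")
```

In this toy case one of the two truly best individuals is also among the two predicted best, which gives moderate agreement after correcting for chance.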
Affiliation(s)
- L Ornella
  - French-Argentine International Center for Information and Systems Sciences (CIFASIS), Rosario, Argentina
- P Pérez
  - Colegio de Postgraduados, Montecillo, Edo. de México, México DF, Mexico
- E Tapia
  - French-Argentine International Center for Information and Systems Sciences (CIFASIS), Rosario, Argentina
- J Burgueño
  - Biometrics and Statistics Unit, International Maize and Wheat Improvement Center (CIMMYT), México DF, Mexico
- X Zhang
  - Biometrics and Statistics Unit, International Maize and Wheat Improvement Center (CIMMYT), México DF, Mexico
- S Singh
  - Biometrics and Statistics Unit, International Maize and Wheat Improvement Center (CIMMYT), México DF, Mexico
- F S Vicente
  - Biometrics and Statistics Unit, International Maize and Wheat Improvement Center (CIMMYT), México DF, Mexico
- D Bonnett
  - Biometrics and Statistics Unit, International Maize and Wheat Improvement Center (CIMMYT), México DF, Mexico
- S Dreisigacker
  - Biometrics and Statistics Unit, International Maize and Wheat Improvement Center (CIMMYT), México DF, Mexico
- R Singh
  - Biometrics and Statistics Unit, International Maize and Wheat Improvement Center (CIMMYT), México DF, Mexico
- N Long
  - Center for Human Genome Variation, Duke University School of Medicine, Durham, NC, USA
- J Crossa
  - Biometrics and Statistics Unit, International Maize and Wheat Improvement Center (CIMMYT), México DF, Mexico
|
11
|
Brain tumor detection and segmentation in a CRF (conditional random fields) framework with pixel-pairwise affinity and superpixel-level features. Int J Comput Assist Radiol Surg 2013; 9:241-53. [DOI: 10.1007/s11548-013-0922-7] [Citation(s) in RCA: 50] [Impact Index Per Article: 4.2] [Received: 01/28/2013] [Accepted: 07/03/2013] [Indexed: 01/10/2023]
|
12
|
|
13
|
A hierarchical genetic fuzzy system based on genetic programming for addressing classification with highly imbalanced and borderline data-sets. Knowl Based Syst 2013. [DOI: 10.1016/j.knosys.2012.08.025] [Citation(s) in RCA: 60] [Impact Index Per Article: 5.0] [Indexed: 11/20/2022]
|
14
|
Millán-Giraldo M, García V, Sánchez JS. Instance Selection Methods and Resampling Techniques for Dissimilarity Representation with Imbalanced Data Sets. ACTA ACUST UNITED AC 2013. [DOI: 10.1007/978-3-642-36530-0_12] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 02/20/2023]
|
15
|
The quest for the optimal class distribution: an approach for enhancing the effectiveness of learning via resampling methods for imbalanced data sets. PROGRESS IN ARTIFICIAL INTELLIGENCE 2012. [DOI: 10.1007/s13748-012-0034-6] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.2] [Indexed: 11/26/2022]
|
16
|
Identification of Different Types of Minority Class Examples in Imbalanced Data. LECTURE NOTES IN COMPUTER SCIENCE 2012. [DOI: 10.1007/978-3-642-28931-6_14] [Citation(s) in RCA: 36] [Impact Index Per Article: 2.8] [Indexed: 12/29/2022]
|