1
Xing M, Zhang Y, Yu H, Yang Z, Li X, Li Q, Zhao Y, Zhao Z, Luo Y. Predict DLBCL patients' recurrence within two years with Gaussian mixture model cluster oversampling and multi-kernel learning. Comput Methods Programs Biomed 2022; 226:107103. [PMID: 36088813 DOI: 10.1016/j.cmpb.2022.107103] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/14/2021] [Revised: 08/05/2022] [Accepted: 08/30/2022] [Indexed: 06/15/2023]
Abstract
BACKGROUND AND OBJECTIVE: Diffuse large B-cell lymphoma (DLBCL) is a common form of non-Hodgkin's lymphoma in adults. Relapse occurs mainly within two years of diagnosis and carries a poor prognosis; relapse after two years is less frequent and has a better prognosis. In this work, we constructed a model to predict relapse within two years in DLBCL patients, with the aim of providing a reference for clinicians implementing individualized treatment. METHOD: We propose a secondary-level class imbalance method based on Gaussian mixture model (GMM) clustering and resampling to balance the data. We then use a multi-kernel support vector machine (SVM) to characterize the heterogeneous clinical data. Finally, we combine the two to identify patients who relapse within two years. RESULTS: Among all the class imbalance methods in this work, Inverse Weighted-GMM+SMOTEENN performs best. Compared with NO-GMM (using SMOTEENN directly, without the GMM clustering process), its area under the ROC curve (AUC) increases by 8.75%, and its expected calibration error (ECE) and Brier score decrease by 2.07% and 3.09%, respectively. Among the four classification algorithms in this work, multiple kernel learning (MKL) has the lowest Brier score and ECE and the highest AUC, accuracy, recall, precision, and F1, giving it the best discrimination and calibration. CONCLUSION: Our Inverse Weighted-GMM+SMOTEENN+MKL (GMM-SENN-MKL) method handles class imbalance and heterogeneous clinical data well and can be used to predict recurrence in DLBCL patients.
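The two calibration metrics named in the results, the Brier score and ECE, can be illustrated with a minimal sketch. It assumes binary labels (1 = relapse within two years), predicted probabilities of the positive class, and equal-width probability bins; the function names, the bin count, and the binning scheme are illustrative assumptions, since the abstract does not state how the authors computed these values.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Weighted mean gap between observed positive rate and mean predicted
    probability, over equal-width probability bins (assumed scheme)."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, y_prob):
        # A probability of exactly 1.0 is placed in the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((y, p))

    n = len(y_true)
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        observed_rate = sum(y for y, _ in members) / len(members)
        mean_prob = sum(p for _, p in members) / len(members)
        ece += (len(members) / n) * abs(observed_rate - mean_prob)
    return ece


if __name__ == "__main__":
    # Toy labels and probabilities, invented purely for illustration.
    labels = [0, 0, 1, 1]
    probs = [0.1, 0.4, 0.6, 0.9]
    print("Brier:", brier_score(labels, probs))
    print("ECE:", expected_calibration_error(labels, probs))
```

Lower values of both metrics indicate better calibration, which is the sense in which the abstract reports MKL as having the best calibration.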
Affiliation(s)
- Meng Xing
- Department of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan, China; Shanxi Provincial Key Laboratory of Major Diseases Risk Assessment, Taiyuan, China
- Yanbo Zhang
- Department of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan, China; Shanxi Provincial Key Laboratory of Major Diseases Risk Assessment, Taiyuan, China
- Hongmei Yu
- Department of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan, China; Shanxi Provincial Key Laboratory of Major Diseases Risk Assessment, Taiyuan, China
- Zhenhuan Yang
- Department of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan, China; Shanxi Provincial Key Laboratory of Major Diseases Risk Assessment, Taiyuan, China
- Xueling Li
- Department of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan, China; Shanxi Provincial Key Laboratory of Major Diseases Risk Assessment, Taiyuan, China
- Qiong Li
- Department of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan, China; Shanxi Provincial Key Laboratory of Major Diseases Risk Assessment, Taiyuan, China
- Yanlin Zhao
- Department of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan, China; Shanxi Provincial Key Laboratory of Major Diseases Risk Assessment, Taiyuan, China
- Zhiqiang Zhao
- Department of Hematology, Shanxi Cancer Hospital, Taiyuan, China
- Yanhong Luo
- Department of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan, China; Shanxi Provincial Key Laboratory of Major Diseases Risk Assessment, Taiyuan, China
2
Zhao N, Zhuo M, Tian K, Gong X. Protein-protein interaction and non-interaction predictions using gene sequence natural vector. Commun Biol 2022; 5:652. [PMID: 35780196 DOI: 10.1038/s42003-022-03617-0] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 02/04/2022] [Accepted: 06/21/2022] [Indexed: 12/02/2022] Open
Abstract
Predicting protein-protein interactions and predicting non-interactions are two important and distinct aspects of multi-body structure prediction, and both provide vital information about protein function. Computational methods have recently been developed to complement experimental methods, but they still cannot effectively detect real non-interacting protein pairs. We propose a gene sequence-based method, named NVDT (Natural Vector combine with Dinucleotide and Triplet nucleotide), for predicting both interaction and non-interaction. For protein-protein non-interactions (PPNIs), the proposed method obtained accuracies of 86.23% for Homo sapiens and 85.34% for Mus musculus, and it performed well on three types of non-interaction networks. For protein-protein interactions (PPIs), it obtained accuracies of 99.20%, 94.94%, 98.56%, 95.41%, and 94.83% for Saccharomyces cerevisiae, Drosophila melanogaster, Helicobacter pylori, Homo sapiens, and Mus musculus, respectively. Furthermore, NVDT outperformed established sequence-based methods and achieved high prediction performance on cross-species interactions. NVDT is expected to be an effective approach for predicting PPIs and PPNIs. Protein-protein non-interactions and interactions are distinguished and predicted from gene sequence using single-nucleotide and contiguous-nucleotide features combined with machine learning models.
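As a rough illustration of the single-nucleotide part of such a feature, the sketch below computes the classical 12-dimensional natural vector of a nucleotide sequence: for each of A, C, G, and T, the count, the mean position, and the normalized second central moment of positions. This follows the standard natural-vector definition from the alignment-free sequence comparison literature and is an assumption here; the abstract does not specify how NVDT builds its vector or how the dinucleotide and triplet components are combined, so those are left out. The function name and the example sequence are hypothetical.

```python
def natural_vector(seq, alphabet="ACGT"):
    """Return [counts..., mean positions..., second moments...] for `seq`.

    Positions are 1-indexed. Characters outside `alphabet` are skipped
    but still count toward the sequence length n (an assumption).
    """
    seq = seq.upper()
    n = len(seq)
    positions = {k: [] for k in alphabet}
    for i, ch in enumerate(seq, start=1):
        if ch in positions:
            positions[ch].append(i)

    counts, means, moments = [], [], []
    for k in alphabet:
        pos = positions[k]
        n_k = len(pos)
        if n_k == 0:
            # Nucleotide absent: use zeros for all three descriptors.
            counts.append(0)
            means.append(0.0)
            moments.append(0.0)
            continue
        mu_k = sum(pos) / n_k
        # Second central moment, normalized by n_k * n.
        d2_k = sum((p - mu_k) ** 2 for p in pos) / (n_k * n)
        counts.append(n_k)
        means.append(mu_k)
        moments.append(d2_k)
    return counts + means + moments


if __name__ == "__main__":
    # Toy sequence, invented purely for illustration.
    print(natural_vector("ACGTA"))
```

Fixed-length vectors of this kind allow sequences of different lengths to be passed to a standard machine learning classifier.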