1
|
Gao Y, Cui Y. Optimizing clinico-genomic disease prediction across ancestries: a machine learning strategy with Pareto improvement. Genome Med 2024; 16:76. [PMID: 38835075 PMCID: PMC11149372 DOI: 10.1186/s13073-024-01345-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/08/2024] [Accepted: 05/17/2024] [Indexed: 06/06/2024] Open
Abstract
BACKGROUND Accurate prediction of an individual's predisposition to diseases is vital for preventive medicine and early intervention. Various statistical and machine learning models have been developed for disease prediction using clinico-genomic data. However, the accuracy of clinico-genomic prediction of diseases may vary significantly across ancestry groups due to their unequal representation in clinical genomic datasets. METHODS We introduced a deep transfer learning approach to improve the performance of clinico-genomic prediction models for data-disadvantaged ancestry groups. We conducted machine learning experiments on multi-ancestral genomic datasets of lung cancer, prostate cancer, and Alzheimer's disease, as well as on synthetic datasets with built-in data inequality and distribution shifts across ancestry groups. RESULTS Deep transfer learning significantly improved disease prediction accuracy for data-disadvantaged populations in our multi-ancestral machine learning experiments. In contrast, transfer learning based on linear frameworks did not achieve comparable improvements for these data-disadvantaged populations. CONCLUSIONS This study shows that deep transfer learning can enhance fairness in multi-ancestral machine learning by improving prediction accuracy for data-disadvantaged populations without compromising prediction accuracy for other populations, thus providing a Pareto improvement towards equitable clinico-genomic prediction of diseases.
Collapse
Affiliation(s)
- Yan Gao
- Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, 38163, USA
- Center for Integrative and Translational Genomics, University of Tennessee Health Science Center, Memphis, TN, 38163, USA
| | - Yan Cui
- Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, 38163, USA.
- Center for Integrative and Translational Genomics, University of Tennessee Health Science Center, Memphis, TN, 38163, USA.
- Center for Cancer Research, University of Tennessee Health Science Center, Memphis, TN, 38163, USA.
| |
Collapse
|
3
|
Park JH, Choi JY, Lee J, Kyung M. Bayesian Approach to Multivariate Component-Based Logistic Regression: Analyzing Correlated Multivariate Ordinal Data. MULTIVARIATE BEHAVIORAL RESEARCH 2022; 57:543-560. [PMID: 33523709 DOI: 10.1080/00273171.2021.1874260] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/12/2023]
Abstract
Applications of component-based models have gained much attention as a means of accompanying dimension reduction in the regression setting and have been successfully implemented to model a univariate outcome in the behavioral and social sciences. Despite the prevalence of correlated ordinal outcome data in the fields, however, most of the extant component-based models have been extended to address the multivariate ordinal issue with a simplified but unrealistic assumption of independence, which may lead to biased statistical inferences. Thus, we propose a Bayesian methodology for a component-based model that accounts for unstructured residual covariances, while regressing multivariate ordinal outcomes on pre-defined sets of predictors. The proposed Bayesian multivariate ordinal logistic model re-expresses ordinal outcomes of interest with a set of latent continuous variables based on an approximate multivariate t-distribution. This contributes not only to developing an efficient Gibbs sampler, a Markov Chain Monte Carlo algorithm, but also to facilitating the interpretation of regression coefficients as log-transformed odds ratio. The empirical utility of the proposed method is demonstrated through analyzing a subset of data, extracted from the 2009 to 2010 Health Behavior in School-Aged Children study that investigates risk factors of four different forms of bullying perpetration and victimization: physical, social, racial, and cyber.
Collapse
Affiliation(s)
| | | | - Jungup Lee
- Department of Social Work, National University of Singapore
| | - Minjung Kyung
- Department of Statistics, Duksung Women's University
| |
Collapse
|
5
|
Morais-Rodrigues F, Silv Erio-Machado R, Kato RB, Rodrigues DLN, Valdez-Baez J, Fonseca V, San EJ, Gomes LGR, Dos Santos RG, Vinicius Canário Viana M, da Cruz Ferraz Dutra J, Teixeira Dornelles Parise M, Parise D, Campos FF, de Souza SJ, Ortega JM, Barh D, Ghosh P, Azevedo VAC, Dos Santos MA. Analysis of the microarray gene expression for breast cancer progression after the application modified logistic regression. Gene 2019; 726:144168. [PMID: 31759986 DOI: 10.1016/j.gene.2019.144168] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/05/2019] [Revised: 09/21/2019] [Accepted: 10/11/2019] [Indexed: 01/02/2023]
Abstract
Methods based around statistics and linear algebra have been increasingly used in attempts to address emerging questions in microarray literature. Microarray technology is a long-used tool in the global analysis of gene expression, allowing for the simultaneous investigation of hundreds or thousands of genes in a sample. It is characterized by a low sample size and a large feature number created a non-square matrix, and by the incomplete rank, that can generate countless more solution in classifiers. To avoid the problem of the 'curse of dimensionality' many authors have performed feature selection or reduced the size of data matrix. In this work, we introduce a new logistic regression-based model to classify breast cancer tumor samples based on microarray expression data, including all features of gene expression and without reducing the microarray data matrix. If the user still deems it necessary to perform feature reduction, it can be done after the application of the methodology, still maintaining a good classification. This methodology allowed the correct classification of breast cancer sample data sets from Gene Expression Omnibus (GEO) data series GSE65194, GSE20711, and GSE25055, which contain the microarray data of said breast cancer samples. Classification had a minimum performance of 80% (sensitivity and specificity), and explored all possible data combinations, including breast cancer subtypes. This methodology highlighted genes not yet studied in breast cancer, some of which have been observed in Gene Regulatory Networks (GRNs). In this work we examine the patterns and features of a GRN composed of transcription factors (TFs) in MCF-7 breast cancer cell lines, providing valuable information regarding breast cancer. In particular, some genes whose αi ∗ associated parameter values revealed extreme positive and negative values, and, as such, can be identified as breast cancer prediction genes. We indicate that the PKN2, MKL1, MED23, CUL5 and GLI genes demonstrate a tumor suppressor profile, and that the MTR, ITGA2B, TELO2, MRPL9, MTTL1, WIPI1, KLHL20, PI4KB, FOLR1 and SHC1 genes demonstrate an oncogenic profile. We propose that these may serve as potential breast cancer prediction genes, and should be prioritized for further clinical studies on breast cancer. This new model allows for the assignment of values to the αi ∗ parameters associated with gene expression. It was noted that some αi ∗ parameters are associated with genes previously described as breast cancer biomarkers, as well as other genes not yet studied in relation to this disease.
Collapse
Affiliation(s)
- Francielly Morais-Rodrigues
- Institute of Biological Sciences, Federal University of Minas Gerais, Brazil. Av. Antônio Carlos, 6627, Belo Horizonte, MG 31270-901, Brazil.
| | - Rita Silv Erio-Machado
- Institute of Biological Sciences, Federal University of Minas Gerais, Brazil. Av. Antônio Carlos, 6627, Belo Horizonte, MG 31270-901, Brazil
| | - Rodrigo Bentes Kato
- Institute of Biological Sciences, Federal University of Minas Gerais, Brazil. Av. Antônio Carlos, 6627, Belo Horizonte, MG 31270-901, Brazil
| | - Diego Lucas Neres Rodrigues
- Institute of Biological Sciences, Federal University of Minas Gerais, Brazil. Av. Antônio Carlos, 6627, Belo Horizonte, MG 31270-901, Brazil
| | - Juan Valdez-Baez
- Institute of Biological Sciences, Federal University of Minas Gerais, Brazil. Av. Antônio Carlos, 6627, Belo Horizonte, MG 31270-901, Brazil
| | - Vagner Fonseca
- Institute of Biological Sciences, Federal University of Minas Gerais, Brazil. Av. Antônio Carlos, 6627, Belo Horizonte, MG 31270-901, Brazil; KwaZulu-Natal Research Innovation and Sequencing Platform (KRISP), College of Health Sciences, University of KwaZulu-Natal, Durban 4001, South Africa
| | - Emmanuel James San
- KwaZulu-Natal Research Innovation and Sequencing Platform (KRISP), College of Health Sciences, University of KwaZulu-Natal, Durban 4001, South Africa
| | - Lucas Gabriel Rodrigues Gomes
- Institute of Biological Sciences, Federal University of Minas Gerais, Brazil. Av. Antônio Carlos, 6627, Belo Horizonte, MG 31270-901, Brazil
| | - Roselane Gonçalves Dos Santos
- Institute of Biological Sciences, Federal University of Minas Gerais, Brazil. Av. Antônio Carlos, 6627, Belo Horizonte, MG 31270-901, Brazil
| | - Marcus Vinicius Canário Viana
- Institute of Biological Sciences, Federal University of Minas Gerais, Brazil. Av. Antônio Carlos, 6627, Belo Horizonte, MG 31270-901, Brazil; Federal University of Pará, UFPA, Brazil
| | - Joyce da Cruz Ferraz Dutra
- Institute of Biological Sciences, Federal University of Minas Gerais, Brazil. Av. Antônio Carlos, 6627, Belo Horizonte, MG 31270-901, Brazil
| | - Mariana Teixeira Dornelles Parise
- Institute of Biological Sciences, Federal University of Minas Gerais, Brazil. Av. Antônio Carlos, 6627, Belo Horizonte, MG 31270-901, Brazil
| | - Doglas Parise
- Institute of Biological Sciences, Federal University of Minas Gerais, Brazil. Av. Antônio Carlos, 6627, Belo Horizonte, MG 31270-901, Brazil
| | - Frederico F Campos
- Department of Computer Science, Federal University of Minas Gerais, Brazil Av Antônio Carlos, 6627, Belo Horizonte, MG 31270-901, Brazil
| | | | - José Miguel Ortega
- Institute of Biological Sciences, Federal University of Minas Gerais, Brazil. Av. Antônio Carlos, 6627, Belo Horizonte, MG 31270-901, Brazil
| | - Debmalya Barh
- Centre for Genomics and Applied Gene Technology, Institute of Integrative Omics and Applied Biotechnology (IIOAB), Nonakuri, Purba Medinipur, West Bengal 721172, India
| | - Preetam Ghosh
- Department of Computer Science, Virginia Commonwealth University, Richmond, VA 23284, USA
| | - Vasco A C Azevedo
- Institute of Biological Sciences, Federal University of Minas Gerais, Brazil. Av. Antônio Carlos, 6627, Belo Horizonte, MG 31270-901, Brazil
| | - Marcos A Dos Santos
- Department of Computer Science, Federal University of Minas Gerais, Brazil Av Antônio Carlos, 6627, Belo Horizonte, MG 31270-901, Brazil
| |
Collapse
|
6
|
López de Maturana E, Alonso L, Alarcón P, Martín-Antoniano IA, Pineda S, Piorno L, Calle ML, Malats N. Challenges in the Integration of Omics and Non-Omics Data. Genes (Basel) 2019; 10:genes10030238. [PMID: 30897838 PMCID: PMC6471713 DOI: 10.3390/genes10030238] [Citation(s) in RCA: 60] [Impact Index Per Article: 10.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/02/2019] [Revised: 03/05/2019] [Accepted: 03/14/2019] [Indexed: 11/16/2022] Open
Abstract
Omics data integration is already a reality. However, few omics-based algorithms show enough predictive ability to be implemented into clinics or public health domains. Clinical/epidemiological data tend to explain most of the variation of health-related traits, and its joint modeling with omics data is crucial to increase the algorithm’s predictive ability. Only a small number of published studies performed a “real” integration of omics and non-omics (OnO) data, mainly to predict cancer outcomes. Challenges in OnO data integration regard the nature and heterogeneity of non-omics data, the possibility of integrating large-scale non-omics data with high-throughput omics data, the relationship between OnO data (i.e., ascertainment bias), the presence of interactions, the fairness of the models, and the presence of subphenotypes. These challenges demand the development and application of new analysis strategies to integrate OnO data. In this contribution we discuss different attempts of OnO data integration in clinical and epidemiological studies. Most of the reviewed papers considered only one type of omics data set, mainly RNA expression data. All selected papers incorporated non-omics data in a low-dimensionality fashion. The integrative strategies used in the identified papers adopted three modeling methods: Independent, conditional, and joint modeling. This review presents, discusses, and proposes integrative analytical strategies towards OnO data integration.
Collapse
Affiliation(s)
- Evangelina López de Maturana
- Genetic and Molecular Epidemiology Group, Spanish National Cancer Research Centre (CNIO), and CIBERONC, Melchor Fernández Almagro 3, 28029 Madrid, Spain.
| | - Lola Alonso
- Genetic and Molecular Epidemiology Group, Spanish National Cancer Research Centre (CNIO), and CIBERONC, Melchor Fernández Almagro 3, 28029 Madrid, Spain.
| | - Pablo Alarcón
- Genetic and Molecular Epidemiology Group, Spanish National Cancer Research Centre (CNIO), and CIBERONC, Melchor Fernández Almagro 3, 28029 Madrid, Spain.
| | - Isabel Adoración Martín-Antoniano
- Genetic and Molecular Epidemiology Group, Spanish National Cancer Research Centre (CNIO), and CIBERONC, Melchor Fernández Almagro 3, 28029 Madrid, Spain.
| | - Silvia Pineda
- Genetic and Molecular Epidemiology Group, Spanish National Cancer Research Centre (CNIO), and CIBERONC, Melchor Fernández Almagro 3, 28029 Madrid, Spain.
| | - Lucas Piorno
- Genetic and Molecular Epidemiology Group, Spanish National Cancer Research Centre (CNIO), and CIBERONC, Melchor Fernández Almagro 3, 28029 Madrid, Spain.
| | - M Luz Calle
- Biosciences Department, University of Vic-Central University of Catalonia, Carrer de la Laura 13, 08570 Vic, Spain.
| | - Núria Malats
- Genetic and Molecular Epidemiology Group, Spanish National Cancer Research Centre (CNIO), and CIBERONC, Melchor Fernández Almagro 3, 28029 Madrid, Spain.
| |
Collapse
|