1
Ghislat G, Hernandez-Hernandez S, Piyawajanusorn C, Ballester PJ. Data-centric challenges with the application and adoption of artificial intelligence for drug discovery. Expert Opin Drug Discov 2024; 19:1297-1307. [PMID: 39316009 DOI: 10.1080/17460441.2024.2403639] [Received: 07/09/2024] [Accepted: 09/09/2024] [Indexed: 09/25/2024]
Abstract
INTRODUCTION Artificial intelligence (AI) is exhibiting tremendous potential to reduce the massive costs and long timescales of drug discovery. There are, however, important challenges currently limiting the impact and scope of AI models. AREAS COVERED In this perspective, the authors discuss a range of data issues (bias, inconsistency, skewness, irrelevance, small size, high dimensionality), how they challenge AI models, and which issue-specific mitigations have been effective. Next, they point out the challenges faced by uncertainty quantification techniques aimed at enhancing and trusting the predictions from these AI models. They also discuss how conceptual errors, unrealistic benchmarks and performance misestimation can confound the evaluation of models and thus their development. Lastly, the authors explain how human bias, whether from AI experts or drug discovery experts, constitutes another challenge that can be alleviated by gaining more prospective experience. EXPERT OPINION AI models are often developed to excel on retrospective benchmarks unlikely to anticipate their prospective performance. As a result, only a few of these models are ever reported to have prospective value (e.g. by discovering potent and innovative drug leads for a therapeutic target). The authors have discussed what can go wrong in practice with AI for drug discovery. The authors hope that this will help inform the decisions of editors, funders, investors, and researchers working in this area.
Affiliation(s)
- Ghita Ghislat
  - Department of Life Sciences, Imperial College London, London, UK
2
Gagliardi I, Campolo F, Borges de Souza P, Rossi L, Albertelli M, Grillo F, Caputi L, Mazza M, Faggiano A, Zatelli MC. Comparative Targeted Genome Profiling between Solid and Liquid Biopsies in Gastroenteropancreatic Neuroendocrine Neoplasms: A Proof-of-Concept Pilot Study. Neuroendocrinology 2024:1-12. [PMID: 39447548 DOI: 10.1159/000541346] [Received: 03/12/2024] [Accepted: 06/19/2024] [Indexed: 10/26/2024]
Abstract
INTRODUCTION Clinical presentation and genetic profile of gastroenteropancreatic neuroendocrine tumors (GEP-NETs) are highly variable, hampering their management. Sequencing of circulating tumor DNA from liquid biopsy (LB) has been proposed as a less invasive alternative to solid biopsy (SB). Our aim was to compare the mutational profile (MP) provided by LB with that deriving from SB in GEP-NETs. METHODS SB and LB were derived simultaneously from 6 GEP-NET patients. A comparative targeted next-generation sequencing (NGS) analysis was performed on DNA from SB and LB to evaluate the mutational status of 11 genes (MEN1, DAXX, ATRX, MUTYH, SETD2, DEPDC5, TSC2, ARID1A, CHEK2, MTOR, and PTEN). RESULTS Patients (M:F = 2:1; median age 64 years) included 3 with pancreatic and 3 with ileal NETs. NGS detected a median number of 55 variants/sample in SB and 66.5 variants/sample in LB specimens (mutational burden: 0.2-1.9 and 0.3-1.8 mut/Mb, respectively). Missense and nonsense mutations were prevalent in both, mainly represented by C>T transitions. ARID1A, MTOR, and ATRX were consistently mutated in SB, and ARID1A, TSC2, MEN1, PTEN, SETD2, and MUTYH were consistently mutated in LB. DAXX mutations were absent in LB. Seventeen recurrent mutations were shared between SB and LB; in particular, MTOR single-nucleotide variants c.G4731A and c.C2997T were shared by 5 out of 6 patients. Hierarchical clustering supported genetic similarity between SB and LB. CONCLUSIONS This pilot study explores the applicability of LB in GEP-NET MP evaluation. Further studies with larger cohorts are needed to validate LB and to define the clinical impact.
Affiliation(s)
- Irene Gagliardi
  - Section of Endocrinology and Internal Medicine, Department of Medical Sciences, University of Ferrara, Ferrara, Italy
- Federica Campolo
  - Department of Experimental Medicine, Sapienza University of Rome, Rome, Italy
- Lucrezia Rossi
  - Section of Endocrinology and Internal Medicine, Department of Medical Sciences, University of Ferrara, Ferrara, Italy
- Manuela Albertelli
  - Endocrinology, Department of Internal Medicine and Medical Specialties (DiMI), University of Genova, Genova, Italy
  - Endocrinology Unit, IRCCS Ospedale Policlinico San Martino, Genova, Italy
- Federica Grillo
  - Anatomic Pathology, Department of Surgical Sciences and Integrated Diagnostics (DISC), University of Genova, Genova, Italy
  - Anatomic Pathology, IRCCS Ospedale Policlinico San Martino, Genova, Italy
- Luigi Caputi
  - Freelancer - Independent Researcher, Naples, Italy
- Massimiliano Mazza
  - IRCCS Istituto Romagnolo per lo Studio dei Tumori (IRST) "Dino Amadori", Meldola, Italy
- Antongiulio Faggiano
  - Endocrinology Unit, Department of Clinical and Molecular Medicine, Sant'Andrea Hospital, Sapienza University of Rome, Rome, Italy
- Maria Chiara Zatelli
  - Section of Endocrinology and Internal Medicine, Department of Medical Sciences, University of Ferrara, Ferrara, Italy
3
Lee Y, Cappellato M, Di Camillo B. Machine learning-based feature selection to search stable microbial biomarkers: application to inflammatory bowel disease. Gigascience 2022; 12:giad083. [PMID: 37882604 PMCID: PMC10600917 DOI: 10.1093/gigascience/giad083] [Received: 06/21/2023] [Revised: 08/23/2023] [Accepted: 09/17/2023] [Indexed: 10/27/2023]
Abstract
BACKGROUND Biomarker discovery exploiting feature importance of machine learning has risen recently in the microbiome landscape with its high predictive performance in several disease states. To have a concrete selection among a high number of features, recursive feature elimination (RFE) has been widely used in the bioinformatics field. However, machine learning-based RFE has factors that decrease the stability of feature selection. In this article, we suggested methods to improve stability while sustaining performance. RESULTS We exploited the abundance matrices of the gut microbiome (283 taxa at species level and 220 at genus level) to classify between patients with inflammatory bowel disease (IBD) and healthy controls (1,569 samples). We found that applying an already published data transformation before RFE improves feature stability significantly. Moreover, we performed an in-depth evaluation of different variants of the data transformation and identified those that demonstrate better improvement in stability while not sacrificing classification performance. To ensure a robust comparison, we evaluated stability using various similarity metrics, distances, the common number of features, and the ability to filter out noise features. We were able to confirm that the mapping by the Bray-Curtis similarity matrix before RFE consistently improves the stability while maintaining good performance. The multilayer perceptron algorithm exhibited the highest performance among 8 different machine learning algorithms when a large number of features (a few hundred) were considered, based on the best performance across 100 bootstrapped internal test sets. Conversely, when utilizing only a limited number of biomarkers as a trade-off between optimal performance and method generalizability, the random forest algorithm demonstrated the best performance. Using the optimal pipeline we developed, we identified 14 biomarkers for IBD at the species level and analyzed their roles using Shapley additive explanations. CONCLUSION Taken together, our work not only showed how to improve biomarker discovery in the metataxonomic field without sacrificing classification performance but also provided useful insights for future comparative studies.
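As a rough illustration of two quantities this abstract mentions, the sketch below computes a Bray-Curtis similarity between two abundance vectors and a simple selection-stability score, taken here as the mean pairwise Jaccard index of the feature sets chosen on different bootstrap samples. This is only one of several possible stability measures and is not the authors' pipeline; the function names `bray_curtis_similarity`, `jaccard`, and `selection_stability` are hypothetical, and how the similarity matrix is applied before RFE is not specified in the abstract, so that step is left out.

```python
from itertools import combinations


def bray_curtis_similarity(u, v):
    """Return 1 minus the Bray-Curtis dissimilarity of two non-negative vectors."""
    numerator = sum(abs(a - b) for a, b in zip(u, v))
    denominator = sum(a + b for a, b in zip(u, v))
    # Two all-zero vectors are treated as identical (an assumption of this sketch).
    return 1.0 if denominator == 0 else 1.0 - numerator / denominator


def jaccard(a, b):
    """Jaccard index of two feature sets: |intersection| / |union|."""
    a, b = set(a), set(b)
    union = a | b
    return 1.0 if not union else len(a & b) / len(union)


def selection_stability(feature_sets):
    """Mean pairwise Jaccard index over feature sets (needs at least two sets)."""
    pairs = list(combinations(feature_sets, 2))
    if not pairs:
        raise ValueError("need at least two feature sets")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    # Hypothetical taxa selected on three bootstrap samples.
    runs = [{"taxon_a", "taxon_b"}, {"taxon_a", "taxon_b"}, {"taxon_b", "taxon_c"}]
    print(selection_stability(runs))
    print(bray_curtis_similarity([10, 0, 5], [8, 2, 5]))
```

A score near 1 means the same features are selected regardless of resampling; a score near 0 means the selection changes almost completely from one bootstrap sample to the next.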
Affiliation(s)
- Youngro Lee
  - Department of Electrical and Computer Engineering, Seoul National University, Seoul, 08826, Korea
  - Institute of Engineering Research at Seoul National University, Seoul, 08826, Korea
- Marco Cappellato
  - Department of Information Engineering, University of Padova, Padova, 35122, Italy
- Barbara Di Camillo
  - Department of Information Engineering, University of Padova, Padova, 35122, Italy
4
White BS, Khan SA, Mason MJ, Ammad-Ud-Din M, Potdar S, Malani D, Kuusanmäki H, Druker BJ, Heckman C, Kallioniemi O, Kurtz SE, Porkka K, Tognon CE, Tyner JW, Aittokallio T, Wennerberg K, Guinney J. Bayesian multi-source regression and monocyte-associated gene expression predict BCL-2 inhibitor resistance in acute myeloid leukemia. NPJ Precis Oncol 2021; 5:71. [PMID: 34302041 PMCID: PMC8302655 DOI: 10.1038/s41698-021-00209-9] [Received: 11/08/2020] [Accepted: 06/22/2021] [Indexed: 11/09/2022]
Abstract
The FDA recently approved eight targeted therapies for acute myeloid leukemia (AML), including the BCL-2 inhibitor venetoclax. Maximizing efficacy of these treatments requires refining patient selection. To this end, we analyzed two recent AML studies profiling the gene expression and ex vivo drug response of primary patient samples. We find that ex vivo samples often exhibit a general sensitivity to (any) drug exposure, independent of drug target. We observe that this "general response across drugs" (GRD) is associated with FLT3-ITD mutations, clinical response to standard induction chemotherapy, and overall survival. Further, incorporating GRD into expression-based regression models trained on one of the studies improved their performance in predicting ex vivo response in the second study, thus signifying its relevance to precision oncology efforts. We find that venetoclax response is independent of GRD but instead show that it is linked to expression of monocyte-associated genes by developing and applying a multi-source Bayesian regression approach. The method shares information across studies to robustly identify biomarkers of drug response and is broadly applicable in integrative analyses.
Affiliation(s)
- Brian S White
  - Computational Oncology, Sage Bionetworks, Seattle, WA, USA
  - The Jackson Laboratory for Genomic Medicine, Farmington, CT, USA
- Suleiman A Khan
  - Institute for Molecular Medicine Finland (FIMM), Helsinki Institute of Life Science (HiLIFE), University of Helsinki, Helsinki, Finland
- Mike J Mason
  - Computational Oncology, Sage Bionetworks, Seattle, WA, USA
- Muhammad Ammad-Ud-Din
  - Institute for Molecular Medicine Finland (FIMM), Helsinki Institute of Life Science (HiLIFE), University of Helsinki, Helsinki, Finland
- Swapnil Potdar
  - Institute for Molecular Medicine Finland (FIMM), Helsinki Institute of Life Science (HiLIFE), University of Helsinki, Helsinki, Finland
- Disha Malani
  - Institute for Molecular Medicine Finland (FIMM), Helsinki Institute of Life Science (HiLIFE), University of Helsinki, Helsinki, Finland
- Heikki Kuusanmäki
  - Institute for Molecular Medicine Finland (FIMM), Helsinki Institute of Life Science (HiLIFE), University of Helsinki, Helsinki, Finland
  - Biotech Research & Innovation Centre (BRIC) and Novo Nordisk Foundation Center for Stem Cell Biology (DanStem), University of Copenhagen, Copenhagen, Denmark
- Brian J Druker
  - Howard Hughes Medical Institute, Portland, OR, USA
  - Division of Hematology and Medical Oncology, Knight Cancer Institute, Oregon Health & Science University, Portland, OR, USA
- Caroline Heckman
  - Institute for Molecular Medicine Finland (FIMM), Helsinki Institute of Life Science (HiLIFE), University of Helsinki, Helsinki, Finland
- Olli Kallioniemi
  - Institute for Molecular Medicine Finland (FIMM), Helsinki Institute of Life Science (HiLIFE), University of Helsinki, Helsinki, Finland
  - Scilifelab, Karolinska Institute, Solna, Sweden
- Stephen E Kurtz
  - Division of Hematology and Medical Oncology, Knight Cancer Institute, Oregon Health & Science University, Portland, OR, USA
- Kimmo Porkka
  - HUS Comprehensive Cancer Center, Hematology Research Unit Helsinki and iCAN Digital Precision Cancer Center Medicine Flagship, University of Helsinki, Helsinki, Finland
- Cristina E Tognon
  - Howard Hughes Medical Institute, Portland, OR, USA
  - Division of Hematology and Medical Oncology, Knight Cancer Institute, Oregon Health & Science University, Portland, OR, USA
- Jeffrey W Tyner
  - Division of Hematology and Medical Oncology, Knight Cancer Institute, Oregon Health & Science University, Portland, OR, USA
- Tero Aittokallio
  - Institute for Molecular Medicine Finland (FIMM), Helsinki Institute of Life Science (HiLIFE), University of Helsinki, Helsinki, Finland
  - Department of Mathematics and Statistics, University of Turku, Turku, Finland
  - Department of Cancer Genetics, Institute for Cancer Research, Oslo University Hospital, Oslo, Norway
  - Centre for Biostatistics and Epidemiology (OCBE), University of Oslo, Oslo, Norway
- Krister Wennerberg
  - Institute for Molecular Medicine Finland (FIMM), Helsinki Institute of Life Science (HiLIFE), University of Helsinki, Helsinki, Finland
  - Biotech Research & Innovation Centre (BRIC) and Novo Nordisk Foundation Center for Stem Cell Biology (DanStem), University of Copenhagen, Copenhagen, Denmark
- Justin Guinney
  - Computational Oncology, Sage Bionetworks, Seattle, WA, USA
  - Department of Biomedical Informatics and Medical Education, University of Washington, Seattle, WA, USA
5
Ye Z, Ke H, Chen S, Cruz-Cano R, He X, Zhang J, Dorgan J, Milton DK, Ma T. Biomarker Categorization in Transcriptomic Meta-Analysis by Concordant Patterns With Application to Pan-Cancer Studies. Front Genet 2021; 12:651546. [PMID: 34276766 PMCID: PMC8283696 DOI: 10.3389/fgene.2021.651546] [Received: 01/10/2021] [Accepted: 05/28/2021] [Indexed: 01/21/2023]
Abstract
With the increasing availability and dropping cost of high-throughput technology in recent years, many omics datasets have accumulated in the public domain. Combining multiple transcriptomic studies on related hypotheses via meta-analysis can improve statistical power and reproducibility over single studies. For differential expression (DE) analysis, biomarker categorization by DE pattern across studies is a natural but critical task following biomarker detection to help explain between-study heterogeneity and classify biomarkers into categories with potentially related functionality. In this paper, we propose a novel meta-analysis method to categorize biomarkers by simultaneously considering the concordant pattern and the biological and statistical significance across studies. Biomarkers with the same DE pattern can be analyzed together in downstream pathway enrichment analysis. In the presence of different types of transcripts (e.g., mRNA, miRNA, and lncRNA), integrative analysis including miRNA/lncRNA target enrichment analysis and miRNA-mRNA and lncRNA-mRNA causal regulatory network analysis can be conducted jointly on all the transcripts of the same category. We applied our method to two Pan-cancer transcriptomic study examples with single or multiple types of transcripts available. Targeted downstream analysis identified categories of biomarkers with unique functionality and regulatory relationships that motivate new hypotheses in Pan-cancer analysis.
Affiliation(s)
- Zhenyao Ye
  - Department of Epidemiology and Biostatistics, School of Public Health, University of Maryland, College Park, College Park, MD, United States
- Hongjie Ke
  - Department of Epidemiology and Biostatistics, School of Public Health, University of Maryland, College Park, College Park, MD, United States
- Shuo Chen
  - Department of Epidemiology and Public Health, School of Medicine, University of Maryland, Baltimore, Baltimore, MD, United States
- Raul Cruz-Cano
  - Department of Epidemiology and Biostatistics, School of Public Health, University of Maryland, College Park, College Park, MD, United States
- Xin He
  - Department of Epidemiology and Biostatistics, School of Public Health, University of Maryland, College Park, College Park, MD, United States
- Jing Zhang
  - Department of Epidemiology and Biostatistics, School of Public Health, University of Maryland, College Park, College Park, MD, United States
- Joanne Dorgan
  - Department of Epidemiology and Public Health, School of Medicine, University of Maryland, Baltimore, Baltimore, MD, United States
- Donald K Milton
  - Maryland Institute for Applied Environmental Health, School of Public Health, University of Maryland, College Park, College Park, MD, United States
- Tianzhou Ma
  - Department of Epidemiology and Biostatistics, School of Public Health, University of Maryland, College Park, College Park, MD, United States
6
Comin M, Di Camillo B, Pizzi C, Vandin F. Comparison of microbiome samples: methods and computational challenges. Brief Bioinform 2020; 22:88-95. [PMID: 32577746 DOI: 10.1093/bib/bbaa121] [Received: 12/22/2019] [Revised: 05/09/2020] [Accepted: 05/18/2020] [Indexed: 12/14/2022]
Abstract
The study of microbial communities crucially relies on the comparison of metagenomic next-generation sequencing data sets, for which several methods have been designed in recent years. Here, we review three key challenges in the comparison of such data sets: species identification and quantification, the efficient computation of distances between metagenomic samples and the identification of metagenomic features associated with a phenotype such as disease status. We present current solutions for such challenges, considering both reference-based methods relying on a database of reference genomes and reference-free methods working directly on all sequencing reads from the samples.
7
Brichetto G, Monti Bragadin M, Fiorini S, Battaglia MA, Konrad G, Ponzio M, Pedullà L, Verri A, Barla A, Tacchino A. The hidden information in patient-reported outcomes and clinician-assessed outcomes: multiple sclerosis as a proof of concept of a machine learning approach. Neurol Sci 2019; 41:459-462. [PMID: 31659583 PMCID: PMC7005074 DOI: 10.1007/s10072-019-04093-x] [Received: 03/08/2019] [Accepted: 09/28/2019] [Indexed: 11/30/2022]
Abstract
Machine learning (ML) applied to patient-reported outcomes (PROs) and clinician-assessed outcomes (CAOs) could favour a more predictive and personalized medicine. Our aim was to confirm the important role of applying ML to PROs and CAOs of people with relapsing-remitting (RR) and secondary progressive (SP) forms of multiple sclerosis (MS), to promptly identify information useful to predict disease progression. For our analysis, a dataset of 3398 evaluations from 810 persons with MS (PwMS) was adopted. The analysis comprised three steps: course classification; extraction of the most relevant predictors at the next time point; and prediction of whether the patient will experience the transition from RR to SP at the next time point. The Current Course Assignment (CCA) step correctly assigned the current MS course with an accuracy of about 86.0%. The MS course at the next time point can be predicted using the predictors selected in CCA. PROs/CAOs Evolution Prediction (PEP) followed by Future Course Assignment (FCA) was able to foresee the course at the next time point with an accuracy of 82.6%. Our results suggest that PROs and CAOs could support clinicians' decision-making in practice.
Affiliation(s)
- Giampaolo Brichetto
  - Department of Research, Italian Multiple Sclerosis Foundation, Genoa, Italy
  - AISM Rehabilitation Center of Liguria, Genoa, Italy
- Margherita Monti Bragadin
  - Department of Research, Italian Multiple Sclerosis Foundation, Genoa, Italy
  - AISM Rehabilitation Center of Liguria, Genoa, Italy
- Samuele Fiorini
  - Department of Informatics, Bioengineering, Robotics and System Engineering, University of Genoa, Genoa, Italy
- Michela Ponzio
  - Department of Research, Italian Multiple Sclerosis Foundation, Genoa, Italy
- Ludovico Pedullà
  - Department of Research, Italian Multiple Sclerosis Foundation, Genoa, Italy
- Alessandro Verri
  - Department of Informatics, Bioengineering, Robotics and System Engineering, University of Genoa, Genoa, Italy
- Annalisa Barla
  - Department of Informatics, Bioengineering, Robotics and System Engineering, University of Genoa, Genoa, Italy
- Andrea Tacchino
  - Department of Research, Italian Multiple Sclerosis Foundation, Genoa, Italy
8
Di Camillo B, Hakaste L, Sambo F, Gabriel R, Kravic J, Isomaa B, Tuomilehto J, Alonso M, Longato E, Facchinetti A, Groop LC, Cobelli C, Tuomi T. HAPT2D: high accuracy of prediction of T2D with a model combining basic and advanced data depending on availability. Eur J Endocrinol 2018; 178:331-341. [PMID: 29371336 DOI: 10.1530/eje-17-0921] [Received: 11/03/2017] [Accepted: 01/25/2018] [Indexed: 12/26/2022]
Abstract
OBJECTIVE Type 2 diabetes (T2D) arises from the interaction of physiological and lifestyle risk factors. Our objective was to develop a model for predicting the risk of T2D, which could use various amounts of background information. RESEARCH DESIGN AND METHODS We trained a survival analysis model on 8483 people from three large Finnish and Spanish data sets, to predict the time until incident T2D. All studies included anthropometric data, fasting laboratory values, an oral glucose tolerance test (OGTT) and information on co-morbidities and lifestyle habits. The variables were grouped into three sets reflecting different degrees of information availability. Scenario 1 included background and anthropometric information; Scenario 2 added routine laboratory tests; Scenario 3 also added results from an OGTT. Predictive performance of these models was compared with FINDRISC and Framingham risk scores. RESULTS The three models predicted T2D risk with an average integrated area under the ROC curve equal to 0.83, 0.87 and 0.90, respectively, compared with 0.80 and 0.75 obtained using the FINDRISC and Framingham risk scores. The results were validated on two independent cohorts. Glucose values and particularly 2-h glucose during OGTT (2h-PG) had the highest predictive value. Smoking, marital and professional status, waist circumference, blood pressure, age and gender were also predictive. CONCLUSIONS Our models provide an estimate of a patient's risk over time and outperform the traditional FINDRISC and Framingham scores for prediction of T2D risk. Of note, the models developed in Scenarios 1 and 2 only exploited variables easily available at general patient visits.
Affiliation(s)
- Barbara Di Camillo
  - Department of Information Engineering, University of Padova, Padova, Italy
- Liisa Hakaste
  - Endocrinology, Abdominal Centre, University of Helsinki and Helsinki University Hospital, Research Program for Diabetes and Obesity, University of Helsinki, Helsinki, Finland
  - Folkhälsan Research Center, Helsinki, Finland
- Francesco Sambo
  - Department of Information Engineering, University of Padova, Padova, Italy
- Rafael Gabriel
  - Department of International Health, National School of Public Health, Instituto de Salud Carlos III, Madrid, Spain
  - Asociación Española Para el Desarrollo de la Epidemiología Clínica (AEDEC), Madrid, Spain
- Jasmina Kravic
  - Lund University Diabetes Centre, Department of Clinical Sciences Malmö, Lund University, Skåne University Hospital, Malmö, Sweden
- Bo Isomaa
  - Folkhälsan Research Center, Helsinki, Finland
- Jaakko Tuomilehto
  - Asociación Española Para el Desarrollo de la Epidemiología Clínica (AEDEC), Madrid, Spain
  - Dasman Diabetes Institute, Dasman, Kuwait City, Kuwait
  - Department of Neuroscience and Preventive Medicine, Danube-University Krems, Krems, Austria
  - Saudi Diabetes Research Group, King Abdulaziz University, Jeddah, Saudi Arabia
- Margarita Alonso
  - Department of International Health, National School of Public Health, Instituto de Salud Carlos III, Madrid, Spain
  - Asociación Española Para el Desarrollo de la Epidemiología Clínica (AEDEC), Madrid, Spain
- Enrico Longato
  - Department of Information Engineering, University of Padova, Padova, Italy
- Andrea Facchinetti
  - Department of Information Engineering, University of Padova, Padova, Italy
- Leif C Groop
  - Lund University Diabetes Centre, Department of Clinical Sciences Malmö, Lund University, Skåne University Hospital, Malmö, Sweden
  - Institute for Molecular Medicine Finland (FIMM), University of Helsinki, Helsinki, Finland
- Claudio Cobelli
  - Department of Information Engineering, University of Padova, Padova, Italy
- Tiinamaija Tuomi
  - Endocrinology, Abdominal Centre, University of Helsinki and Helsinki University Hospital, Research Program for Diabetes and Obesity, University of Helsinki, Helsinki, Finland
  - Folkhälsan Research Center, Helsinki, Finland
  - Institute for Molecular Medicine Finland (FIMM), University of Helsinki, Helsinki, Finland
9
Vitova L, Tuma Z, Moravec J, Kvapil M, Matejovic M, Mares J. Early urinary biomarkers of diabetic nephropathy in type 1 diabetes mellitus show involvement of kallikrein-kinin system. BMC Nephrol 2017; 18:112. [PMID: 28359252 PMCID: PMC5372325 DOI: 10.1186/s12882-017-0519-4] [Received: 11/09/2016] [Accepted: 03/21/2017] [Indexed: 01/06/2023]
Abstract
BACKGROUND Additional urinary biomarkers for diabetic nephropathy (DN) are needed, providing early and reliable diagnosis and new insights into its mechanisms. Rigorous selection criteria and a homogeneous study population may improve reproducibility of the proteomic approach. METHODS Long-term type 1 diabetes patients without metabolic comorbidities were included, 11 with sustained microalbuminuria (MA) and 14 without MA (nMA). Morning urine proteins were precipitated and resolved by 2D electrophoresis. Principal component analysis (PCA) and Projection to latent structures discriminatory analysis (PLS-DA) were adopted to assess general data validity, to pick protein fractions for identification with mass spectrometry (MS), and to test the predictive value of the resulting model. RESULTS Proteins (n = 113) detected in more than 90% of patients were considered representative. Unsupervised PCA showed excellent natural data clustering without outliers. Protein spots reaching a Variable Importance in Projection score above 1 in PLS (n = 42) were subjected to MS, yielding 33 positive identifications. The PLS model rebuilt with these proteins achieved accurate classification of all patients (R2X = 0.553, R2Y = 0.953, Q2 = 0.947). Thus, multiple earlier recognized biomarkers of DN were confirmed and several putative new biomarkers suggested. Among them, kininogen-1 showed the highest significance. Its activation products detected in nMA patients exceeded by an order of magnitude the amount found in MA patients. CONCLUSIONS Reducing metabolic complexity of the diseased and control groups by meticulous patient selection makes it possible to focus the biomarker search in DN. Suggested new biomarkers, particularly kininogen fragments, exhibit the highest degree of correlation with MA and warrant validation in larger and more varied cohorts.
Affiliation(s)
- Lenka Vitova
  - Department of Internal Medicine, Teaching Hospital Motol, V Uvalu 84, Prague, 5, 150 06, Czech Republic
- Zdenek Tuma
  - Proteomic Laboratory, Charles University School of Medicine in Pilsen, alej Svobody 1655/76, Pilsen, 323 00, Czech Republic
- Jiri Moravec
  - Proteomic Laboratory, Charles University School of Medicine in Pilsen, alej Svobody 1655/76, Pilsen, 323 00, Czech Republic
- Milan Kvapil
  - Department of Internal Medicine, Teaching Hospital Motol, V Uvalu 84, Prague, 5, 150 06, Czech Republic
- Martin Matejovic
  - Department of Internal Medicine I, Charles University School of Medicine in Pilsen, alej Svobody 80, Pilsen, 304 60, Czech Republic
- Jan Mares
  - Proteomic Laboratory, Charles University School of Medicine in Pilsen, alej Svobody 1655/76, Pilsen, 323 00, Czech Republic
  - Department of Internal Medicine I, Charles University School of Medicine in Pilsen, alej Svobody 80, Pilsen, 304 60, Czech Republic
10
Gangeh MJ, Zarkoob H, Ghodsi A. Fast and Scalable Feature Selection for Gene Expression Data Using Hilbert-Schmidt Independence Criterion. IEEE/ACM Trans Comput Biol Bioinform 2017; 14:167-181. [PMID: 28182548 DOI: 10.1109/tcbb.2016.2631164] [Indexed: 06/06/2023]
Abstract
GOAL In computational biology, selecting a small subset of informative genes from microarray data continues to be a challenge due to the presence of thousands of genes. This paper aims at quantifying the dependence between gene expression data and the response variables and at identifying a subset of the most informative genes using a fast and scalable multivariate algorithm. METHODS A novel algorithm for feature selection from gene expression data was developed. The algorithm was based on the Hilbert-Schmidt independence criterion (HSIC), and was partly motivated by singular value decomposition (SVD). RESULTS The algorithm is computationally fast and scalable to large datasets. Moreover, it can be applied to problems with any type of response variables, including biclass, multiclass, and continuous response variables. The performance of the proposed algorithm in terms of accuracy, stability of the selected genes, speed, and scalability was evaluated using both synthetic and real-world datasets. The simulation results demonstrated that the proposed algorithm effectively and efficiently extracted stable genes with high predictive capability, in particular for datasets with multiclass response variables. CONCLUSION/SIGNIFICANCE The proposed method does not require the whole microarray dataset to be stored in memory, and thus can easily be scaled to large datasets. This capability is an important attribute in big data analytics, where data can be large and massively distributed.
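To make the dependence measure named in this abstract concrete, the sketch below computes the standard biased empirical HSIC estimate, trace(K H L H) / (n - 1)^2 with centering matrix H = I - (1/n) 1 1^T, and uses it to score each gene separately against the response. It assumes linear kernels and a simple per-gene ranking, so it is a univariate illustration only and not the multivariate, SVD-motivated algorithm of the paper; the function names `hsic`, `hsic_scores`, and `rank_features` are hypothetical.

```python
import numpy as np


def hsic(K, L):
    """Biased empirical HSIC estimate: trace(K H L H) / (n - 1)^2."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    return float(np.trace(K @ H @ L @ H)) / (n - 1) ** 2


def hsic_scores(X, y):
    """HSIC between each column of X and y, using linear kernels (an assumption)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    L = np.outer(y, y)  # linear kernel on the response
    scores = []
    for j in range(X.shape[1]):
        K = np.outer(X[:, j], X[:, j])  # linear kernel on one feature
        scores.append(hsic(K, L))
    return np.array(scores)


def rank_features(X, y):
    """Feature indices sorted from most to least dependent on y."""
    return np.argsort(-hsic_scores(X, y), kind="stable")


if __name__ == "__main__":
    # Toy data: feature 0 tracks the label, feature 1 is uncorrelated with it.
    X = np.array([[0.0, 1.0], [0.0, -1.0], [1.0, 1.0], [1.0, -1.0]])
    y = np.array([0.0, 0.0, 1.0, 1.0])
    print(hsic_scores(X, y))
    print(rank_features(X, y))
```

With linear kernels the score reduces to a squared, scaled covariance, so nonlinear dependence would need a different kernel (for example a Gaussian kernel), which is a design choice this sketch leaves open.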
11
Omae K, Komori O, Eguchi S. Reproducible detection of disease-associated markers from gene expression data. BMC Med Genomics 2016; 9:53. [PMID: 27538512 PMCID: PMC4991096 DOI: 10.1186/s12920-016-0214-5] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 08/03/2015] [Accepted: 08/03/2016] [Indexed: 01/22/2023] Open
Abstract
Background Detection of disease-associated markers plays a crucial role in gene screening for biological studies. Two-sample test statistics, such as the t-statistic, are widely used to rank genes based on gene expression data. However, the resultant gene ranking is often not reproducible among different data sets. Such irreproducibility may be caused by disease heterogeneity. Results When we divided data into two subsets, we found that the signs of the two t-statistics were often reversed. Focusing on such instability, we proposed a sign-sum statistic that counts the signs of the t-statistics for all possible subsets. The proposed method excludes genes affected by heterogeneity, thereby improving the reproducibility of gene ranking. We compared the sign-sum statistic with the t-statistic by a theoretical evaluation of the upper confidence limit. Through simulations and applications to real data sets, we show that the sign-sum statistic exhibits superior performance. Conclusion We derive the sign-sum statistic to obtain a robust gene ranking. The sign-sum statistic gives a more reproducible ranking than the t-statistic. Using simulated data sets we show that the sign-sum statistic excludes hetero-type genes well. For the real data sets, the sign-sum statistic also performs well from the viewpoint of ranking reproducibility.
Affiliation(s)
- Katsuhiro Omae
- Department of Statistical Science, The Graduate University for Advanced Studies, 10-3 Midori-cho, Tachikawa, Tokyo, 190-8562, Japan.
- Osamu Komori
- Department of Electrical, Electronic and Computer Engineering, University of Fukui, 3-9-1 Bunkyo, Fukui, Fukui, 910-8507, Japan
- Shinto Eguchi
- Department of Statistical Science, The Graduate University for Advanced Studies, 10-3 Midori-cho, Tachikawa, Tokyo, 190-8562, Japan; The Institute of Statistical Mathematics, 10-3 Midori-cho, Tachikawa, Tokyo, 190-8562, Japan
12
Kamkar I, Gupta SK, Phung D, Venkatesh S. Stabilizing l1-norm prediction models by supervised feature grouping. J Biomed Inform 2015; 59:149-68. [PMID: 26689771 DOI: 10.1016/j.jbi.2015.11.012] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 05/14/2015] [Revised: 11/18/2015] [Accepted: 11/23/2015] [Indexed: 01/05/2023]
Abstract
Emerging Electronic Medical Records (EMRs) have reformed modern healthcare. These records have great potential to be used for building clinical prediction models. However, a problem in using them is their high dimensionality. Since a lot of information may not be relevant for prediction, the underlying complexity of the prediction models may not be high. A popular way to deal with this problem is to employ feature selection. Lasso and l1-norm based feature selection methods have shown promising results. However, in the presence of correlated features, these methods select features that change considerably with small changes in data. This prevents clinicians from obtaining a stable feature set, which is crucial for clinical decision making. Grouping correlated variables together can improve the stability of feature selection; however, such grouping is usually not known and needs to be estimated for optimal performance. Addressing this problem, we propose a new model that can simultaneously learn the grouping of correlated features and perform stable feature selection. We formulate the model as a constrained optimization problem and provide an efficient solution with guaranteed convergence. Our experiments with both synthetic and real-world datasets show that the proposed model is significantly more stable than Lasso and many existing state-of-the-art shrinkage and classification methods. We further show that in terms of prediction performance, the proposed method consistently outperforms Lasso and other baselines. Our model can be used for selecting stable risk factors for a variety of healthcare problems, so it can assist clinicians toward accurate decision making.
Affiliation(s)
- Iman Kamkar
- Centre for Pattern Recognition and Data Analytics, Deakin University, Australia.
- Sunil Kumar Gupta
- Centre for Pattern Recognition and Data Analytics, Deakin University, Australia.
- Dinh Phung
- Centre for Pattern Recognition and Data Analytics, Deakin University, Australia.
- Svetha Venkatesh
- Centre for Pattern Recognition and Data Analytics, Deakin University, Australia.
13
Georga EI, Protopappas VC, Polyzos D, Fotiadis DI. Evaluation of short-term predictors of glucose concentration in type 1 diabetes combining feature ranking with regression models. Med Biol Eng Comput 2015; 53:1305-18. [PMID: 25773366 DOI: 10.1007/s11517-015-1263-1] [Citation(s) in RCA: 54] [Impact Index Per Article: 5.4] [Received: 03/31/2014] [Accepted: 02/27/2015] [Indexed: 01/04/2023]
Abstract
Glucose concentration in type 1 diabetes is a function of biological and environmental factors which present high inter-patient variability. The objective of this study is to evaluate a number of features, which are extracted from medical and lifestyle self-monitoring data, with respect to their ability to predict the short-term subcutaneous (s.c.) glucose concentration of an individual. Random forests (RF) and RReliefF algorithms are first employed to rank the candidate feature set. Then, a forward selection procedure follows to build a glucose predictive model, where features are sequentially added to it in decreasing order of importance. Predictions are performed using support vector regression or Gaussian processes. The proposed method is validated on a dataset of 15 type 1 diabetics in real-life conditions. The s.c. glucose profile along with time of the day and plasma insulin concentration are systematically highly ranked, while the effect of food intake and physical activity varies considerably among patients. Moreover, the average prediction error converges in less than d/2 iterations (d is the number of features). Our results suggest that RF and RReliefF can find the most informative features and can be successfully used to customize the input of glucose models.
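The rank-then-forward-select procedure described in this abstract can be outlined as below. This is only a sketch under stated assumptions: `forward_selection`, `toy_error`, and the numeric values are hypothetical. The error function stands in for the validation error that a support vector regression or Gaussian process model would report, and no real glucose data are involved.

```python
def forward_selection(ranked_features, evaluate):
    """Add features in decreasing order of importance and track the error.

    ranked_features: feature names, most important first.
    evaluate: callable taking a list of feature names and returning a
              prediction error (lower is better).
    Returns (best_subset, error_curve).
    """
    selected = []
    error_curve = []
    best_error = float("inf")
    best_subset = []
    for name in ranked_features:
        selected.append(name)
        error = evaluate(selected)
        error_curve.append(error)
        if error < best_error:          # keep the smallest-error prefix
            best_error = error
            best_subset = list(selected)
    return best_subset, error_curve


# Hypothetical stand-in for a model's validation error: each "useful"
# feature lowers the error by an arbitrary amount, and every other
# feature adds a small penalty.
USEFUL = {"glucose_profile": 12.0, "time_of_day": 5.0, "plasma_insulin": 3.0}


def toy_error(subset):
    gain = sum(USEFUL.get(name, 0.0) for name in subset)
    penalty = 0.5 * sum(1 for name in subset if name not in USEFUL)
    return 30.0 - gain + penalty


ranked = ["glucose_profile", "time_of_day", "plasma_insulin",
          "food_intake", "physical_activity"]
best_subset, error_curve = forward_selection(ranked, toy_error)
```

Because candidates are only ever added in ranked order, the loop evaluates at most d subsets for d features instead of searching all combinations.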
Affiliation(s)
- Eleni I Georga
- Unit of Medical Technology and Intelligent Information Systems, Department of Materials Science and Engineering, University of Ioannina, 45110, Ioannina, Greece
- Vasilios C Protopappas
- Unit of Medical Technology and Intelligent Information Systems, Department of Materials Science and Engineering, University of Ioannina, 45110, Ioannina, Greece
- Demosthenes Polyzos
- Department of Mechanical Engineering and Aeronautics, University of Patras, 26500, Patras, Greece
- Dimitrios I Fotiadis
- Unit of Medical Technology and Intelligent Information Systems, Department of Materials Science and Engineering, University of Ioannina, 45110, Ioannina, Greece.
14
Pegolo S, Di Camillo B, Montesissa C, Cannizzo FT, Biolatti B, Bargelloni L. Toxicogenomic markers for corticosteroid treatment in beef cattle: Integrated analysis of transcriptomic data. Food Chem Toxicol 2015; 77:1-11. [DOI: 10.1016/j.fct.2014.12.001] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.7] [Received: 07/09/2014] [Revised: 11/26/2014] [Accepted: 12/02/2014] [Indexed: 11/29/2022]
15
Sambo F, Malovini A, Sandholm N, Stavarachi M, Forsblom C, Mäkinen VP, Harjutsalo V, Lithovius R, Gordin D, Parkkonen M, Saraheimo M, Thorn LM, Tolonen N, Wadén J, He B, Osterholm AM, Tuomilehto J, Lajer M, Salem RM, McKnight AJ, Tarnow L, Panduru NM, Barbarini N, Di Camillo B, Toffolo GM, Tryggvason K, Bellazzi R, Cobelli C, Groop PH. Novel genetic susceptibility loci for diabetic end-stage renal disease identified through robust naive Bayes classification. Diabetologia 2014; 57:1611-22. [PMID: 24871321 DOI: 10.1007/s00125-014-3256-2] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.1] [Received: 12/10/2013] [Accepted: 04/11/2014] [Indexed: 10/25/2022]
Abstract
AIMS/HYPOTHESIS Diabetic nephropathy is a major diabetic complication, and diabetes is the leading cause of end-stage renal disease (ESRD). Family studies suggest a hereditary component for diabetic nephropathy. However, only a few genes have been associated with diabetic nephropathy or ESRD in diabetic patients. Our aim was to detect novel genetic variants associated with diabetic nephropathy and ESRD. METHODS We exploited a novel algorithm, 'Bag of Naive Bayes', whose marker selection strategy is complementary to that of conventional genome-wide association models based on univariate association tests. The analysis was performed on a genome-wide association study of 3,464 patients with type 1 diabetes from the Finnish Diabetic Nephropathy (FinnDiane) Study and subsequently replicated with 4,263 type 1 diabetes patients from the Steno Diabetes Centre, the All Ireland-Warren 3-Genetics of Kidneys in Diabetes UK collection (UK-Republic of Ireland) and the Genetics of Kidneys in Diabetes US Study (GoKinD US). RESULTS Five genetic loci (WNT4/ZBTB40-rs12137135, RGMA/MCTP2-rs17709344, MAPRE1P2-rs1670754, SEMA6D/SLC24A5-rs12917114 and SIK1-rs2838302) were associated with ESRD in the FinnDiane study. An association between ESRD and rs17709344, tagging the previously identified rs12437854 and located between the RGMA and MCTP2 genes, was replicated in independent case-control cohorts. rs12917114 near SEMA6D was associated with ESRD in the replication cohorts under the genotypic model (p < 0.05), and rs12137135 upstream of WNT4 was associated with ESRD in Steno. CONCLUSIONS/INTERPRETATION This study supports the previously identified findings on the RGMA/MCTP2 region and suggests novel susceptibility loci for ESRD. This highlights the importance of applying complementary statistical methods to detect novel genetic variants in diabetic nephropathy and, in general, in complex diseases.
Affiliation(s)
- Francesco Sambo
- Department of Information Engineering, University of Padova, Padova, Italy
16
Östlund G, Sonnhammer EL. Avoiding pitfalls in gene (co)expression meta-analysis. Genomics 2014; 103:21-30. [DOI: 10.1016/j.ygeno.2013.10.006] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Received: 05/09/2013] [Revised: 09/30/2013] [Accepted: 10/22/2013] [Indexed: 11/16/2022]
17
Di Camillo B, Sambo F, Toffolo G, Cobelli C. ABACUS: an entropy-based cumulative bivariate statistic robust to rare variants and different direction of genotype effect. Bioinformatics 2013; 30:384-91. [PMID: 24292361 DOI: 10.1093/bioinformatics/btt697] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 01/17/2023]
Abstract
MOTIVATION In the past years, both sequencing and microarray have been widely used to search for relations between genetic variations and predisposition to complex pathologies such as diabetes or neurological disorders. These studies, however, have been able to explain only a small fraction of disease heritability, possibly because complex pathologies cannot be attributed to a few dysfunctional genes, but are rather heterogeneous and multicausal, as a result of a combination of rare and common variants possibly impairing multiple regulatory pathways. Rare variants, though, are difficult to detect, especially when the effects of causal variants are in different directions, i.e. with protective and detrimental effects. RESULTS Here, we propose ABACUS, an Algorithm based on a BivAriate CUmulative Statistic to identify single nucleotide polymorphisms (SNPs) significantly associated with a disease within predefined sets of SNPs such as pathways or genomic regions. ABACUS is robust to the concurrent presence of SNPs with protective and detrimental effects and of common and rare variants; moreover, it is powerful even when few SNPs in the SNP-set are associated with the phenotype. We assessed ABACUS performance on simulated and real data and compared it with three state-of-the-art methods. When ABACUS was applied to type 1 and 2 diabetes data, besides observing a wide overlap with already known associations, we found a number of biologically sound pathways, which might shed light on diabetes mechanism and etiology. AVAILABILITY AND IMPLEMENTATION ABACUS is available at http://www.dei.unipd.it/~dicamill/pagine/Software.html.
Affiliation(s)
- Barbara Di Camillo
- Department of Information Engineering, University of Padova, via Gradenigo 6B, 35131 Padova, Italy
19
Wu MY, Dai DQ, Zhang XF, Zhu Y. Cancer subtype discovery and biomarker identification via a new robust network clustering algorithm. PLoS One 2013; 8:e66256. [PMID: 23799085 PMCID: PMC3684607 DOI: 10.1371/journal.pone.0066256] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.7] [Received: 02/08/2013] [Accepted: 05/02/2013] [Indexed: 11/29/2022] Open
Abstract
In cancer biology, it is very important to understand the phenotypic changes of the patients and discover new cancer subtypes. Recently, microarray-based technologies have shed light on this problem based on gene expression profiles which may contain outliers due to either chemical or electrical reasons. These undiscovered subtypes may be heterogeneous with respect to underlying networks or pathways, and are related to only a few interdependent biomarkers. This motivates a need for robust gene expression-based methods capable of discovering such subtypes, elucidating the corresponding network structures and identifying cancer-related biomarkers. This study proposes a penalized model-based Student's t clustering with unconstrained covariance (PMT-UC) to discover cancer subtypes with cluster-specific networks, taking gene dependencies into account and having robustness against outliers. Meanwhile, biomarker identification and network reconstruction are achieved by imposing an adaptive penalty on the means and the inverse scale matrices. The model is fitted via the expectation maximization algorithm utilizing the graphical lasso. Here, a network-based gene selection criterion that identifies biomarkers not as individual genes but as subnetworks is applied. This allows us to implicate biomarkers with low discriminative power which play a central role in the subnetwork by interconnecting many differentially expressed genes, or have cluster-specific underlying network structures. Experimental results on simulated datasets and one available cancer dataset attest to the effectiveness and robustness of PMT-UC in cancer subtype discovery. Moreover, PMT-UC has the ability to select cancer-related biomarkers which have been verified in biochemical or biomedical research and to learn biologically significant correlations among genes.
Affiliation(s)
- Meng-Yun Wu
- Center for Computer Vision and Department of Mathematics, Sun Yat-Sen University, Guangzhou, China
- Dao-Qing Dai
- Center for Computer Vision and Department of Mathematics, Sun Yat-Sen University, Guangzhou, China
- Xiao-Fei Zhang
- Center for Computer Vision and Department of Mathematics, Sun Yat-Sen University, Guangzhou, China
- Yuan Zhu
- Center for Computer Vision and Department of Mathematics, Sun Yat-Sen University, Guangzhou, China
- Department of Mathematics, Guangdong University of Business Studies, Guangzhou, China
20
Zycinski G, Barla A, Squillario M, Sanavia T, Camillo BD, Verri A. Knowledge Driven Variable Selection (KDVS) - a new approach to enrichment analysis of gene signatures obtained from high-throughput data. Source Code for Biology and Medicine 2013; 8:2. [PMID: 23302187 PMCID: PMC3605163 DOI: 10.1186/1751-0473-8-2] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Received: 09/28/2012] [Accepted: 12/13/2012] [Indexed: 11/10/2022]
Abstract
Background High-throughput (HT) technologies provide huge amounts of gene expression data that can be used to identify biomarkers useful in clinical practice. The most frequently used approaches first select a set of genes (i.e. gene signature) able to characterize differences between two or more phenotypical conditions, and then provide a functional assessment of the selected genes with an a posteriori enrichment analysis, based on biological knowledge. However, this approach comes with some drawbacks. First, the gene selection procedure often requires tunable parameters that affect the outcome, typically producing many false hits. Second, a posteriori enrichment analysis is based on mapping between biological concepts and gene expression measurements, which is hard to compute because of constant changes in biological knowledge and genome analysis. Third, such mapping is typically used in the assessment of the coverage of gene signature by biological concepts, that is either score-based or requires tunable parameters as well, limiting its power. Results We present Knowledge Driven Variable Selection (KDVS), a framework that uses a priori biological knowledge in HT data analysis. The expression data matrix is transformed, according to prior knowledge, into smaller matrices, easier to analyze and to interpret from both computational and biological viewpoints. Therefore KDVS, unlike most approaches, does not exclude a priori any function or process potentially relevant for the biological question under investigation. Unlike the standard approach, where gene selection and functional assessment are applied independently, KDVS embeds these two steps into a unified statistical framework, decreasing the variability derived from the threshold-dependent selection, the mapping to the biological concepts, and the signature coverage. We present three case studies to assess the usefulness of the method.
Conclusions We showed that KDVS not only enables the selection of known biological functionalities with accuracy, but also the identification of new ones. An efficient implementation of KDVS was devised to obtain results in a fast and robust way. Computing time is drastically reduced by the effective use of distributed resources. Finally, integrated visualization techniques immediately increase the interpretability of results. Overall, the KDVS approach can be considered a viable alternative to enrichment-based approaches.
Affiliation(s)
- Grzegorz Zycinski
- DIBRIS, University of Genoa, via Dodecaneso 35, I-16146 Genova, Italy.
21
Jurman G, Riccadonna S, Visintainer R, Furlanello C. Algebraic comparison of partial lists in bioinformatics. PLoS One 2012; 7:e36540. [PMID: 22615778 PMCID: PMC3355159 DOI: 10.1371/journal.pone.0036540] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.4] [Received: 06/10/2011] [Accepted: 04/06/2012] [Indexed: 12/20/2022] Open
Abstract
The outcome of a functional genomics pipeline is usually a partial list of genomic features, ranked by their relevance in modelling biological phenotype in terms of a classification or regression model. Due to resampling protocols or to a meta-analysis comparison, it is often the case that sets of alternative feature lists (possibly of different lengths) are obtained, instead of just one list. Here we introduce a method, based on permutations, for studying the variability between lists ("list stability") in the case of lists of unequal length. We provide algorithms evaluating stability for lists embedded in the full feature set or just limited to the features occurring in the partial lists. The method is demonstrated by finding and comparing gene profiles on a large prostate cancer dataset, consisting of two cohorts of patients from different countries, for a total of 455 samples.
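To make the idea of comparing ranked lists of unequal length concrete, the sketch below computes a Canberra-style distance between two partial lists. It is a deliberately simplified stand-in, not the permutation-based method of the paper: every feature missing from a list is assumed to sit at one shared rank just past the end of the longer list, and the function name `canberra_partial` and the example features are hypothetical.

```python
def canberra_partial(list_a, list_b):
    """Canberra-style distance between two ranked partial lists.

    Ranks start at 1. A feature absent from a list is assigned the rank
    max(len(list_a), len(list_b)) + 1, which is a simplifying assumption.
    """
    universe = set(list_a) | set(list_b)
    fill = max(len(list_a), len(list_b)) + 1
    rank_a = {feature: i + 1 for i, feature in enumerate(list_a)}
    rank_b = {feature: i + 1 for i, feature in enumerate(list_b)}
    total = 0.0
    for feature in universe:
        ra = rank_a.get(feature, fill)
        rb = rank_b.get(feature, fill)
        # Disagreements near the top of the lists weigh more, because the
        # denominator ra + rb is small there.
        total += abs(ra - rb) / (ra + rb)
    return total


# Two hypothetical gene lists of different lengths.
distance = canberra_partial(["a", "b", "c"], ["b", "a"])
```

A list-stability score could then be taken as the mean of this distance over all pairs of lists produced by resampling, with lower values indicating more stable rankings.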