1
You W, Yang Z, Ji G. PLS-based gene subset augmentation and tumor-specific gene identification. Comput Biol Med 2024; 174:108434. [PMID: 38636329 DOI: 10.1016/j.compbiomed.2024.108434] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/31/2024] [Revised: 03/18/2024] [Accepted: 04/07/2024] [Indexed: 04/20/2024]
Abstract
In the study of tumor disease pathogenesis, the identification of genes specifically expressed in disease states is pivotal, yet challenges arise from high-dimensional datasets with limited samples. Conventional gene (feature) selection methods often fall short of capturing the complexity of gene-phenotype and gene-gene interactions, necessitating a more robust analysis method. To address these challenges, a gene subset augmentation strategy is proposed in this paper. Our approach introduces diverse perturbation mechanisms to generate distinct gene subsets. The partial least squares-based multiple gene measurement algorithm considers gene-phenotype and gene-gene correlations, identifying differentially expressed genes, including those with weak signals. The constructed gene networks derived from the augmented subsets unveil regulatory patterns, enabling association analysis to explore gene associations comprehensively. Our algorithm excels in identifying small-sized gene subsets with strong discriminative power, surpassing traditional methods that yield a single gene subset. Unlike conventional approaches, our algorithm reveals a spectrum of different gene subsets and their weakly differentially expressed genes. This nuanced perspective aids in unraveling the molecular characteristics and specific expression patterns of tumor genes. The versatility of our approach not only contributes to the advancement of tumor-specific gene identification but also holds promise for addressing challenges in various fields characterized by high-dimensional datasets and limited samples. The Python implementation is available at http://github.com/wenjieyou/PLSGSA.
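The idea of ranking genes with partial least squares and perturbing the data to obtain several distinct gene subsets can be illustrated with a minimal sketch. This is not the authors' PLSGSA code: it assumes bootstrap resampling as the perturbation mechanism and uses only the first PLS component, whose weight vector is proportional to X^T y on centered data. The function names `pls1_weights` and `perturbed_gene_subsets` are hypothetical.

```python
import math
import random


def pls1_weights(X, y):
    """First PLS weight vector, proportional to X^T y on centered data.

    X is a list of samples (rows) by genes (columns); y is a numeric label list.
    """
    n, p = len(X), len(X[0])
    col_means = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    w = [
        sum((X[i][j] - col_means[j]) * (y[i] - y_mean) for i in range(n))
        for j in range(p)
    ]
    norm = math.sqrt(sum(v * v for v in w)) or 1.0  # avoid dividing by zero
    return [v / norm for v in w]


def perturbed_gene_subsets(X, y, k, n_rounds=20, seed=0):
    """Return one top-k gene subset per bootstrap round (assumed perturbation)."""
    rng = random.Random(seed)
    n = len(X)
    subsets = []
    for _ in range(n_rounds):
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        w = pls1_weights([X[i] for i in idx], [y[i] for i in idx])
        ranked = sorted(range(len(w)), key=lambda j: abs(w[j]), reverse=True)
        subsets.append(frozenset(ranked[:k]))
    return subsets
```

Genes that recur across many of the returned subsets are candidates for stable markers, while genes that appear only occasionally correspond loosely to the weak signals the abstract mentions. Capturing gene-gene correlations would require further PLS components, which this sketch omits.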
Affiliation(s)
- Wenjie You
- School of Big Data and Artificial Intelligence, Fujian Polytechnic Normal University, Fuqing, 350300, China.
- Zijiang Yang
- School of Information Technology, York University, Toronto, M3J 1P3, Canada.
- Guoli Ji
- Department of Automation, Xiamen University, Xiamen, 361005, China.
2
Sadacca B, Hamy AS, Laurent C, Gestraud P, Bonsang-Kitzis H, Pinheiro A, Abecassis J, Neuvial P, Reyal F. New insight for pharmacogenomics studies from the transcriptional analysis of two large-scale cancer cell line panels. Sci Rep 2017; 7:15126. [PMID: 29123141 PMCID: PMC5680301 DOI: 10.1038/s41598-017-14770-6] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.9] [Received: 05/23/2017] [Accepted: 10/12/2017] [Indexed: 12/31/2022]
Abstract
One of the most challenging problems in the development of new anticancer drugs is the very high attrition rate. The so-called “drug repositioning process” proposes to find new therapeutic indications for already approved drugs. For this, new analytic methods are required to optimize the information present in large-scale pharmacogenomics datasets. We analyzed data from the Genomics of Drug Sensitivity in Cancer and Cancer Cell Line Encyclopedia studies. We focused on common cell lines (n = 471), considering the molecular information, and the drug sensitivity for common drugs screened (n = 15). We propose a novel classification based on transcriptomic profiles of cell lines, according to a biological network-driven gene selection process. Our robust molecular classification displays greater homogeneity of drug sensitivity than cancer cell lines grouped by tissue of origin. We then identified significant associations between cell line cluster and drug response that were robustly found in both datasets. We further demonstrate the relevance of our method using two additional external datasets and distinct sensitivity metrics. Some associations remained robust despite variations in cell lines and drug responses. This study defines a robust molecular classification of cancer cell lines that could be used to find new therapeutic indications for known compounds.
Affiliation(s)
- Benjamin Sadacca
- Residual Tumor & Response to Treatment Laboratory (RT2Lab), PSL Research University, Translational Research Department, F-75248, Paris, France; U932 Immunity and Cancer, INSERM, Institut Curie, Paris, France; Laboratoire de Mathématiques et Modélisation d'Evry, Université d'Évry Val d'Essonne, UMR CNRS 8071, ENSIIE, USC INRA, Evry Val d'Essonne, France
- Anne-Sophie Hamy
- Residual Tumor & Response to Treatment Laboratory (RT2Lab), PSL Research University, Translational Research Department, F-75248, Paris, France; U932 Immunity and Cancer, INSERM, Institut Curie, Paris, France
- Cécile Laurent
- Residual Tumor & Response to Treatment Laboratory (RT2Lab), PSL Research University, Translational Research Department, F-75248, Paris, France; U932 Immunity and Cancer, INSERM, Institut Curie, Paris, France
- Pierre Gestraud
- Institut Curie, PSL Research University, Mines Paris Tech, Bioinformatics and Computational Systems Biology of Cancer, INSERM U900, F-75005, Paris, France
- Hélène Bonsang-Kitzis
- Residual Tumor & Response to Treatment Laboratory (RT2Lab), PSL Research University, Translational Research Department, F-75248, Paris, France; U932 Immunity and Cancer, INSERM, Institut Curie, Paris, France; Department of Surgery, Institut Curie, Paris, F-75248, France
- Alice Pinheiro
- Residual Tumor & Response to Treatment Laboratory (RT2Lab), PSL Research University, Translational Research Department, F-75248, Paris, France; U932 Immunity and Cancer, INSERM, Institut Curie, Paris, France
- Judith Abecassis
- Residual Tumor & Response to Treatment Laboratory (RT2Lab), PSL Research University, Translational Research Department, F-75248, Paris, France; U932 Immunity and Cancer, INSERM, Institut Curie, Paris, France; Mines Paristech, PSL-Research University, CBIO-Centre for Computational Biology, Mines ParisTech, Fontainebleau, F-77300, France; Institut Curie, PSL Research University, Mines Paris Tech, Bioinformatics and Computational Systems Biology of Cancer, INSERM U900, F-75005, Paris, France
- Pierre Neuvial
- Laboratoire de Mathématiques et Modélisation d'Evry, Université d'Évry Val d'Essonne, UMR CNRS 8071, ENSIIE, USC INRA, Evry Val d'Essonne, France; Institut de Mathématiques de Toulouse, UMR5219 Université de Toulouse, CNRS UPS IMT, F-31062, Toulouse Cedex 9, France
- Fabien Reyal
- Residual Tumor & Response to Treatment Laboratory (RT2Lab), PSL Research University, Translational Research Department, F-75248, Paris, France; U932 Immunity and Cancer, INSERM, Institut Curie, Paris, France; Department of Surgery, Institut Curie, Paris, F-75248, France
3
Macías-García L, Luna-Romera JM, García-Gutiérrez J, Martínez-Ballesteros M, Riquelme-Santos JC, González-Cámpora R. A study of the suitability of autoencoders for preprocessing data in breast cancer experimentation. J Biomed Inform 2017; 72:33-44. [DOI: 10.1016/j.jbi.2017.06.020] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.5] [Received: 11/21/2016] [Revised: 05/19/2017] [Accepted: 06/25/2017] [Indexed: 12/15/2022]
4
Bonsang-Kitzis H, Sadacca B, Hamy-Petit AS, Moarii M, Pinheiro A, Laurent C, Reyal F. Biological network-driven gene selection identifies a stromal immune module as a key determinant of triple-negative breast carcinoma prognosis. Oncoimmunology 2015; 5:e1061176. [PMID: 26942074 DOI: 10.1080/2162402x.2015.1061176] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.3] [Received: 03/03/2015] [Revised: 06/02/2015] [Accepted: 06/08/2015] [Indexed: 12/31/2022]
Abstract
Triple-negative breast cancer (TNBC) is a heterogeneous group of aggressive breast cancers for which no targeted treatment is available. Robust tools for TNBC classification are required, to improve the prediction of prognosis and to develop novel therapeutic interventions. We analyzed 3,247 primary human breast cancer samples from 21 publicly available datasets, using a five-step method: (1) selection of TNBC samples by bimodal filtering on ER-HER2 and PR, (2) normalization of the selected TNBC samples, (3) selection of the most variant genes, (4) identification of gene clusters and biological gene selection within gene clusters on the basis of String© database connections and gene-expression correlations, (5) summarization of each gene cluster in a metagene. We then assessed the ability of these metagenes to predict prognosis on an external public dataset (METABRIC). Our analysis of gene expression (GE) in 557 TNBCs from 21 public datasets identified a six-metagene signature (167 genes) in which the metagenes were enriched in different gene ontologies. The gene clusters were named as follows: Immunity1, Immunity2, Proliferation/DNA damage, AR-like, Matrix/Invasion1 and Matrix2. This signature was particularly robust for the identification of TNBC subtypes across many datasets (n = 1,125 samples), despite technology differences (Affymetrix© A, Plus2 and Illumina©). Weak Immunity2 metagene expression was associated with a poor prognosis (disease-specific survival; HR = 2.68 [1.59-4.52], p = 0.0002). The six-metagene signature (167 genes) was validated over 1,125 TNBC samples. The Immunity2 metagene had strong prognostic value. These findings open up interesting possibilities for the development of new therapeutic interventions.
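Step (5), the summarization of each gene cluster in a metagene, can be sketched as follows. The abstract does not state which summary statistic was used, so the per-sample median here is an assumption, and the function name `metagene_scores` and the input layout are hypothetical.

```python
from statistics import median


def metagene_scores(expression, clusters):
    """Summarize each gene cluster as one value per sample.

    expression: dict mapping gene name -> list of per-sample expression values.
    clusters:   dict mapping metagene name -> list of gene names.
    The per-sample median is an assumed summary, not taken from the paper.
    """
    n_samples = len(next(iter(expression.values())))
    scores = {}
    for name, genes in clusters.items():
        present = [g for g in genes if g in expression]
        if not present:
            continue  # no measured gene for this cluster on this platform
        scores[name] = [
            median(expression[g][s] for g in present) for s in range(n_samples)
        ]
    return scores
```

Skipping genes that are absent from a dataset is one simple way to apply the same cluster definitions across platforms that measure different gene sets.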
Affiliation(s)
- H Bonsang-Kitzis
- Residual Tumor & Response to Treatment Laboratory; RT2Lab; Translational Research Department; Institut Curie; Paris, France; U932 Immunity and Cancer; INSERM; Institut Curie; Paris, France; Department of Surgery; Institut Curie; Paris, France
- B Sadacca
- Residual Tumor & Response to Treatment Laboratory; RT2Lab; Translational Research Department; Institut Curie; Paris, France; U932 Immunity and Cancer; INSERM; Institut Curie; Paris, France; Laboratoire de Mathématiques et Modélisation d'Evry, Université d'Évry Val d'Essonne; UMR CNRS 8071, ENSIIE, USC INRA, France
- A S Hamy-Petit
- Residual Tumor & Response to Treatment Laboratory; RT2Lab; Translational Research Department; Institut Curie; Paris, France; U932 Immunity and Cancer; INSERM; Institut Curie; Paris, France
- M Moarii
- Mines Paristech; PSL-Research University; CBIO-Centre for Computational Biology; Mines ParisTech; Fontainebleau, France; U900, INSERM; Institut Curie; Paris, France
- A Pinheiro
- Residual Tumor & Response to Treatment Laboratory; RT2Lab; Translational Research Department; Institut Curie; Paris, France; U932 Immunity and Cancer; INSERM; Institut Curie; Paris, France
- C Laurent
- Residual Tumor & Response to Treatment Laboratory; RT2Lab; Translational Research Department; Institut Curie; Paris, France; U932 Immunity and Cancer; INSERM; Institut Curie; Paris, France
- F Reyal
- Residual Tumor & Response to Treatment Laboratory; RT2Lab; Translational Research Department; Institut Curie; Paris, France; U932 Immunity and Cancer; INSERM; Institut Curie; Paris, France; Department of Surgery; Institut Curie; Paris, France
5
Romanov V, Davidoff SN, Miles AR, Grainger DW, Gale BK, Brooks BD. A critical comparison of protein microarray fabrication technologies. Analyst 2015; 139:1303-26. [PMID: 24479125 DOI: 10.1039/c3an01577g] [Citation(s) in RCA: 99] [Impact Index Per Article: 9.9] [Indexed: 01/03/2023]
Abstract
Of the diverse analytical tools used in proteomics, protein microarrays possess the greatest potential for providing fundamental information on protein, ligand, analyte, receptor, and antibody affinity-based interactions, binding partners and high-throughput analysis. Microarrays have been used to develop tools for drug screening, disease diagnosis, biochemical pathway mapping, protein-protein interaction analysis, vaccine development, enzyme-substrate profiling, and immuno-profiling. While the promise of the technology is intriguing, it is yet to be realized. Many challenges remain to be addressed to allow these methods to meet technical and research expectations, provide reliable assay answers, and to reliably diversify their capabilities. Critical issues include: (1) inconsistent printed microspot morphologies and uniformities, (2) low signal-to-noise ratios due to factors such as complex surface capture protocols, contamination, and static or no-flow mass transport conditions, (3) inconsistent quantification of captured signal due to spot uniformity issues, (4) non-optimal protocol conditions such as pH, temperature, drying that promote variability in assay kinetics, and lastly (5) poor protein (e.g., antibody) printing, storage, or shelf-life compatibility with common microarray assay fabrication methods, directly related to microarray protocols. Conventional printing approaches, including contact (e.g., quill and solid pin), non-contact (e.g., piezo and inkjet), microfluidics-based, microstamping, lithography, and cell-free protein expression microarrays, have all been used with varying degrees of success with figures of merit often defined arbitrarily without comparisons to standards, or analytical or fiduciary controls. Many microarray performance reports use bench top analyte preparations lacking real-world relevance, akin to "fishing in a barrel", for proof of concept and determinations of figures of merit. 
This review critiques current protein-based microarray preparation techniques commonly used for analytical and function-based proteomics and their effects on array-based assay performance.
Affiliation(s)
- Valentin Romanov
- Wasatch Microfluidics, LLC, 825 N. 300 W., Suite C325, Salt Lake City, UT, USA.
6
Mentink A, Hulsman M, Groen N, Licht R, Dechering KJ, van der Stok J, Alves HA, Dhert WJ, van Someren EP, Reinders MJ, van Blitterswijk CA, de Boer J. Predicting the therapeutic efficacy of MSC in bone tissue engineering using the molecular marker CADM1. Biomaterials 2013; 34:4592-601. [DOI: 10.1016/j.biomaterials.2013.03.001] [Citation(s) in RCA: 39] [Impact Index Per Article: 3.3] [Received: 01/08/2013] [Accepted: 03/01/2013] [Indexed: 12/17/2022]
7
Badsha MB, Mollah MNH, Jahan N, Kurata H. Robust complementary hierarchical clustering for gene expression data analysis by β-divergence. J Biosci Bioeng 2013; 116:397-407. [PMID: 23608734 DOI: 10.1016/j.jbiosc.2013.03.010] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Received: 10/31/2012] [Revised: 03/08/2013] [Accepted: 03/12/2013] [Indexed: 11/17/2022]
Abstract
A hierarchical clustering (HC) algorithm is one of the most widely used unsupervised statistical techniques for analyzing microarray gene expression data. When applying the HC algorithm to the gene expression data to cluster individuals, most of the HC algorithms generate clusters based on the highly differentially expressed (DE) genes that have very similar expression patterns. These highly DE genes may sometimes be irrelevant in biological processes. The serious problem is that those irrelevant genes with high expressions potentially drown out the low expressed genes that have important biological functions. To overcome the problem, Nowak and Tibshirani proposed the complementary hierarchical clustering (CHC) (Biostatistics, 9, 467-483, 2008). However, it is not robust against outlying expression and often produces misleading results if there exist some contaminations in the gene expression data. Thus, we propose the robust CHC (RCHC) method to robustify the CHC with respect to outliers by maximizing the β-likelihood function for sequential extraction of a gene-set with proper groups of individuals. Note that the proposed method reduces to the CHC with the tuning parameter β → 0. A value of β plays a key role in the performance of the RCHC method, which controls the tradeoff between the robustness and efficiency of the estimators. Using simulation and real gene expression analysis, the RCHC method shows robust properties to gene expression clustering with respect to data contaminations, overcomes the problem of the CHC, and predicts critically important genes from breast cancer data.
Affiliation(s)
- Md Bahadur Badsha
- Department of Bioscience and Bioinformatics, Kyushu Institute of Technology, 680-4 Kawazu, Iizuka, Fukuoka 820-8502, Japan.
8
Burton M, Thomassen M, Tan Q, Kruse TA. Prediction of breast cancer metastasis by gene expression profiles: a comparison of metagenes and single genes. Cancer Inform 2012; 11:193-217. [PMID: 23304070 PMCID: PMC3529607 DOI: 10.4137/cin.s10375] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Indexed: 01/05/2023]
Abstract
Background: The popularity of microarray applications in cancer research has led to the development of predictive or prognostic gene expression profiles. However, the diversity of microarray platforms has made the full validation of such profiles and their related gene lists across studies difficult, and classification accuracies are rarely validated in multiple independent datasets. Frequently, while the individual genes between such lists may not match, genes with the same function are included across such gene lists. Development of such lists does not take into account the fact that genes can be grouped together as metagenes (MGs) based on common characteristics such as pathways, regulation, or genomic location. Such MGs might be used as features in building a predictive model applicable for classifying independent data. There is, therefore, a need to systematically compare independent validation of gene lists or classifiers based on metagene or individual gene (SG) features.
Methods: In this study we compared the performance of either metagene- or single gene-based feature sets and classifiers using random forest and two support vector machines for classifier building. The performance within the same dataset, feature set validation performance, and validation performance of entire classifiers in strictly independent datasets were assessed by 10 times repeated 10-fold cross validation, leave-one-out cross validation, and one-fold validation, respectively. To test the significance of the performance difference between MG- and SG-features/classifiers, we used a repeated down-sampled binomial test approach.
Results: MG- and SG-feature sets are transferable and perform well for training and testing prediction of metastasis outcome in strictly independent data sets, both between different and within similar microarray platforms, while classifiers had a poorer performance when validated in strictly independent datasets. The study showed that MG- and SG-feature sets perform equally well in classifying independent data. Furthermore, SG-classifiers significantly outperformed MG-classifiers when validation was conducted between datasets using similar platforms, while no significant performance difference was found when validation was performed between different platforms.
Conclusion: Prediction of metastasis outcome in lymph node–negative patients by MG- and SG-classifiers showed that SG-classifiers performed significantly better than MG-classifiers when validated in independent data based on the same microarray platform as used for developing the classifier. However, the MG- and SG-classifiers had similar performance when conducting classifier validation in independent data based on a different microarray platform. The latter was also true when only validating sets of MG- and SG-features in independent datasets, both between and within similar and different platforms.
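The "10 times repeated 10-fold cross validation" scheme mentioned in the Methods can be sketched as an index generator. This is a generic illustration, not the study's code; the function name `repeated_kfold` and its parameters are hypothetical, and stratification by class is deliberately left out.

```python
import random


def repeated_kfold(n_samples, k=10, repeats=10, seed=0):
    """Yield (repeat, fold, train_idx, test_idx) for repeated k-fold CV.

    Each repeat reshuffles the samples, then deals them into k folds;
    every sample is used for testing exactly once per repeat.
    """
    rng = random.Random(seed)
    for r in range(repeats):
        order = list(range(n_samples))
        rng.shuffle(order)
        folds = [order[i::k] for i in range(k)]  # k nearly equal folds
        for f, test_idx in enumerate(folds):
            train_idx = [i for g, fold in enumerate(folds) if g != f for i in fold]
            yield r, f, train_idx, test_idx
```

Averaging a performance metric over all `k * repeats` splits reduces the variance that comes from any single random partition. Setting `k` equal to `n_samples` with one repeat gives leave-one-out cross validation.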
Affiliation(s)
- Mark Burton
- Institute of Clinical Research, Research Unit of Human Genetics, University of Southern Denmark, Odense, Denmark ; Department of Clinical Genetics, Odense University Hospital, Odense, Denmark
9
Leung YY, Chang CQ, Hung YS. An integrated approach for identifying wrongly labelled samples when performing classification in microarray data. PLoS One 2012; 7:e46700. [PMID: 23082127 PMCID: PMC3474777 DOI: 10.1371/journal.pone.0046700] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Received: 06/07/2012] [Accepted: 09/03/2012] [Indexed: 01/05/2023]
Abstract
Background: Using a hybrid approach for gene selection and classification is common, as the results obtained are generally better than performing the two tasks independently. Yet, for some microarray datasets, both the classification accuracy and the stability of the gene sets obtained still have room for improvement. This may be due to the presence of samples with wrong class labels (i.e. outliers). Outlier detection algorithms proposed so far are either not suitable for microarray data, or only solve the outlier detection problem on their own.
Results: We tackle the outlier detection problem based on a previously proposed Multiple-Filter-Multiple-Wrapper (MFMW) model, which was demonstrated to yield promising results when compared to other hybrid approaches (Leung and Hung, 2010). To incorporate outlier detection and overcome limitations of the existing MFMW model, three new features are introduced in our proposed MFMW-outlier approach: 1) an unbiased external Leave-One-Out Cross-Validation framework is developed to replace internal cross-validation in the previous MFMW model; 2) wrongly labeled samples are identified within the MFMW-outlier model; and 3) a stable set of genes is selected using an L1-norm SVM that removes any redundant genes present. Six binary-class microarray datasets were tested. Compared with outlier detection studies on the same datasets, MFMW-outlier could detect all the outliers found in the original paper (for which the data was provided for analysis), and the genes selected after outlier removal were proven to have biological relevance. We also compared MFMW-outlier with PRAPIV (Zhang et al., 2006) on the same synthetic datasets. MFMW-outlier gave better average precision and recall values in three different settings. Lastly, artificially flipped microarray datasets were created by removing our detected outliers and flipping some of the remaining samples' labels. Almost all the ‘wrong’ (artificially flipped) samples were detected, suggesting that MFMW-outlier was sufficiently powerful to detect outliers in high-dimensional microarray datasets.
Affiliation(s)
- Yuk Yee Leung
- Department of Electrical and Electronic Engineering, The University of Hong Kong, Hong Kong Special Administrative Region, China.
10
Yang X, Regan K, Huang Y, Zhang Q, Li J, Seiwert TY, Cohen EEW, Xing HR, Lussier YA. Single sample expression-anchored mechanisms predict survival in head and neck cancer. PLoS Comput Biol 2012; 8:e1002350. [PMID: 22291585 PMCID: PMC3266878 DOI: 10.1371/journal.pcbi.1002350] [Citation(s) in RCA: 62] [Impact Index Per Article: 4.8] [Received: 07/14/2011] [Accepted: 11/28/2011] [Indexed: 12/11/2022]
Abstract
Gene expression signatures that are predictive of therapeutic response or prognosis are increasingly useful in clinical care; however, mechanistic (and intuitive) interpretation of expression arrays remains an unmet challenge. Additionally, there is surprisingly little gene overlap among distinct clinically validated expression signatures. These “causality challenges” hinder the adoption of signatures as compared to functionally well-characterized single gene biomarkers. To increase the utility of multi-gene signatures in survival studies, we developed a novel approach to generate “personal mechanism signatures” of molecular pathways and functions from gene expression arrays. FAIME, the Functional Analysis of Individual Microarray Expression, computes mechanism scores using rank-weighted gene expression of an individual sample. By comparing head and neck squamous cell carcinoma (HNSCC) samples with non-tumor control tissues, the precision and recall of deregulated FAIME-derived mechanisms of pathways and molecular functions are comparable to those produced by conventional cohort-wide methods (e.g. GSEA). The overlap of “Oncogenic FAIME Features of HNSCC” (statistically significant and differentially regulated FAIME-derived genesets representing GO functions or KEGG pathways derived from HNSCC tissue) among three distinct HNSCC datasets (pathways:46%, p<0.001) is more significant than the gene overlap (genes:4%). These Oncogenic FAIME Features of HNSCC can accurately discriminate tumors from control tissues in two additional HNSCC datasets (n = 35 and 91, F-accuracy = 100% and 97%, empirical p<0.001, area under the receiver operating characteristic curves = 99% and 92%), and stratify recurrence-free survival in patients from two independent studies (p = 0.0018 and p = 0.032, log-rank). Previous approaches depending on group assignment of individual samples before selecting features or learning a classifier are limited by design to discrete-class prediction. 
In contrast, FAIME calculates mechanism profiles for individual patients without requiring group assignment in validation sets. FAIME is more amenable for clinical deployment since it translates the gene-level measurements of each given sample into pathways and molecular function profiles that can be applied to analyze continuous phenotypes in clinical outcome studies (e.g. survival time, tumor volume). Clinical utilization of multi-gene expression signatures that are predictive of therapeutic response has been steadily increasing, however, interpretation of such results remains challenging because multi-gene signatures, generated from analyzing different patient cohorts, tend to be equally predictive but contain minimal overlap. Whereas pathway-level analyses of expression arrays show promise for generating clinically meaningful mechanistic signatures, current approaches do not permit single-patient based analyses that are independent of cross-group calculations. To bridge the gap between deterministic biological mechanisms of single-gene biomarkers and the statistical predictive power of multi-gene signatures that are disconnected from mechanisms, we developed FAIME, a novel method that transforms microarray gene expression data into individualized patient profiles of molecular mechanisms. We have validated its capability for predicting clinical outcomes, including cancer patient samples derived from six different clinical trial cohorts of head and neck cancers. This method provides opportunities to harness an untapped resource for personal genomics: clinical evaluation and testing of individually interpretable mechanistic profiles derived from gene expression arrays.
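The notion of a per-sample mechanism score built from rank-weighted expression can be illustrated with a simplified sketch. The weighting below (rank times an exponential decay, with higher expression receiving higher rank) and the contrast between genes inside and outside the set are assumptions chosen for illustration; the abstract does not give the exact FAIME formula, and the function name `rank_weighted_score` is hypothetical.

```python
import math


def rank_weighted_score(expression, gene_set):
    """Score one gene set in ONE sample from rank-weighted expression.

    expression: dict mapping gene name -> expression value for a single sample.
    gene_set:   iterable of gene names (e.g. a pathway or GO term).
    Ties in expression are broken arbitrarily in this sketch.
    """
    members = set(gene_set)
    genes = sorted(expression, key=expression.get)  # ascending expression
    n = len(genes)
    # Assumed weighting: increasing in rank, damped for the top ranks.
    weight = {g: r * math.exp(-r / n) for r, g in enumerate(genes, start=1)}
    inside = [weight[g] for g in genes if g in members]
    outside = [weight[g] for g in genes if g not in members]
    if not inside or not outside:
        raise ValueError("gene set must cover some, but not all, measured genes")
    return sum(inside) / len(inside) - sum(outside) / len(outside)
```

Because the score uses only the ranks within a single sample, it needs no cohort or group assignment, which is the property the abstract emphasizes for analysis of continuous phenotypes.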
Affiliation(s)
- Xinan Yang
- Center for Biomedical Informatics, The University of Chicago, Chicago, Illinois, United States of America
- Section of Genetic Medicine, The University of Chicago, Chicago, Illinois, United States of America
- Kelly Regan
- Center for Biomedical Informatics, The University of Chicago, Chicago, Illinois, United States of America
- Yong Huang
- Center for Biomedical Informatics, The University of Chicago, Chicago, Illinois, United States of America
- Section of Genetic Medicine, The University of Chicago, Chicago, Illinois, United States of America
- Qingbei Zhang
- Center for Biomedical Informatics, The University of Chicago, Chicago, Illinois, United States of America
- Section of Genetic Medicine, The University of Chicago, Chicago, Illinois, United States of America
- Jianrong Li
- Center for Biomedical Informatics, The University of Chicago, Chicago, Illinois, United States of America
- Section of Genetic Medicine, The University of Chicago, Chicago, Illinois, United States of America
- Tanguy Y. Seiwert
- Section of Hematology/Oncology of the Department of Medicine, The University of Chicago, Chicago, Illinois, United States of America
- Comprehensive Cancer Center, The University of Chicago, Chicago, Illinois, United States of America
- Ezra E. W. Cohen
- Section of Hematology/Oncology of the Department of Medicine, The University of Chicago, Chicago, Illinois, United States of America
- Comprehensive Cancer Center, The University of Chicago, Chicago, Illinois, United States of America
- H. Rosie Xing
- Comprehensive Cancer Center, The University of Chicago, Chicago, Illinois, United States of America
- Departments of Pathology and of Cellular and Radiation Oncology, The University of Chicago, Chicago, Illinois, United States of America
- Ludwig Center for Metastasis Research, The University of Chicago, Chicago, Illinois, United States of America
- Yves A. Lussier
- Center for Biomedical Informatics, The University of Chicago, Chicago, Illinois, United States of America
- Section of Genetic Medicine, The University of Chicago, Chicago, Illinois, United States of America
- Comprehensive Cancer Center, The University of Chicago, Chicago, Illinois, United States of America
- Departments of Pathology and of Cellular and Radiation Oncology, The University of Chicago, Chicago, Illinois, United States of America
- Ludwig Center for Metastasis Research, The University of Chicago, Chicago, Illinois, United States of America
- Computation Institute, Institute for Translational Medicine, and Institute for Genomics and Systems Biology, The University of Chicago, Chicago, Illinois, United States of America
11
Lorena AC, Costa IG, Spolaôr N, de Souto MC. Analysis of complexity indices for classification problems: Cancer gene expression data. Neurocomputing 2012. [DOI: 10.1016/j.neucom.2011.03.054] [Citation(s) in RCA: 37] [Impact Index Per Article: 2.8] [Indexed: 01/24/2023]
12
Hess KR, Wei C, Qi Y, Iwamoto T, Symmans WF, Pusztai L. Lack of sufficiently strong informative features limits the potential of gene expression analysis as predictive tool for many clinical classification problems. BMC Bioinformatics 2011; 12:463. [PMID: 22132775 PMCID: PMC3245512 DOI: 10.1186/1471-2105-12-463] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.9] [Received: 07/21/2011] [Accepted: 12/01/2011] [Indexed: 02/07/2023]
Abstract
Background: Our goal was to examine how various aspects of a gene signature influence the success of developing multi-gene prediction models. We inserted gene signatures into three real data sets by altering the expression level of existing probe sets. We varied the number of probe sets perturbed (signature size), the fold increase of mean probe set expression in perturbed compared to unperturbed data (signature strength) and the number of samples perturbed. Prediction models were trained to identify which cases had been perturbed. Performance was estimated using Monte-Carlo cross validation.
Results: Signature strength had the greatest influence on predictor performance. It was possible to develop almost perfect predictors with as few as 10 features if the fold difference in mean expression values was > 2, even when the spiked samples represented 10% of all samples. We also assessed the gene signature set size and strength for 9 real clinical prediction problems in six different breast cancer data sets.
Conclusions: We found sufficiently large and strong predictive signatures only for distinguishing ER-positive from ER-negative cancers; there were no strong signatures for more subtle prediction problems. Current statistical methods efficiently identify highly informative features in gene expression data if such features exist, and accurate models can be built with as few as 10 highly informative features. Features can be considered highly informative if at least a 2-fold expression difference exists between comparison groups, but such features do not appear to be common for many clinically relevant prediction problems in human data sets.
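The experimental design described above (spiking a signature into real data, then estimating performance by Monte-Carlo cross validation) can be sketched as two helpers. This is an illustration under stated assumptions: expression values are taken to be on a linear scale so that the fold increase is a multiplication, and the names `spike_signature` and `monte_carlo_splits`, along with their default parameters, are hypothetical.

```python
import random


def spike_signature(data, n_probes, n_samples, fold, seed=0):
    """Insert an artificial signature into a samples-by-probes matrix.

    A random subset of samples has a random subset of probe sets multiplied
    by `fold` (linear scale assumed). Returns the spiked copy and 0/1 labels
    marking which samples were perturbed.
    """
    rng = random.Random(seed)
    n_rows, n_cols = len(data), len(data[0])
    samples = set(rng.sample(range(n_rows), n_samples))
    probes = set(rng.sample(range(n_cols), n_probes))
    spiked = [
        [v * fold if (i in samples and j in probes) else v for j, v in enumerate(row)]
        for i, row in enumerate(data)
    ]
    labels = [1 if i in samples else 0 for i in range(n_rows)]
    return spiked, labels


def monte_carlo_splits(n_samples, test_fraction=0.3, n_splits=100, seed=0):
    """Yield (train_idx, test_idx) from independent random splits."""
    rng = random.Random(seed)
    n_test = max(1, round(n_samples * test_fraction))
    for _ in range(n_splits):
        order = list(range(n_samples))
        rng.shuffle(order)
        yield order[n_test:], order[:n_test]
```

Varying `n_probes`, `fold`, and `n_samples` corresponds to the signature size, signature strength, and number of perturbed samples in the abstract. Unlike k-fold cross validation, Monte-Carlo splits are drawn independently, so a sample may appear in several test sets or in none.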
Affiliation(s)
- Kenneth R Hess
- Department of Biostatistics, University of Texas MD Anderson Cancer Center, Houston, USA
|
13
|
Sontrop HMJ, Verhaegh WFJ, Reinders MJT, Moerland PD. An evaluation protocol for subtype-specific breast cancer event prediction. PLoS One 2011; 6:e21681. [PMID: 21760900 PMCID: PMC3132736 DOI: 10.1371/journal.pone.0021681] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Received: 04/14/2011] [Accepted: 06/05/2011] [Indexed: 12/31/2022]
Abstract
In recent years, increasing evidence has appeared that breast cancer may not constitute a single disease at the molecular level, but comprises a heterogeneous set of subtypes. This suggests that instead of building a single monolithic predictor, better predictors might be constructed that solely target samples of a designated subtype, which are believed to represent more homogeneous sets of samples. An unavoidable drawback of developing subtype-specific predictors, however, is that a stratification by subtype drastically reduces the number of samples available for their construction. As numerous studies have indicated sample size to be an important factor in predictor construction, it is therefore questionable whether the potential benefit of subtyping can outweigh the drawback of a severe loss in sample size. Factors like unequal class distributions and differences in the number of samples per subtype further complicate comparisons. We present a novel experimental protocol that facilitates a comprehensive comparison between subtype-specific predictors and predictors that do not take subtype information into account. Emphasis lies on careful control of sample size as well as class and subtype distributions. The methodology is applied to a large breast cancer compendium involving over 1500 arrays, using a state-of-the-art subtyping scheme. We show that the resulting subtype-specific predictors outperform those that do not take subtype information into account, especially when taking sample size considerations into account.
Affiliation(s)
- Wim F. J. Verhaegh
- Molecular Diagnostics Department, Philips Research, Eindhoven, The Netherlands
- Marcel J. T. Reinders
- Delft Bioinformatics Lab, Delft University of Technology, Delft, The Netherlands
- Netherlands Bioinformatics Centre, Nijmegen, The Netherlands
- Perry D. Moerland
- Bioinformatics Laboratory, Department of Clinical Epidemiology, Biostatistics, and Bioinformatics, Academic Medical Center, Amsterdam, The Netherlands
- Netherlands Bioinformatics Centre, Nijmegen, The Netherlands
|
14
|
Identifying HIV-1 host cell factors by genome-scale RNAi screening. Methods 2010; 53:3-12. [PMID: 20654720 DOI: 10.1016/j.ymeth.2010.07.009] [Citation(s) in RCA: 31] [Impact Index Per Article: 2.1] [Received: 02/23/2010] [Revised: 07/15/2010] [Accepted: 07/15/2010] [Indexed: 12/30/2022]
Abstract
Advances in the application of RNA interference (RNAi) have facilitated the establishment of systematic cell-based loss-of-function screening platforms. Widespread implementation of this technology has enabled genome-wide genetic analysis of a diverse array of cellular phenotypes, including the identification of host cell factors involved in viral replication. Four recent studies employed whole-genome RNAi technologies to elucidate cellular genes important for the replication of HIV-1. While these four genome-scale screens shared a common objective, they differ in their scope and experimental design. In this review we explore alternative strategies for developing RNAi screens, and discuss potential pitfalls of the technology. Important technical considerations include the choice of silencing reagents, experimental systems, assay readout and analysis methods. We focus on experimental and computational parameters that can impact the outcome of high-throughput genetic screens, and provide guidelines for the development of reliable cell-based RNAi screens.
|
15
|
Johannes M, Brase JC, Fröhlich H, Gade S, Gehrmann M, Fälth M, Sültmann H, Beissbarth T. Integration of pathway knowledge into a reweighted recursive feature elimination approach for risk stratification of cancer patients. Bioinformatics 2010; 26:2136-44. [PMID: 20591905 DOI: 10.1093/bioinformatics/btq345] [Citation(s) in RCA: 59] [Impact Index Per Article: 3.9] [Indexed: 12/17/2022]
Abstract
Motivation: One of the main goals of high-throughput gene-expression studies in cancer research is to identify prognostic gene signatures, which have the potential to predict the clinical outcome. It is common practice to investigate these questions using classification methods. However, standard methods merely rely on gene-expression data and assume the genes to be independent. Including pathway knowledge a priori into the classification process has recently been indicated as a promising way to increase classification accuracy as well as the interpretability and reproducibility of prognostic gene signatures. Results: We propose a new method called Reweighted Recursive Feature Elimination. It is based on the hypothesis that a gene with a low fold-change should have an increased influence on the classifier if it is connected to differentially expressed genes. We used a modified version of Google's PageRank algorithm to alter the ranking criterion of the SVM-RFE algorithm. Evaluations of our method on an integrated breast cancer dataset comprising 788 samples showed an improvement of the area under the receiver operator characteristic curve as well as in the reproducibility and interpretability of selected genes. Availability: The R code of the proposed algorithm is given in Supplementary Material.
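The hypothesis in this abstract, that a gene with a low fold-change should gain influence when it is connected to differentially expressed genes, can be illustrated with a generic personalized-PageRank iteration. This sketch is not the authors' modified PageRank and does not reproduce their SVM-RFE criterion: the update rule, the damping value, the toy network, and the name `network_smoothed_scores` are all assumptions made for illustration.

```python
def network_smoothed_scores(adjacency, fold_change, damping=0.5, n_iter=100):
    """Propagate fold-change evidence over an undirected gene network.

    adjacency:   dict mapping each gene to a list of its neighbours
                 (symmetric; every gene here has at least one neighbour)
    fold_change: dict mapping each gene to its absolute fold-change
    damping:     share of a gene's score that comes from its neighbours
    """
    genes = list(adjacency)
    total = sum(abs(fold_change[g]) for g in genes)
    # Prior: each gene's own expression evidence, normalised to sum to 1.
    prior = {g: abs(fold_change[g]) / total for g in genes}
    rank = dict(prior)
    for _ in range(n_iter):
        updated = {}
        for g in genes:
            # Each neighbour h shares its score equally among its own neighbours.
            inflow = sum(rank[h] / len(adjacency[h]) for h in adjacency[g])
            updated[g] = (1 - damping) * prior[g] + damping * inflow
        rank = updated
    return rank


# Toy network: B and C have the same low fold-change, but only B is
# linked to the strongly differentially expressed gene A.
adjacency = {"A": ["B"], "B": ["A"], "C": ["D"], "D": ["C"]}
fold_change = {"A": 3.0, "B": 0.2, "C": 0.2, "D": 0.2}

scores = network_smoothed_scores(adjacency, fold_change)
for gene in sorted(scores, key=scores.get, reverse=True):
    print(gene, round(scores[gene], 4))
```

In this toy case B ends up ranked above C purely because of its neighbour. How such network scores enter the SVM-RFE ranking criterion is defined in the paper and is not reproduced here.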
Affiliation(s)
- Marc Johannes
- German Cancer Research Center, Cancer Genome Research, Im Neuenheimer Feld 280, 69120 Heidelberg.
|
16
|
Xu JZ, Wong CW. Hunting for robust gene signature from cancer profiling data: sources of variability, different interpretations, and recent methodological developments. Cancer Lett 2010; 296:9-16. [PMID: 20579805 DOI: 10.1016/j.canlet.2010.05.008] [Citation(s) in RCA: 26] [Impact Index Per Article: 1.7] [Received: 02/08/2010] [Revised: 05/05/2010] [Accepted: 05/18/2010] [Indexed: 12/24/2022]
Abstract
Gene microarray is a powerful platform to investigate the expression patterns of thousands of genes simultaneously. One central objective of such analysis is to select sets of genes (i.e., gene signatures) which correlate with clinical characteristics, such as disease subtype diagnosis, response to drug treatment and prognosis. However, previous studies have found that mRNA signatures are highly unstable and strongly depend on the selection of patient samples. Based on five large microRNA profiling datasets, we empirically found that microRNA signatures are also generally unstable. Therefore, concerns arise regarding the reproducibility and clinical applicability of these derived gene signatures. Here, we first provide a brief review on the sources of variability and different interpretations of multiple distinct gene signatures. We then focus on those recent methodological progresses aimed at developing more stable gene signatures.
Affiliation(s)
- Jian-Zhen Xu
- College of Bioengineering, Henan University of Technology, Zhengzhou 450001, China.
|