251
|
Miller RA. Computer-assisted diagnostic decision support: history, challenges, and possible paths forward. Advances in Health Sciences Education: Theory and Practice 2009; 14 Suppl 1:89-106. [PMID: 19672686 DOI: 10.1007/s10459-009-9186-y] [Citation(s) in RCA: 26] [Impact Index Per Article: 1.6] [Received: 07/14/2009] [Accepted: 07/14/2009] [Indexed: 05/28/2023]
Abstract
This paper presents a brief history of computer-assisted diagnosis, including challenges and future directions. Some ideas presented in this article on computer-assisted diagnostic decision support systems (CDDSS) derive from prior work by the author and his colleagues (see list in Acknowledgments) on the INTERNIST-1 and QMR projects. References indicate the original sources of many of these ideas.
Affiliation(s)
- Randolph A Miller
- Department of Biomedical Informatics, Eskind Biomedical Library, Vanderbilt University Medical Center, Room B003C, Nashville, TN 37232-8340, USA.
|
252
|
The FAST-AIMS Clinical Mass Spectrometry Analysis System. Adv Bioinformatics 2009:598241. [PMID: 19956420 PMCID: PMC2775698 DOI: 10.1155/2009/598241] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/16/2009] [Accepted: 05/11/2009] [Indexed: 11/19/2022]
Abstract
Within clinical proteomics, mass spectrometry analysis of biological samples is emerging as an important high-throughput technology, capable of producing powerful diagnostic and prognostic models and identifying important disease biomarkers. As interest in this area grows, and the number of such proteomics datasets continues to increase, the need has developed for efficient, comprehensive, reproducible methods of mass spectrometry data analysis by both experts and nonexperts. We have designed and implemented a stand-alone software system, FAST-AIMS, which seeks to meet this need through automation of data preprocessing, feature selection, classification model generation, and performance estimation. FAST-AIMS is an efficient and user-friendly stand-alone software for predictive analysis of mass spectrometry data. The present resource review paper will describe the features and use of the FAST-AIMS system. The system is freely available for download for noncommercial use.
|
253
|
Combining dissimilarities in a Hyper Reproducing Kernel Hilbert Space for complex human cancer prediction. J Biomed Biotechnol 2009; 2009:906865. [PMID: 19584909 PMCID: PMC2699662 DOI: 10.1155/2009/906865] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.4] [Received: 01/16/2009] [Accepted: 03/24/2009] [Indexed: 11/18/2022]
Abstract
DNA microarrays provide rich profiles that are used in cancer prediction by considering the gene expression levels across a collection of related samples. Support Vector Machines (SVM) have been applied to the classification of cancer samples with encouraging results. However, they rely on Euclidean distances that fail to reflect accurately the proximities among sample profiles. Non-Euclidean dissimilarities therefore provide additional information that should be considered to reduce the misclassification errors. In this paper, we incorporate in the ν-SVM algorithm a linear combination of non-Euclidean dissimilarities. The weights of the combination are learnt in a Hyper Reproducing Kernel Hilbert Space (HRKHS) using a Semidefinite Programming algorithm. This approach allows us to incorporate a smoothing term that penalizes the complexity of the family of distances and avoids overfitting. The experimental results suggest that the proposed method helps to reduce the misclassification errors in several human cancer problems.
|
255
|
Anand A, Suganthan PN. Multiclass cancer classification by support vector machines with class-wise optimized genes and probability estimates. J Theor Biol 2009; 259:533-40. [PMID: 19406131 DOI: 10.1016/j.jtbi.2009.04.013] [Citation(s) in RCA: 27] [Impact Index Per Article: 1.7] [Received: 11/21/2008] [Revised: 02/11/2009] [Accepted: 04/20/2009] [Indexed: 11/15/2022]
Abstract
We investigate the multiclass classification of cancer microarray samples. In contrast to classification of two cancer types from gene expression data, multiclass classification of more than two cancer types is a relatively hard and less studied problem. We used class-wise optimized genes with the corresponding one-versus-all support vector machine (OVA-SVM) classifier to maximize the utilization of selected genes. Final prediction was made by using probability scores from all classifiers. We used three different methods of estimating probability from the decision value. Among the three probability methods, Platt's approach was more consistent, whereas the isotonic approach performed better for datasets with unequal proportions of samples in different classes. A probability-based decision not only gives a true and fair comparison between different one-versus-all (OVA) classifiers but also makes it possible to use them for any post analysis. Several ensemble experiments, an example of post analysis, of the three probability methods were implemented to study their effect in improving the classification accuracy. We observe that the ensembles did help in improving the predictive accuracy on cancer data sets, especially those involving unbalanced samples. A four-fold external stratified cross-validation experiment was performed on the six multiclass cancer datasets to obtain unbiased estimates of prediction accuracies. Analysis of class-wise frequently selected genes on two cancer datasets demonstrated that the approach was able to select important and relevant genes consistent with the literature. This study demonstrates successful implementation of the framework of class-wise feature selection and multiclass classification for prediction of cancer subtypes on six datasets.
Affiliation(s)
- Ashish Anand
- School of Electrical and Electronic Engineering, Nanyang Technological University, 50 Nanyang Avenue, S2-B2a-21, Singapore 639798, Singapore
|
256
|
Yukinawa N, Oba S, Kato K, Ishii S. Optimal aggregation of binary classifiers for multiclass cancer diagnosis using gene expression profiles. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2009; 6:333-343. [PMID: 19407356 DOI: 10.1109/tcbb.2007.70239] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Indexed: 05/27/2023]
Abstract
Multiclass classification is one of the fundamental tasks in bioinformatics and typically arises in cancer diagnosis studies by gene expression profiling. There have been many studies of aggregating binary classifiers to construct a multiclass classifier based on one-versus-the-rest (1R), one-versus-one (11), or other coding strategies, as well as some comparison studies between them. However, the studies found that the best coding depends on each situation. Therefore, a new problem, which we call the "optimal coding problem," has arisen: how can we determine which coding is the optimal one in each situation? To approach this optimal coding problem, we propose a novel framework for constructing a multiclass classifier, in which each binary classifier to be aggregated has a weight value to be optimally tuned based on the observed data. Although there is no a priori answer to the optimal coding problem, our weight tuning method can be a consistent answer to the problem. We apply this method to various classification problems including a synthesized data set and some cancer diagnosis data sets from gene expression profiling. The results demonstrate that, in most situations, our method can improve classification accuracy over simple voting heuristics and is better than or comparable to state-of-the-art multiclass predictors.
Affiliation(s)
- Naoto Yukinawa
- Graduate School of Information Sciences, Nara Institute of Science and Technology, Ikoma, Nara, Japan.
|
257
|
Paul TK, Iba H. Prediction of cancer class with majority voting genetic programming classifier using gene expression data. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2009; 6:353-367. [PMID: 19407358 DOI: 10.1109/tcbb.2007.70245] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.4] [Indexed: 05/27/2023]
Abstract
In order to get a better understanding of different types of cancers and to find possible biomarkers for diseases, many researchers have recently been analyzing gene expression data using various machine learning techniques. However, due to the very small number of training samples compared to the huge number of genes, and to class imbalance, most of these methods suffer from overfitting. In this paper, we present a majority voting genetic programming classifier (MVGPC) for the classification of microarray data. Instead of a single rule or a single set of rules, we evolve multiple rules with genetic programming (GP) and then apply those rules to test samples to determine their labels with a majority voting technique. By performing experiments on four different public cancer data sets, including multiclass data sets, we have found that the test accuracies of MVGPC are better than those of other methods, including AdaBoost with GP. Moreover, some of the more frequently occurring genes in the classification rules are known to be associated with the types of cancers being studied in this paper.
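The voting step described in this abstract can be illustrated with a minimal sketch. The three threshold rules below are hypothetical hand-written stand-ins for rules that the paper evolves with genetic programming; only the majority vote over rule outputs is shown, and the names `majority_vote` and `rules` are assumptions made for illustration.

```python
from collections import Counter

def majority_vote(rules, sample):
    """Apply every rule to the sample and return the most common label."""
    votes = [rule(sample) for rule in rules]
    # most_common(1) returns the label with the highest count;
    # ties resolve to the label that was encountered first.
    return Counter(votes).most_common(1)[0][0]

# Hypothetical rules over a toy expression vector (not evolved by GP).
rules = [
    lambda x: "tumor" if x[0] > 1.0 else "normal",
    lambda x: "tumor" if x[1] - x[2] > 0.5 else "normal",
    lambda x: "tumor" if x[3] < 0.2 else "normal",
]

label = majority_vote(rules, [1.4, 0.9, 0.1, 0.7])
print(label)
```

Here two of the three rules vote "tumor", so the ensemble label is "tumor" even though the third rule disagrees.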
Affiliation(s)
- Topon Kumar Paul
- System Engineering Laboratory, Corporate Research & Development Center, Toshiba Corporation, Kawasaki-shi, Kanagawa, Japan.
|
258
|
Chuang LY, Ke CH, Chang HW, Yang CH. A Two-Stage Feature Selection Method for Gene Expression Data. OMICS: A Journal of Integrative Biology 2009; 13:127-37. [DOI: 10.1089/omi.2008.0083] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.5] [Indexed: 11/13/2022]
Affiliation(s)
- Li-Yeh Chuang
- Institute of Biotechnology and Chemical Engineering, I-Shou University, Kaohsiung, Taiwan, Republic of China
- Chao-Hsuan Ke
- Department of Electronic Engineering, National Kaohsiung University of Applied Sciences, Kaohsiung, Taiwan, Republic of China
- Hsueh-Wei Chang
- Faculty of Biomedical Science and Environmental Biology, Kaohsiung Medical University, Taiwan, Republic of China
- Graduate Institute of Natural Products, College of Pharmacy, Kaohsiung Medical University, Kaohsiung, Taiwan, Republic of China
- Center of Excellence for Environmental Medicine, Kaohsiung Medical University, Kaohsiung, Taiwan, Republic of China
- Cheng-Hong Yang
- Department of Electronic Engineering, National Kaohsiung University of Applied Sciences, Kaohsiung, Taiwan, Republic of China
|
259
|
Aliferis CF, Statnikov A, Tsamardinos I, Schildcrout JS, Shepherd BE, Harrell FE. Factors influencing the statistical power of complex data analysis protocols for molecular signature development from microarray data. PLoS One 2009; 4:e4922. [PMID: 19290050 PMCID: PMC2654113 DOI: 10.1371/journal.pone.0004922] [Citation(s) in RCA: 15] [Impact Index Per Article: 0.9] [Received: 07/18/2008] [Accepted: 02/05/2009] [Indexed: 12/18/2022]
Abstract
BACKGROUND Critical to the development of molecular signatures from microarray and other high-throughput data is testing the statistical significance of the produced signature in order to ensure its statistical reproducibility. While current best practices emphasize sufficiently powered univariate tests of differential expression, little is known about the factors that affect the statistical power of complex multivariate analysis protocols for high-dimensional molecular signature development. METHODOLOGY/PRINCIPAL FINDINGS We show that choices of specific components of the analysis (i.e., error metric, classifier, error estimator and event balancing) have large and compounding effects on statistical power. The effects are demonstrated empirically by an analysis of 7 of the largest microarray cancer outcome prediction datasets and supplementary simulations, and by contrasting them to prior analyses of the same data. CONCLUSIONS/SIGNIFICANCE The findings of the present study have two important practical implications. First, high-throughput studies, by avoiding under-powered data analysis protocols, can achieve substantial economies in the sample required to demonstrate statistical significance of predictive signal. Factors that affect power are identified and studied. Much less sample than previously thought may be sufficient for exploratory studies as long as these factors are taken into consideration when designing and executing the analysis. Second, previous highly-cited claims that microarray assays may not be able to predict disease outcomes better than chance are shown by our experiments to be due to under-powered data analysis combined with inappropriate statistical tests.
Affiliation(s)
- Constantin F Aliferis
- Center of Health Informatics and Bioinformatics, New York University, New York, New York, United States of America.
|
260
|
Sparse representation for classification of tumors using gene expression data. J Biomed Biotechnol 2009; 2009:403689. [PMID: 19300522 PMCID: PMC2655631 DOI: 10.1155/2009/403689] [Citation(s) in RCA: 53] [Impact Index Per Article: 3.3] [Received: 11/24/2008] [Accepted: 01/12/2009] [Indexed: 11/17/2022]
Abstract
Personalized drug design requires the classification of cancer patients to be as accurate as possible. With advances in genome sequencing and microarray technology, a large amount of gene expression data has been and will continue to be produced from various cancer patients. Such cancer-alerted gene expression data allows us to classify tumors at the genome-wide level. However, cancer-alerted gene expression datasets typically have many more genes (features) than samples (patients), which poses a challenge for the classification of tumors. In this paper, a new method is proposed for cancer diagnosis using gene expression data by casting the classification problem as finding sparse representations of test samples with respect to training samples. The sparse representation is computed by the ℓ1-regularized least squares method. To investigate its performance, the proposed method is applied to six tumor gene expression datasets and compared with various support vector machine (SVM) methods. The experimental results have shown that the performance of the proposed method is comparable with or better than that of SVMs. In addition, the proposed method is more efficient than SVMs as it has no need of model selection.
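As a rough illustration of the idea in this abstract, the sketch below represents a test sample as a sparse combination of training samples and assigns it to the class whose training columns reconstruct it best. Several parts are assumptions rather than details taken from the abstract: the ℓ1-regularized least squares problem is solved with plain iterative soft thresholding (ISTA), the decision rule is the smallest class-wise residual, and the toy data, `lam`, and the function names are hypothetical.

```python
import numpy as np

def ista_lasso(A, y, lam=0.01, n_iter=500):
    """Minimize 0.5*||A x - y||^2 + lam*||x||_1 by iterative soft thresholding."""
    step = 1.0 / np.linalg.norm(A, 2) ** 2  # 1 / Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        z = x - step * (A.T @ (A @ x - y))                        # gradient step
        x = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft threshold
    return x

def src_predict(A, labels, y, lam=0.01):
    """Return the class whose training columns best reconstruct y."""
    x = ista_lasso(A, y, lam)
    residuals = {}
    for c in np.unique(labels):
        mask = labels == c
        # Reconstruct y using only the coefficients that belong to class c.
        residuals[int(c)] = np.linalg.norm(y - A[:, mask] @ x[mask])
    return min(residuals, key=residuals.get)

# Toy data: rows are genes, columns are training samples (two per class).
A = np.array([[1.0, 0.9, 0.0, 0.1],
              [0.9, 1.0, 0.1, 0.0],
              [0.0, 0.1, 1.0, 0.9],
              [0.1, 0.0, 0.9, 1.0]])
labels = np.array([0, 0, 1, 1])
y = np.array([1.0, 1.0, 0.05, 0.05])  # resembles the class-0 columns

pred = src_predict(A, labels, y)
print(pred)
```

The only tunable quantity in this sketch is the regularization weight `lam`; no kernel or classifier hyperparameter search is involved.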
|
261
|
van Wieringen WN, Kun D, Hampel R, Boulesteix AL. Survival prediction using gene expression data: A review and comparison. Comput Stat Data Anal 2009. [DOI: 10.1016/j.csda.2008.05.021] [Citation(s) in RCA: 30] [Impact Index Per Article: 1.9] [Indexed: 10/22/2022]
|
262
|
Zervakis M, Blazadonakis ME, Tsiliki G, Danilatou V, Tsiknakis M, Kafetzopoulos D. Outcome prediction based on microarray analysis: a critical perspective on methods. BMC Bioinformatics 2009; 10:53. [PMID: 19200394 PMCID: PMC2667512 DOI: 10.1186/1471-2105-10-53] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.6] [Received: 08/04/2008] [Accepted: 02/07/2009] [Indexed: 11/26/2022]
Abstract
Background Information extraction from microarrays has not yet been widely used in diagnostic or prognostic decision-support systems, due to the diversity of results produced by the available techniques, their instability on different data sets and the inability to relate statistical significance with biological relevance. Thus, there is an urgent need to address the statistical framework of microarray analysis and identify its drawbacks and limitations, which will enable us to thoroughly compare methodologies under the same experimental set-up and associate results with confidence intervals meaningful to clinicians. In this study we consider gene-selection algorithms with the aim to reveal inefficiencies in performance evaluation and address aspects that can reduce uncertainty in algorithmic validation. Results A computational study is performed related to the performance of several gene selection methodologies on publicly available microarray data. Three basic types of experimental scenarios are evaluated, i.e. the independent test-set and the 10-fold cross-validation (CV) using maximum and average performance measures. Feature selection methods behave differently under different validation strategies. The performance results from CV do not match well those from the independent test-set, except for the support vector machines (SVM) and the least squares SVM methods. However, these wrapper methods achieve variable (often low) performance, whereas the hybrid methods attain consistently higher accuracies. The use of an independent test-set within CV is important for the evaluation of the predictive power of algorithms. The optimal size of the selected gene-set also appears to be dependent on the evaluation scheme. The consistency of selected genes over variation of the training-set is another aspect important in reducing uncertainty in the evaluation of the derived gene signature. In all cases the presence of outlier samples can seriously affect algorithmic performance. Conclusion Multiple parameters can influence the selection of a gene-signature and its predictive power, thus possible biases in validation methods must always be accounted for. This paper illustrates that independent test-set evaluation reduces the bias of CV, and case-specific measures reveal stability characteristics of the gene-signature over changes of the training set. Moreover, frequency measures on gene selection address the algorithmic consistency in selecting the same gene signature under different training conditions. These issues contribute to the development of an objective evaluation framework and aid the derivation of statistically consistent gene signatures that could eventually be correlated with biological relevance. The benefits of the proposed framework are supported by the evaluation results and methodological comparisons performed for several gene-selection algorithms on three publicly available datasets.
Affiliation(s)
- Michalis Zervakis
- Technical University of Crete, Department of Electronic and Computer Engineering, University Campus, Chania, Crete, Greece.
|
263
|
Baralis E, Bruno G, Fiori A. Minimum number of genes for microarray feature selection. Annual International Conference of the IEEE Engineering in Medicine and Biology Society. IEEE Engineering in Medicine and Biology Society. Annual International Conference 2009; 2008:5692-5. [PMID: 19164009 DOI: 10.1109/iembs.2008.4650506] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/06/2022]
Abstract
A fundamental problem in microarray analysis is to identify relevant genes from large amounts of expression data. Feature selection aims at identifying a subset of features for building robust learning models. However, finding the optimal number of features is a challenging problem, as it is a trade-off between information loss when pruning is excessive and noise increase when pruning is too weak. This paper presents a novel representation of genes as strings of bits and a method which automatically selects the minimum number of genes needed to reach a good classification accuracy on the training set. Our method first eliminates redundant features, which do not add further information for classification, and then exploits a set covering algorithm. Preliminary experimental results on public datasets confirm the intuition behind the proposed method, which leads to high classification accuracy.
|
266
|
The Impact of Gene Selection on Imbalanced Microarray Expression Data. Bioinformatics and Computational Biology 2009. [DOI: 10.1007/978-3-642-00727-9_25] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.7] [Indexed: 02/12/2023]
|
267
|
de Souto MCP, Costa IG, de Araujo DSA, Ludermir TB, Schliep A. Clustering cancer gene expression data: a comparative study. BMC Bioinformatics 2008; 9:497. [PMID: 19038021 PMCID: PMC2632677 DOI: 10.1186/1471-2105-9-497] [Citation(s) in RCA: 174] [Impact Index Per Article: 10.2] [Received: 07/07/2008] [Accepted: 11/27/2008] [Indexed: 11/28/2022]
Abstract
Background The use of clustering methods for the discovery of cancer subtypes has drawn a great deal of attention in the scientific community. While bioinformaticians have proposed new clustering methods that take advantage of characteristics of the gene expression data, the medical community has a preference for using "classic" clustering methods. There have been no studies thus far performing a large-scale evaluation of different clustering methods in this context. Results/Conclusion We present the first large-scale analysis of seven different clustering methods and four proximity measures for the analysis of 35 cancer gene expression data sets. Our results reveal that the finite mixture of Gaussians, followed closely by k-means, exhibited the best performance in terms of recovering the true structure of the data sets. These methods also exhibited, on average, the smallest difference between the actual number of classes in the data sets and the best number of clusters as indicated by our validation criteria. Furthermore, hierarchical methods, which have been widely used by the medical community, exhibited a poorer recovery performance than that of the other methods evaluated. Moreover, as a stable basis for the assessment and comparison of different clustering methods for cancer gene expression data, this study provides a common group of data sets (benchmark data sets) to be shared among researchers and used for comparisons with new methods. The data sets analyzed in this study are available at .
Affiliation(s)
- Marcilio C P de Souto
- Computational Molecular Biology, Max Planck Institute for Molecular Genetics, Berlin, Germany.
|
268
|
Tenenbaum JD, Walker MG, Utz PJ, Butte AJ. Expression-based Pathway Signature Analysis (EPSA): mining publicly available microarray data for insight into human disease. BMC Med Genomics 2008; 1:51. [PMID: 18937865 PMCID: PMC2588448 DOI: 10.1186/1755-8794-1-51] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.6] [Received: 04/18/2008] [Accepted: 10/20/2008] [Indexed: 12/31/2022]
Abstract
BACKGROUND Publicly available data repositories facilitate the sharing of an ever-increasing amount of microarray data. However, these datasets remain highly underutilized. Reutilizing the data could offer insights into questions and diseases entirely distinct from those considered in the original experimental design. METHODS We first analyzed microarray datasets derived from known perturbations of specific pathways using the samr package in R to identify specific patterns of change in gene expression. We refer to these patterns of gene expression alteration as "pathway signatures." We then used Spearman's rank correlation coefficient, a non-parametric measure of correlation, to determine similarities between pathway signatures and disease profiles, and permutation analysis to evaluate false discovery rate. This enabled detection of statistically significant similarity between these pathway signatures and corresponding changes observed in human disease. Finally, we evaluated pathway activation, as indicated by correlation with the pathway signature, as a risk factor for poor prognosis using multiple unrelated, publicly available datasets. RESULTS We have developed a novel method, Expression-based Pathway Signature Analysis (EPSA). We demonstrate that EPSA is a rigorous computational approach for statistically evaluating the degree of similarity between highly disparate sources of microarray expression data. We also show how EPSA can be used in a number of cases to stratify patients with differential disease prognosis. EPSA can be applied to many different types of datasets in spite of different platforms, different experimental designs, and different species. Applying this method can yield new insights into human disease progression. CONCLUSION EPSA enables the use of publicly available data for an entirely new, translational purpose to enable the identification of potential pathways of dysregulation in human disease, as well as potential leads for therapeutic molecular targets.
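The correlation step in the METHODS above can be sketched as follows. This is a simplified illustration under stated assumptions, not the EPSA implementation: it computes Spearman's coefficient as the Pearson correlation of ranks (assuming no tied values) and an empirical permutation p-value for a single signature/profile pair, rather than a false discovery rate over many comparisons. The vectors `signature` and `disease` are made-up toy values, and the function names are hypothetical.

```python
import numpy as np

def spearman_rho(a, b):
    """Spearman correlation as the Pearson correlation of ranks (assumes no ties)."""
    ra = np.argsort(np.argsort(a))  # rank of each element, 0-based
    rb = np.argsort(np.argsort(b))
    return float(np.corrcoef(ra, rb)[0, 1])

def permutation_pvalue(a, b, n_perm=2000, seed=0):
    """Fraction of shuffled profiles whose |rho| is at least the observed |rho|."""
    rng = np.random.default_rng(seed)
    observed = abs(spearman_rho(a, b))
    hits = sum(
        abs(spearman_rho(a, rng.permutation(b))) >= observed
        for _ in range(n_perm)
    )
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)

# Toy per-gene changes: a pathway signature and a disease profile.
signature = np.array([2.1, -1.3, 0.4, 1.7, -0.8, 0.9, -2.0, 1.1])
disease = np.array([1.9, -0.9, 0.2, 1.2, -1.1, 0.8, -1.6, 1.5])

rho = spearman_rho(signature, disease)
p_value = permutation_pvalue(signature, disease)
print(rho, p_value)
```

Because only ranks are compared, the two vectors do not need to share a measurement scale, which fits the abstract's point about comparing data from different platforms.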
Affiliation(s)
- Jessica D Tenenbaum
- Stanford Medical Informatics, 251 Campus Drive MSOB x215, Stanford, CA 94305, USA.
|
269
|
Slawski M, Daumer M, Boulesteix AL. CMA: a comprehensive Bioconductor package for supervised classification with high dimensional data. BMC Bioinformatics 2008; 9:439. [PMID: 18925941 PMCID: PMC2646186 DOI: 10.1186/1471-2105-9-439] [Citation(s) in RCA: 78] [Impact Index Per Article: 4.6] [Received: 06/03/2008] [Accepted: 10/16/2008] [Indexed: 12/22/2022]
Abstract
BACKGROUND For the last eight years, microarray-based classification has been a major topic in statistics, bioinformatics and biomedicine research. Traditional methods often yield unsatisfactory results or may even be inapplicable in the so-called "p >> n" setting where the number of predictors p by far exceeds the number of observations n, hence the term "ill-posed-problem". Careful model selection and evaluation satisfying accepted good-practice standards is a very complex task for statisticians without experience in this area or for scientists with limited statistical background. The multiplicity of available methods for class prediction based on high-dimensional data is an additional practical challenge for inexperienced researchers. RESULTS In this article, we introduce a new Bioconductor package called CMA (standing for "Classification for MicroArrays") for automatically performing variable selection, parameter tuning, classifier construction, and unbiased evaluation of the constructed classifiers using a large number of usual methods. Without much time and effort, users are provided with an overview of the unbiased accuracy of most top-performing classifiers. Furthermore, the standardized evaluation framework underlying CMA can also be beneficial in statistical research for comparison purposes, for instance if a new classifier has to be compared to existing approaches. CONCLUSION CMA is a user-friendly comprehensive package for classifier construction and evaluation implementing most usual approaches. It is freely available from the Bioconductor website at (http://bioconductor.org/packages/2.3/bioc/html/CMA.html).
Affiliation(s)
- M Slawski
- Sylvia Lawry Centre for Multiple Sclerosis Research, Hohenlindenerstr. 1, D-81677 Munich, Germany
- M Daumer
- Sylvia Lawry Centre for Multiple Sclerosis Research, Hohenlindenerstr. 1, D-81677 Munich, Germany
- A-L Boulesteix
- Sylvia Lawry Centre for Multiple Sclerosis Research, Hohenlindenerstr. 1, D-81677 Munich, Germany
- Department of Statistics, University of Munich, Ludwigstr. 33, D-80539 Munich, Germany
|
270
|
Tsai YS, Lin CT, Tseng GC, Chung IF, Pal NR. Discovery of dominant and dormant genes from expression data using a novel generalization of SNR for multi-class problems. BMC Bioinformatics 2008; 9:425. [PMID: 18842155 PMCID: PMC2620271 DOI: 10.1186/1471-2105-9-425] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.8] [Received: 01/16/2008] [Accepted: 10/09/2008] [Indexed: 12/14/2022]
Abstract
Background The Signal-to-Noise-Ratio (SNR) is often used for identification of biomarkers for two-class problems and no formal and useful generalization of SNR is available for multiclass problems. We propose innovative generalizations of SNR for multiclass cancer discrimination through introduction of two indices, Gene Dominant Index and Gene Dormant Index (GDIs). These two indices lead to the concepts of dominant and dormant genes with biological significance. We use these indices to develop methodologies for discovery of dominant and dormant biomarkers with interesting biological significance. The dominancy and dormancy of the identified biomarkers and their excellent discriminating power are also demonstrated pictorially using the scatterplot of individual gene and 2-D Sammon's projection of the selected set of genes. Using information from the literature we have shown that the GDI based method can identify dominant and dormant genes that play significant roles in cancer biology. These biomarkers are also used to design diagnostic prediction systems. Results and discussion To evaluate the effectiveness of the GDIs, we have used four multiclass cancer data sets (Small Round Blue Cell Tumors, Leukemia, Central Nervous System Tumors, and Lung Cancer). For each data set we demonstrate that the new indices can find biologically meaningful genes that can act as biomarkers. We then use six machine learning tools, Nearest Neighbor Classifier (NNC), Nearest Mean Classifier (NMC), Support Vector Machine (SVM) classifier with linear kernel, and SVM classifier with Gaussian kernel, where both SVMs are used in conjunction with one-vs-all (OVA) and one-vs-one (OVO) strategies. We found GDIs to be very effective in identifying biomarkers with strong class specific signatures. 
With all six tools and for all data sets we could achieve better or comparable prediction accuracies usually with fewer marker genes than results reported in the literature using the same computational protocols. The dominant genes are usually easy to find while good dormant genes may not always be available as dormant genes require stronger constraints to be satisfied; but when they are available, they can be used for authentication of diagnosis. Conclusion Since GDI based schemes can find a small set of dominant/dormant biomarkers that is adequate to design diagnostic prediction systems, it opens up the possibility of using real-time qPCR assays or antibody based methods such as ELISA for an easy and low cost diagnosis of diseases. The dominant and dormant genes found by GDIs can be used in different ways to design more reliable diagnostic prediction systems.
Affiliation(s)
- Yu-Shuen Tsai
- Institute of Biomedical Informatics, National Yang-Ming University, Taipei, Taiwan.
|
271
|
Hong JH, Cho SB. A probabilistic multi-class strategy of one-vs.-rest support vector machines for cancer classification. Neurocomputing 2008. [DOI: 10.1016/j.neucom.2008.04.033] [Citation(s) in RCA: 45] [Impact Index Per Article: 2.6] [Indexed: 11/26/2022]
|
272
|
Armañanzas R, Inza I, Larrañaga P. Detecting reliable gene interactions by a hierarchy of Bayesian network classifiers. COMPUTER METHODS AND PROGRAMS IN BIOMEDICINE 2008; 91:110-121. [PMID: 18433926 DOI: 10.1016/j.cmpb.2008.02.010] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.6] [Received: 09/05/2007] [Revised: 02/08/2008] [Accepted: 02/28/2008] [Indexed: 05/26/2023]
Abstract
The main purpose of a gene interaction network is to map the relationships of the genes that are out of sight when a genomic study is tackled. DNA microarrays allow the expression of thousands of genes to be measured at the same time. These data constitute the numeric seed for the induction of the gene networks. In this paper, we propose a new approach to build gene networks by means of Bayesian classifiers, variable selection and bootstrap resampling. The interactions induced by the Bayesian classifiers are based both on the expression levels and on the phenotype information of the supervised variable. Feature selection and bootstrap resampling add reliability and robustness to the overall process by removing false positive findings. The consensus among all the induced models produces a hierarchy of dependences and, thus, of variables. Biologists can define the depth level of the model hierarchy so the set of interactions and genes involved can vary from a sparse to a dense set. Experimental results show that these networks perform well on classification tasks. The biological validation matches previous biological findings and opens new hypotheses for future studies.
Affiliation(s)
- Rubén Armañanzas
- Department of Computer Science and Artificial Intelligence, University of the Basque Country, Paseo Manuel Lardizabal 1, 20018 Donostia-San Sebastián, Gipuzkoa, Spain.
|
273
|
Blazadonakis ME, Zervakis M. Wrapper filtering criteria via linear neuron and kernel approaches. Comput Biol Med 2008; 38:894-912. [PMID: 18656182 DOI: 10.1016/j.compbiomed.2008.05.005] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.4] [Received: 07/02/2007] [Revised: 03/08/2008] [Accepted: 05/16/2008] [Indexed: 10/21/2022]
Abstract
OBJECTIVE The problem of marker selection in DNA microarray analysis has been addressed so far by two basic types of approaches, the so-called filter and wrapper methods. Wrapper methods operate in a recursive fashion where feature (gene) weights are re-evaluated and change dynamically from iteration to iteration, while in filter methods feature weights remain fixed. Our objective in this study is to show that the application of filter criteria in a recursive fashion, where weights are potentially adjusted from cycle to cycle, produces noticeable improvement on the generalization performance measured on independent test sets.
METHODS AND MATERIALS Toward this direction we explore the behavior of two well known and broadly accepted pattern recognition approaches, namely the support vector machines (SVM) and a single linear neuron (LN), properly adapted to the problem of marker selection. Within this context we also show how the kernel ability of SVM could be employed in a practical manner to provide alternative ways to approach the problem of reliable marker selection.
RESULTS We explore how the proposed approaches behave in two application domains (breast cancer and leukemia), achieving comparable or even better results than those reported in the related bibliography. An important advantage of these approaches is their ability to derive stable performance without deteriorating due to the complexity of the application domain. Validation is performed using internal leave one out (ILOO) and 10-fold cross validation as well as independent test set evaluation.
CONCLUSIONS Results show that the proposed methodologies achieve remarkable performance and indicate that applying filter criteria in a wrapper fashion ('wrapper filtering criteria') provides a useful tool for marker selection. The contribution of this study is threefold. First, it provides a methodology to apply filter criteria in a wrapper way (which is a new approach); second, it introduces a fundamental pattern recognition component, namely the single neuron (which is a linear estimator), and explores its behavior on marker selection; and third, it demonstrates an approach to exploit the kernel ability of SVMs in a practical and effective manner.
Affiliation(s)
- Michalis E Blazadonakis
- Department of Electronic and Computer Engineering, Technical University of Crete, University Campus, Chania Crete 73100, Greece.
|
274
|
Statnikov A, Wang L, Aliferis CF. A comprehensive comparison of random forests and support vector machines for microarray-based cancer classification. BMC Bioinformatics 2008; 9:319. [PMID: 18647401 PMCID: PMC2492881 DOI: 10.1186/1471-2105-9-319] [Citation(s) in RCA: 298] [Impact Index Per Article: 17.5] [Received: 01/24/2008] [Accepted: 07/22/2008] [Indexed: 12/17/2022] Open
Abstract
Background Cancer diagnosis and clinical outcome prediction are among the most important emerging applications of gene expression microarray technology with several molecular signatures on their way toward clinical deployment. Use of the most accurate classification algorithms available for microarray gene expression data is a critical ingredient in order to develop the best possible molecular signatures for patient care. As suggested by a large body of literature to date, support vector machines can be considered "best of class" algorithms for classification of such data. Recent work, however, suggests that random forest classifiers may outperform support vector machines in this domain.
Results In the present paper we identify methodological biases of prior work comparing random forests and support vector machines and conduct a new rigorous evaluation of the two algorithms that corrects these limitations. Our experiments use 22 diagnostic and prognostic datasets and show that support vector machines outperform random forests, often by a large margin. Our data also underlines the importance of sound research design in benchmarking and comparison of bioinformatics algorithms.
Conclusion We found that both on average and in the majority of microarray datasets, random forests are outperformed by support vector machines both in the settings when no gene selection is performed and when several popular gene selection methods are used.
Affiliation(s)
- Alexander Statnikov
- Department of Biomedical Informatics, Vanderbilt University, Nashville, TN, USA.
|
275
|
Liu W, Yuan K, Ye D. On alpha-divergence based nonnegative matrix factorization for clustering cancer gene expression data. Artif Intell Med 2008; 44:1-5. [PMID: 18602254 DOI: 10.1016/j.artmed.2008.05.001] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.2] [Received: 08/03/2007] [Revised: 05/13/2008] [Accepted: 05/13/2008] [Indexed: 10/21/2022]
Abstract
OBJECTIVE Nonnegative matrix factorization (NMF) has been proven to be a powerful clustering method. Recently Cichocki and coauthors have proposed a family of new algorithms based on the alpha-divergence for NMF. However, choosing an optimal alpha remains an open problem.
METHODS AND MATERIALS In this paper, we tested this NMF variant with different alpha values on 11 cancer gene expression datasets in order to select an optimal alpha experimentally.
RESULTS AND CONCLUSION Our experimental results show that alpha=1 and 2 are two special optimal cases for real applications.
Affiliation(s)
- Weixiang Liu
- Research Center of Biomedical Engineering, Life Science Division, Graduate school at Shenzhen, Tsinghua University, Shenzhen 518055, China.
|
276
|
Boulesteix AL, Porzelius C, Daumer M. Microarray-based classification and clinical predictors: on combined classifiers and additional predictive value. Bioinformatics 2008; 24:1698-706. [PMID: 18544547 DOI: 10.1093/bioinformatics/btn262] [Citation(s) in RCA: 64] [Impact Index Per Article: 3.8] [Indexed: 11/13/2022] Open
|
277
|
Yang CS, Chuang LY, Ke CH, Yang CH. A Combination of Shuffled Frog-Leaping Algorithm and Genetic Algorithm for Gene Selection. JOURNAL OF ADVANCED COMPUTATIONAL INTELLIGENCE AND INTELLIGENT INFORMATICS 2008. [DOI: 10.20965/jaciii.2008.p0218] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Indexed: 11/09/2022]
Abstract
Microarray data on gene expression profiles provide valuable answers to a variety of problems and contribute to advances in clinical medicine. The application of microarray data to the classification of cancer types has recently assumed increasing importance. The classification of microarray data samples involves feature selection, whose goal is to identify subsets of differentially expressed genes potentially relevant for distinguishing sample classes, and classifier design. We propose an efficient evolutionary approach for selecting gene subsets from gene expression data that effectively achieves higher accuracy for classification problems. Our proposal combines a shuffled frog-leaping algorithm (SFLA) and a genetic algorithm (GA), and chooses genes (features) related to classification. The K-nearest neighbor (KNN) method with leave-one-out cross validation (LOOCV) is used to evaluate classification accuracy. We apply a novel hybrid approach based on SFLA-GA and KNN classification and compare 11 classification problems from the literature. Experimental results show that classification accuracy obtained using selected features was higher than the accuracy of datasets without feature selection.
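The evaluation step named in this abstract (KNN accuracy under leave-one-out cross validation for a candidate gene subset) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `loocv_knn_accuracy`, the Euclidean distance, the tie-breaking rule and the toy data are all assumptions, and the SFLA-GA search that would propose the subsets is not shown.

```python
import math

def loocv_knn_accuracy(samples, labels, feature_idx, k=1):
    """Leave-one-out accuracy of a k-NN classifier restricted to feature_idx.

    Hypothetical helper for illustration; Euclidean distance is assumed.
    """
    def dist(a, b):
        return math.sqrt(sum((a[j] - b[j]) ** 2 for j in feature_idx))

    correct = 0
    for i, (x, y) in enumerate(zip(samples, labels)):
        # Hold out sample i and rank every other sample by distance to it.
        neighbours = sorted(
            (dist(x, samples[t]), labels[t])
            for t in range(len(samples)) if t != i
        )[:k]
        # Majority vote among the k nearest; ties go to the smallest label.
        votes = {}
        for _, lab in neighbours:
            votes[lab] = votes.get(lab, 0) + 1
        predicted = max(sorted(votes), key=votes.get)
        if predicted == y:
            correct += 1
    return correct / len(samples)

# Made-up toy data: feature 0 separates the classes, feature 1 is noise.
X = [[0.1, 5.0], [0.2, 1.0], [0.0, 3.0], [1.0, 5.1], [1.1, 1.1], [0.9, 3.1]]
y = ["A", "A", "A", "B", "B", "B"]
acc_all = loocv_knn_accuracy(X, y, feature_idx=[0, 1])  # all features
acc_sel = loocv_knn_accuracy(X, y, feature_idx=[0])     # selected subset
```

In a search such as the one the abstract describes, a score of this kind would serve as the fitness of each candidate subset. On the toy data the selected subset scores higher than the full feature set because the noisy feature dominates the distances.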
|
278
|
Judson R, Elloumi F, Setzer RW, Li Z, Shah I. A comparison of machine learning algorithms for chemical toxicity classification using a simulated multi-scale data model. BMC Bioinformatics 2008; 9:241. [PMID: 18489778 PMCID: PMC2409339 DOI: 10.1186/1471-2105-9-241] [Citation(s) in RCA: 52] [Impact Index Per Article: 3.1] [Received: 01/22/2008] [Accepted: 05/19/2008] [Indexed: 01/29/2023] Open
Abstract
BACKGROUND Bioactivity profiling using high-throughput in vitro assays can reduce the cost and time required for toxicological screening of environmental chemicals and can also reduce the need for animal testing. Several public efforts are aimed at discovering patterns or classifiers in high-dimensional bioactivity space that predict tissue, organ or whole animal toxicological endpoints. Supervised machine learning is a powerful approach to discover combinatorial relationships in complex in vitro/in vivo datasets. We present a novel model to simulate complex chemical-toxicology data sets and use this model to evaluate the relative performance of different machine learning (ML) methods.
RESULTS The classification performance of Artificial Neural Networks (ANN), K-Nearest Neighbors (KNN), Linear Discriminant Analysis (LDA), Naïve Bayes (NB), Recursive Partitioning and Regression Trees (RPART), and Support Vector Machines (SVM) in the presence and absence of filter-based feature selection was analyzed using K-way cross-validation testing and independent validation on simulated in vitro assay data sets with varying levels of model complexity, number of irrelevant features and measurement noise. While the prediction accuracy of all ML methods decreased as non-causal (irrelevant) features were added, some ML methods performed better than others. In the limit of using a large number of features, ANN and SVM were always in the top performing set of methods while RPART and KNN (k = 5) were always in the poorest performing set. The addition of measurement noise and irrelevant features decreased the classification accuracy of all ML methods, with LDA suffering the greatest performance degradation. LDA performance is especially sensitive to the use of feature selection. Filter-based feature selection generally improved performance, most strikingly for LDA.
CONCLUSION We have developed a novel simulation model to evaluate machine learning methods for the analysis of data sets in which in vitro bioassay data is being used to predict in vivo chemical toxicology. From our analysis, we can recommend that several ML methods, most notably SVM and ANN, are good candidates for use in real world applications in this area.
Affiliation(s)
- Richard Judson
- National Center for Computational Toxicology, Office of Research and Development, US Environmental Protection Agency, Research Triangle Park, North Carolina 27711, USA.
|
279
|
Apiletti D, Baralis E, Bruno G, Fiori A. The painter's feature selection for gene expression data. ACTA ACUST UNITED AC 2008; 2007:4227-30. [PMID: 18002935 DOI: 10.1109/iembs.2007.4353269] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 11/07/2022]
Abstract
Feature selection is a fundamental task in microarray data analysis. It aims at identifying the genes which are mostly associated with a tissue category, disease state or clinical outcome. An effective feature selection reduces computation costs and increases classification accuracy. This paper presents a novel multi-class approach to feature selection for gene expression data, which is called Painter's approach. It has the benefits of both a parameter free technique and a native multicategory method. It consists of two phases. The first is a filtering phase that smooths the effect of noise and outliers, which represent a common problem in microarray data. In the second phase, the actual gene selection is performed. Preliminary experimental results on three public datasets are presented. They confirm the intuition of the proposed approach leading to high classification accuracies.
|
280
|
Parikh AA, Johnson JC, Merchant NB. Genomics and Proteomics in Predicting Cancer Outcomes. Surg Oncol Clin N Am 2008; 17:257-77, vii. [DOI: 10.1016/j.soc.2007.12.001] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 10/22/2022]
|
281
|
Boulesteix AL, Kondylis A, Krämer N. Comments on: Augmenting the bootstrap to analyze high dimensional genomic data. TEST-SPAIN 2008. [DOI: 10.1007/s11749-008-0103-0] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 11/24/2022]
|
282
|
Grant GR, Manduchi E, Stoeckert CJ. Analysis and management of microarray gene expression data. ACTA ACUST UNITED AC 2008; Chapter 19:Unit 19.6. [PMID: 18265395 DOI: 10.1002/0471142727.mb1906s77] [Citation(s) in RCA: 28] [Impact Index Per Article: 1.6] [Indexed: 11/11/2022]
Abstract
Microarray experiments require careful planning and choice of analysis tools in order to get the most out of the data generated, especially considering the associated significant cost and effort. Microarray experiments also require careful documentation, often residing in local databases and/or submitted to public repositories. An often bewildering assortment of choices is available for experimental design, data preprocessing, data analysis (e.g., differential gene expression, classification), and data management. This unit covers the basic steps and common applications for planning, data processing, and data management of microarray experiments, and provides guidance to making choices based on the goals and practical realities of the experiment, as well as the authors' experience in this area.
Affiliation(s)
- Gregory R Grant
- University of Pennsylvania School of Medicine, Philadelphia, Pennsylvania, USA
|
283
|
Boulesteix AL, Strobl C, Augustin T, Daumer M. Evaluating microarray-based classifiers: an overview. Cancer Inform 2008; 6:77-97. [PMID: 19259405 PMCID: PMC2623308 DOI: 10.4137/cin.s408] [Citation(s) in RCA: 58] [Impact Index Per Article: 3.4] [Indexed: 01/23/2023] Open
Abstract
For the last eight years, microarray-based class prediction has been the subject of numerous publications in medicine, bioinformatics and statistics journals. However, in many articles, the assessment of classification accuracy is carried out using suboptimal procedures and is not paid much attention. In this paper, we carefully review various statistical aspects of classifier evaluation and validation from a practical point of view. The main topics addressed are accuracy measures, error rate estimation procedures, variable selection, choice of classifiers and validation strategy.
Affiliation(s)
- A-L Boulesteix
- Sylvia Lawry Centre for MS Research (SLC), Hohenlindenerstr. 1, Munich, Germany
|
284
|
Abstract
This review provides a focused summary of the implications of high-dimensional data spaces produced by gene expression microarrays for building better models of cancer diagnosis, prognosis, and therapeutics. We identify the unique challenges posed by high dimensionality to highlight methodological problems and discuss recent methods in predictive classification, unsupervised subclass discovery, and marker identification.
|
285
|
Chuang LY, Chang HW, Tu CJ, Yang CH. Improved binary PSO for feature selection using gene expression data. Comput Biol Chem 2008; 32:29-37. [DOI: 10.1016/j.compbiolchem.2007.09.005] [Citation(s) in RCA: 381] [Impact Index Per Article: 22.4] [Received: 12/24/2006] [Accepted: 09/10/2007] [Indexed: 11/27/2022]
|
286
|
Sanden SV, Lin D, Burzykowski T. Performance of Gene Selection and Classification Methods in a Microarray Setting: A Simulation Study. COMMUN STAT-SIMUL C 2008. [DOI: 10.1080/03610910701792554] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 10/22/2022]
|
287
|
Automated Discrimination of Pathological Regions in Tissue Images: Unsupervised Clustering vs. Supervised SVM Classification. ACTA ACUST UNITED AC 2008. [DOI: 10.1007/978-3-540-92219-3_26] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1]
|
288
|
Computational Intelligence Algorithms and DNA Microarrays. ACTA ACUST UNITED AC 2008. [DOI: 10.1007/978-3-540-76803-6_1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0]
|
289
|
Identification of Tumor Evolution Patterns by Means of Inductive Logic Programming. GENOMICS, PROTEOMICS & BIOINFORMATICS 2008; 6:91-7. [PMID: 18973865 PMCID: PMC5054107 DOI: 10.1016/s1672-0229(08)60024-8] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 11/23/2022]
Abstract
In considering key events of genomic disorders in the development and progression of cancer, the correlation between genomic instability and carcinogenesis is currently under investigation. In this work, we propose an inductive logic programming approach to the problem of modeling evolution patterns for breast cancer. Using this approach, it is possible to extract fingerprints of stages of the disease that can be used to develop and deliver the most adequate therapies to patients. Furthermore, such a model can help physicians and biologists elucidate the molecular dynamics underlying the aberrations-waterfall model behind carcinogenesis. By showing results obtained on a real-world dataset, we try to give some hints about further approaches to the knowledge-driven validation of such hypotheses.
|
290
|
Ressom HW, Varghese RS, Zhang Z, Xuan J, Clarke R. Classification algorithms for phenotype prediction in genomics and proteomics. FRONT BIOSCI-LANDMRK 2008; 13:691-708. [PMID: 17981580 DOI: 10.2741/2712] [Citation(s) in RCA: 42] [Impact Index Per Article: 2.5] [Indexed: 11/22/2022]
Abstract
This paper gives an overview of statistical and machine learning-based feature selection and pattern classification algorithms and their application in molecular cancer classification or phenotype prediction. In particular, the paper focuses on the use of these computational methods for gene and peak selection from microarray and mass spectrometry data, respectively. The selected features are presented to a classifier for phenotype prediction.
Affiliation(s)
- Habtom W Ressom
- Lombardi Comprehensive Cancer Center, 3970 Reservoir Rd NW, Washington, DC 20057, USA.
|
291
|
Liu W, Yuan K, Ye D. Reducing microarray data via nonnegative matrix factorization for visualization and clustering analysis. J Biomed Inform 2007; 41:602-6. [PMID: 18234564 DOI: 10.1016/j.jbi.2007.12.003] [Citation(s) in RCA: 41] [Impact Index Per Article: 2.3] [Received: 06/20/2007] [Revised: 12/11/2007] [Accepted: 12/14/2007] [Indexed: 11/26/2022]
Abstract
In microarray data analysis, each gene expression sample has thousands of genes and reducing such high dimensionality is useful for both visualization and further clustering of samples. Traditional principal component analysis (PCA) is a commonly used method which has problems. Nonnegative Matrix Factorization (NMF) is a new dimension reduction method. In this paper we compare NMF and PCA for dimension reduction. The reduced data is used for visualization, and clustering analysis via k-means on 11 real gene expression datasets. Before the clustering analysis, we apply NMF and PCA for reduction in visualization. The results on one leukemia dataset show that NMF can discover natural clusters and clearly detect one mislabeled sample while PCA cannot. For clustering analysis via k-means, NMF most typically outperforms PCA. Our results demonstrate the superiority of NMF over PCA in reducing microarray data.
Affiliation(s)
- Weixiang Liu
- Research Center of Biomedical Engineering, Life Science Division, Graduate School at Shenzhen, Tsinghua University, Shenzhen 518055, China.
|
292
|
Bellazzi R, Zupan B. Towards knowledge-based gene expression data mining. J Biomed Inform 2007; 40:787-802. [PMID: 17683991 DOI: 10.1016/j.jbi.2007.06.005] [Citation(s) in RCA: 33] [Impact Index Per Article: 1.8] [Received: 09/29/2006] [Revised: 04/20/2007] [Accepted: 06/06/2007] [Indexed: 11/24/2022]
Abstract
The field of gene expression data analysis has grown in the past few years from being purely data-centric to integrative, aiming at complementing microarray analysis with data and knowledge from diverse available sources. In this review, we report on the plethora of gene expression data mining techniques and focus on their evolution toward knowledge-based data analysis approaches. In particular, we discuss recent developments in gene expression-based analysis methods used in association and classification studies, phenotyping and reverse engineering of gene networks.
Affiliation(s)
- Riccardo Bellazzi
- Dipartimento di Informatica e Sistemistica, Università di Pavia, via Ferrata 1, I-27100 Pavia, Italy
|
293
|
A comparative analysis of predictive models of morbidity in intensive care unit after cardiac surgery - part I: model planning. BMC Med Inform Decis Mak 2007; 7:35. [PMID: 18034872 PMCID: PMC2212627 DOI: 10.1186/1472-6947-7-35] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.2] [Received: 04/26/2007] [Accepted: 11/22/2007] [Indexed: 11/30/2022] Open
Abstract
Background Different methods have recently been proposed for predicting morbidity in intensive care units (ICU). The aim of the present study was to critically review a number of approaches for developing models capable of estimating the probability of morbidity in ICU after heart surgery. The study is divided into two parts. In this first part, popular models used to estimate the probability of class membership are grouped into distinct categories according to their underlying mathematical principles. Modelling techniques and intrinsic strengths and weaknesses of each model are analysed and discussed from a theoretical point of view, in consideration of clinical applications.
Methods Models based on Bayes rule, k-nearest neighbour algorithm, logistic regression, scoring systems and artificial neural networks are investigated. Key issues for model design are described. The mathematical treatment of some aspects of model structure is also included for readers interested in developing models, though a full understanding of mathematical relationships is not necessary if the reader is only interested in perceiving the practical meaning of model assumptions, weaknesses and strengths from a user point of view.
Results Scoring systems are very attractive due to their simplicity of use, although this may undermine their predictive capacity. Logistic regression models are trustworthy tools, although they suffer from the principal limitations of most regression procedures. Bayesian models seem to be a good compromise between complexity and predictive performance, but model recalibration is generally necessary. k-nearest neighbour may be a valid non parametric technique, though computational cost and the need for large data storage are major weaknesses of this approach. Artificial neural networks have intrinsic advantages with respect to common statistical models, though the training process may be problematical.
Conclusion Knowledge of model assumptions and the theoretical strengths and weaknesses of different approaches are fundamental for designing models for estimating the probability of morbidity after heart surgery. However, a rational choice also requires evaluation and comparison of actual performances of locally-developed competitive models in the clinical scenario to obtain satisfactory agreement between local needs and model response. In the second part of this study the above predictive models will therefore be tested on real data acquired in a specialized ICU.
|
294
|
Abstract
To respond to potential adverse exposures properly, health care providers need accurate indicators of exposure levels. The indicators are particularly important in the case of acetaminophen (APAP) intoxication, the leading cause of liver failure in the U.S. We hypothesized that gene expression patterns derived from blood cells would provide useful indicators of acute exposure levels. To test this hypothesis, we used a blood gene expression data set from rats exposed to APAP to train classifiers in two prediction algorithms and to extract patterns for prediction using a profiling algorithm. Prediction accuracy was tested on a blinded, independent rat blood test data set and ranged from 88.9% to 95.8%. Genomic markers outperformed predictions based on traditional clinical parameters. The expression profiles of the predictor genes from the patterns extracted from the blood exhibited remarkable (97% accuracy) transtissue APAP exposure prediction when liver gene expression data were used as a test set. Analysis of human samples revealed separation of APAP-intoxicated patients from control individuals based on blood expression levels of human orthologs of the rat discriminatory genes. The major biological signal in the discriminating genes was activation of an inflammatory response after exposure to toxic doses of APAP. These results support the hypothesis that gene expression data from peripheral blood cells can provide valuable information about exposure levels, well before liver damage is detected by classical parameters. It also supports the potential use of genomic markers in the blood as surrogates for clinical markers of potential acute liver damage.
|
295
|
Zhang JG, Deng HW. Gene selection for classification of microarray data based on the Bayes error. BMC Bioinformatics 2007; 8:370. [PMID: 17915022 PMCID: PMC2089123 DOI: 10.1186/1471-2105-8-370] [Citation(s) in RCA: 63] [Impact Index Per Article: 3.5] [Received: 03/21/2007] [Accepted: 10/03/2007] [Indexed: 11/10/2022] Open
Abstract
Background With DNA microarray data, selecting a compact subset of discriminative genes from thousands of genes is a critical step for accurate classification of phenotypes for, e.g., disease diagnosis. Several widely used gene selection methods often select top-ranked genes according to their individual discriminative power in classifying samples into distinct categories, without considering correlations among genes. A limitation of these gene selection methods is that they may result in gene sets with some redundancy and yield an unnecessarily large number of candidate genes for classification analyses. Some recent studies show that incorporating gene-to-gene correlations into gene selection can remove redundant genes and improve classification accuracy.
Results In this study, we propose a new method, Based Bayes error Filter (BBF), to select relevant genes and remove redundant genes in classification analyses of microarray data. The effectiveness and accuracy of this method are demonstrated through analyses of five publicly available microarray datasets. The results show that our gene selection method is capable of achieving better accuracies than previous studies, while being able to effectively select relevant genes, remove redundant genes and obtain efficient and small gene sets for sample classification purposes.
Conclusion The proposed method can effectively identify a compact set of genes with high classification accuracy. This study also indicates that application of the Bayes error is a feasible and effective way for removing redundant genes in gene selection.
Affiliation(s)
- Ji-Gang Zhang
- Departments of Orthopedic Surgery and Basic Medical Science, School of Medicine, University of Missouri-Kansas City, 2411 Holmes Street, Kansas City, MO 64108, USA
- Hong-Wen Deng
- Laboratory of Molecular and Statistical Genetics, College of Life Sciences, Hunan Normal University, Changsha, Hunan 410081, P. R. China
- The Key Laboratory of Biomedical Information Engineering of Ministry of Education and Institute of Molecular Genetics, School of Life Science and Technology, Xi'an Jiaotong University, Xi'an 710049, P. R. China
- Departments of Orthopedic Surgery and Basic Medical Science, School of Medicine, University of Missouri-Kansas City, 2411 Holmes Street, Kansas City, MO 64108, USA
|
296
|
Kawai T, Morita K, Masuda K, Nishida K, Shikishima M, Ohta M, Saito T, Rokutan K. Gene expression signature in peripheral blood cells from medical students exposed to chronic psychological stress. Biol Psychol 2007; 76:147-55. [PMID: 17766027 DOI: 10.1016/j.biopsycho.2007.07.008] [Citation(s) in RCA: 31] [Impact Index Per Article: 1.7] [Received: 01/29/2007] [Revised: 07/13/2007] [Accepted: 07/14/2007] [Indexed: 01/22/2023]
Abstract
To assess the response to chronic psychological stress, gene expression profiles in peripheral blood from 18 medical students facing a licensing examination were analyzed using an original microarray. Total RNA was collected from each subject 9 months before the examination and mixed to be used as a universal control. At that time, most students had normal scores on the state-trait anxiety inventory (STAI). However, STAI scores were significantly elevated at 2 months and at 2 days before the examination. The pattern of the gene expression profile was more uniform 2 days before than 2 months before the examination. We identified 24 genes that significantly and uniformly changed from the universal control 2 days before the examination. Of the 24 genes, real-time PCR validated changes in mRNA levels of 10 (PLCB2, CSF3R, ARHGEF1, DPYD, CTNNB1, PPP3CA, POLM, IRF3, TP53, and CCNI). The identified genes may be useful for assessing the chronic psychological stress response.
Affiliation(s)
- Tomoko Kawai
- Department of Stress Science, Institute of Health Biosciences, University of Tokushima Graduate School, 3-18-15 Kuramoto-cho, Tokushima, Japan
|
297
|
Statnikov A, Li C, Aliferis CF. Effects of environment, genetics and data analysis pitfalls in an esophageal cancer genome-wide association study. PLoS One 2007; 2:e958. [PMID: 17895998 PMCID: PMC1978529 DOI: 10.1371/journal.pone.0000958] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Received: 07/30/2007] [Accepted: 08/30/2007] [Indexed: 12/26/2022] Open
Abstract
BACKGROUND The development of new high-throughput genotyping technologies has allowed fast evaluation of single nucleotide polymorphisms (SNPs) on a genome-wide scale. Several recent genome-wide association studies employing these technologies suggest that panels of SNPs can be a useful tool for predicting cancer susceptibility and discovery of potentially important new disease loci. METHODOLOGY/PRINCIPAL FINDINGS In the present paper we undertake a careful examination of the relative significance of genetics, environmental factors, and biases of the data analysis protocol that was used in a previously published genome-wide association study. That prior study reported a nearly perfect discrimination of esophageal cancer patients and healthy controls on the basis of only genetic information. On the other hand, our results strongly suggest that SNPs in this dataset are not statistically linked to the phenotype, while several environmental factors and especially family history of esophageal cancer (a proxy to both environmental and genetic factors) have only a modest association with the disease. CONCLUSIONS/SIGNIFICANCE The main component of the previously claimed strong discriminatory signal is due to several data analysis pitfalls that in combination led to the strongly optimistic results. Such pitfalls are preventable and should be avoided in future studies since they create misleading conclusions and generate many false leads for subsequent research.
Affiliation(s)
- Alexander Statnikov
- Discovery Systems Laboratory, Department of Biomedical Informatics, Vanderbilt University, Nashville, Tennessee, United States of America
- Chun Li
- Department of Biostatistics, Vanderbilt University, Nashville, Tennessee, United States of America
- Center for Human Genetics Research, Vanderbilt University, Nashville, Tennessee, United States of America
- Constantin F. Aliferis
- Discovery Systems Laboratory, Department of Biomedical Informatics, Vanderbilt University, Nashville, Tennessee, United States of America
- Department of Biostatistics, Vanderbilt University, Nashville, Tennessee, United States of America
- Department of Cancer Biology, Vanderbilt University, Nashville, Tennessee, United States of America
|
298
|
Kawai T, Morita K, Masuda K, Nishida K, Sekiyama A, Teshima-Kondo S, Nakaya Y, Ohta M, Saito T, Rokutan K. Physical exercise-associated gene expression signatures in peripheral blood. Clin J Sport Med 2007; 17:375-83. [PMID: 17873550 DOI: 10.1097/jsm.0b013e31814c3e4f] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.8] [Indexed: 02/02/2023]
Abstract
OBJECTIVE To assess response to physical stress, gene expression profiles in peripheral blood cells were analyzed using an original microarray carrying 1467 stress-responsive complementary DNA probes. DESIGN Gene expression was analyzed at 4, 24, and 48 hours after exercising on a cycle ergometer at 60% VO2 max for 1 hour (aerobic exercise) or until exhausted (exhaustive exercise). SETTING Institute of Health Biosciences, University of Tokushima Graduate School. PARTICIPANTS Twelve healthy male students of the postgraduate or undergraduate school. INTERVENTIONS The volunteers performed the aerobic or exhaustive exercise on a cycle ergometer. MAIN OUTCOME MEASUREMENTS Detection of aerobic exercise-responsive or exhaustive exercise-responsive genes in peripheral blood cells. RESULTS Aerobic and exhaustive exercise transiently changed the expression of 21 and 16 genes, respectively, with the peak at 4 hours. Only 2 genes significantly responded to both types of exercise. Exhaustive but not aerobic exercise produced a secondary response with significantly altered expression of 14 genes at 24 hours. Five of those genes encode receptors for neurotransmitters (HTR1A, CHRNB2, GABRB3, GABRG3, and LOC51289). CONCLUSIONS The behavior of the individual genes shown here may be informative to objectively assess acute physical stress and exhaustion-associated responses.
Affiliation(s)
- Tomoko Kawai
- Department of Stress Science, Institute of Health Biosciences, The University of Tokushima Graduate School, Tokushima, Japan
|
299
|
Huang HL, Chang FL. ESVM: Evolutionary support vector machine for automatic feature selection and classification of microarray data. Biosystems 2007; 90:516-28. [PMID: 17280775 DOI: 10.1016/j.biosystems.2006.12.003] [Citation(s) in RCA: 69] [Impact Index Per Article: 3.8] [Received: 10/21/2006] [Revised: 12/06/2006] [Accepted: 12/06/2006] [Indexed: 11/25/2022]
Abstract
An optimal design of support vector machine (SVM)-based classifiers for prediction aims to optimize the combination of feature selection, parameter setting of SVM, and cross-validation methods. However, SVMs do not offer a mechanism for automatic internal detection of relevant features, and the appropriate setting of their control parameters is often treated as a separate, independent problem. This paper proposes an evolutionary approach to designing an SVM-based classifier (named ESVM) by simultaneous optimization of automatic feature selection and parameter tuning using an intelligent genetic algorithm, combined with k-fold cross-validation regarded as an estimator of generalization ability. To illustrate and evaluate the efficiency of ESVM, a typical application to microarray classification using 11 multi-class datasets is adopted. By considering model uncertainty, a frequency-based technique that votes on multiple sets of potentially informative features is used to identify the most effective subset of genes. It is shown that, averaged over the 11 datasets, ESVM can obtain a high accuracy of 96.88% with a small number (10.0) of selected genes using 10-fold cross-validation. The merits of ESVM are three-fold: (1) automatic feature selection and parameter setting embedded into ESVM can advance prediction abilities, compared to traditional SVMs; (2) ESVM can serve not only as an accurate classifier but also as an adaptive feature extractor; (3) ESVM is developed as an efficient tool so that various SVMs can be used conveniently as the core of ESVM for bioinformatics problems.
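The general pattern described here, evolving a feature mask and a classifier parameter together with k-fold cross-validated accuracy as the fitness, can be sketched roughly as follows. This is not ESVM itself: a plain genetic algorithm stands in for the paper's intelligent genetic algorithm, a k-nearest-neighbour classifier stands in for the SVM so that the example needs only the standard library, and every name (`knn_predict`, `cv_accuracy`, `evolve`), the synthetic data, and all settings are assumptions made for illustration.

```python
import random

def knn_predict(train, query, mask, k):
    """Majority vote among the k nearest training rows, using only masked-in features."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y, m in zip(a, b, mask) if m)
    nearest = sorted(train, key=lambda row: dist(row[0], query))[:k]
    votes = [label for _, label in nearest]
    return max(set(votes), key=votes.count)

def cv_accuracy(data, mask, k, folds=3):
    """Cross-validated accuracy of one candidate (feature mask, k)."""
    if not any(mask):
        return 0.0  # an empty feature set cannot classify anything
    correct = 0
    for f in range(folds):
        test = data[f::folds]
        train = [row for i, row in enumerate(data) if i % folds != f]
        correct += sum(knn_predict(train, x, mask, k) == y for x, y in test)
    return correct / len(data)

def evolve(data, n_features, generations=15, pop_size=12, seed=0):
    """Plain genetic algorithm over (mask, k) pairs; fitness is CV accuracy."""
    rng = random.Random(seed)

    def random_candidate():
        return ([rng.random() < 0.5 for _ in range(n_features)], rng.choice([1, 3, 5]))

    def fitness(candidate):
        return cv_accuracy(data, candidate[0], candidate[1])

    pop = [random_candidate() for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]          # keep the better half unchanged
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_features)  # one-point crossover on the mask
            mask = a[0][:cut] + b[0][cut:]
            i = rng.randrange(n_features)
            mask[i] = not mask[i]               # single-bit mutation
            children.append((mask, rng.choice([a[1], b[1]])))
        pop = parents + children
    best = max(pop, key=fitness)
    return best, fitness(best)

# Synthetic data: feature 0 tracks the label, features 1-3 are noise.
data_rng = random.Random(1)
data = []
for i in range(30):
    label = i % 2
    features = [label * 3.0 + data_rng.gauss(0, 0.5)] + [data_rng.gauss(0, 1) for _ in range(3)]
    data.append((features, label))

(best_mask, best_k), score = evolve(data, n_features=4)
```

Encoding the mask and the parameter in one candidate is what lets the search tune both at once rather than treating parameter setting as a separate problem. Note that reusing the same cross-validation folds for both search and reporting tends to give an optimistic accuracy estimate.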
Affiliation(s)
- Hui-Ling Huang
- Department of Information Management, Jin Wen Institute of Technology, and Department of Anesthesiology, Tri-Service General Hospital, Taipei, Taiwan.
|
300
|
Saeys Y, Inza I, Larrañaga P. A review of feature selection techniques in bioinformatics. Bioinformatics 2007; 23:2507-17. [PMID: 17720704 DOI: 10.1093/bioinformatics/btm344] [Citation(s) in RCA: 2016] [Impact Index Per Article: 112.0] [Indexed: 12/20/2022] Open
Abstract
Feature selection techniques have become an apparent need in many bioinformatics applications. In addition to the large pool of techniques that have already been developed in the machine learning and data mining fields, specific applications in bioinformatics have led to a wealth of newly proposed techniques. In this article, we make the interested reader aware of the possibilities of feature selection, providing a basic taxonomy of feature selection techniques and discussing their use, variety and potential in a number of both common and upcoming bioinformatics applications.
Affiliation(s)
- Yvan Saeys
- Department of Plant Systems Biology, VIB, B-9052 Ghent, Belgium.
|