1
|
Abdelwahab MM, Al-Karawi KA, Semary HE. Deep Learning-Based Prediction of Alzheimer's Disease Using Microarray Gene Expression Data. Biomedicines 2023; 11:3304. [PMID: 38137524 PMCID: PMC10741889 DOI: 10.3390/biomedicines11123304] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/20/2023] [Revised: 12/02/2023] [Accepted: 12/04/2023] [Indexed: 12/24/2023]
Abstract
Alzheimer's disease is a genetically complex disorder, and microarray technology provides valuable insights into it. However, the high dimensionality of microarray datasets and small sample sizes pose challenges. Gene selection techniques have emerged as a promising solution to this challenge, potentially revolutionizing AD diagnosis. The study aims to investigate deep learning techniques, specifically neural networks, in predicting Alzheimer's disease using microarray gene expression data. The goal is to develop a reliable predictive model for early detection and diagnosis, potentially improving patient care and intervention strategies. This study employed gene selection techniques, including Singular Value Decomposition (SVD) and Principal Component Analysis (PCA), to pinpoint pertinent genes within microarray datasets. Leveraging deep learning principles, we harnessed a Convolutional Neural Network (CNN) as our classifier for Alzheimer's disease (AD) prediction. Our approach involved the utilization of a seven-layer CNN with diverse configurations to process the dataset. Empirical outcomes on the AD dataset underscored the effectiveness of the PCA-CNN model, yielding an accuracy of 96.60% and a loss of 0.3503. Likewise, the SVD-CNN model showcased remarkable accuracy, attaining 97.08% and a loss of 0.2466. These results accentuate the potential of our method for gene dimension reduction and classification accuracy enhancement by selecting a subset of pertinent genes. Integrating gene selection methodologies with deep learning architectures presents a promising framework for elevating AD prediction and promoting precision medicine in neurodegenerative disorders. Ongoing research endeavors aim to generalize this approach for diverse applications, explore alternative gene selection techniques, and investigate a variety of deep learning architectures.
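The dimension-reduction step mentioned in this abstract can be illustrated with a minimal sketch. It assumes a samples-by-genes matrix and computes PCA through the SVD of the centered data; the function name `pca_reduce`, the random data, and the matrix sizes are hypothetical, and the CNN classifier from the study is not reproduced here.

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project the rows of X (samples) onto the top principal components."""
    # Center each gene (column) so the SVD captures variance, not the mean.
    X_centered = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)
    # Component scores: equivalent to X_centered @ Vt[:n_components].T
    scores = U[:, :n_components] * s[:n_components]
    # Fraction of total variance carried by each retained component.
    explained = (s ** 2) / np.sum(s ** 2)
    return scores, explained[:n_components]

# Hypothetical data: 20 samples, 500 genes (not the dataset from the study).
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 500))
scores, explained = pca_reduce(X, n_components=5)
```

The reduced `scores` matrix (20 by 5 here) is what a downstream classifier would consume in place of the full expression matrix.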
Affiliation(s)
- Mahmoud M. Abdelwahab
  - Department of Mathematics and Statistics, College of Science, Imam Mohammad Ibn Saud Islamic University, Riyadh 11564, Saudi Arabia
  - Department of Basic Sciences, Higher Institute of Administrative Sciences, Belbeis 44621, Egypt
- Khamis A. Al-Karawi
  - School of Science, Engineering and Environment, Salford University, Salford M5 4WT, UK
  - College of Veterinary Medicine, Diyala University, Baquba 32001, Iraq
- Hatem E. Semary
  - Department of Mathematics and Statistics, College of Science, Imam Mohammad Ibn Saud Islamic University, Riyadh 11564, Saudi Arabia
  - Department of Statistics and Insurance, Faculty of Commerce, Zagazig University, Zagazig 44519, Egypt
|
2
|
Mitrović K, Petrušić I, Radojičić A, Daković M, Savić A. Migraine with aura detection and subtype classification using machine learning algorithms and morphometric magnetic resonance imaging data. Front Neurol 2023; 14:1106612. [PMID: 37441607 PMCID: PMC10333052 DOI: 10.3389/fneur.2023.1106612] [Citation(s) in RCA: 7] [Impact Index Per Article: 7.0] [Received: 11/23/2022] [Accepted: 05/22/2023] [Indexed: 07/15/2023]
Abstract
Introduction Migraine with aura (MwA) is a neurological condition manifested in moderate to severe headaches associated with transient visual and somatosensory symptoms, as well as higher cortical dysfunctions. Considering that about 5% of the world's population suffers from this condition and manifestation could be abundant and characterized by various symptoms, it is of great importance to focus on finding new and advanced techniques for the detection of different phenotypes, which in turn, can allow better diagnosis, classification, and biomarker validation, resulting in tailored treatments of MwA patients. Methods This research aimed to test different machine learning techniques to distinguish healthy people from those suffering from MwA, as well as people with simple MwA and those experiencing complex MwA. Magnetic resonance imaging (MRI) post-processed data (cortical thickness, cortical surface area, cortical volume, cortical mean Gaussian curvature, and cortical folding index) was collected from 78 subjects [46 MwA patients (22 simple MwA and 24 complex MwA) and 32 healthy controls] with 340 different features used for the algorithm training. Results The results show that an algorithm based on post-processed MRI data yields a high classification accuracy (97%) of MwA patients and precise distinction between simple MwA and complex MwA with an accuracy of 98%. Additionally, the sets of features relevant to the classification were identified. The feature importance ranking indicates the thickness of the left temporal pole, right lingual gyrus, and left pars opercularis as the most prominent markers for MwA classification, while the thickness of left pericalcarine gyrus and left pars opercularis are proposed as the two most important features for the simple and complex MwA classification. Discussion This method shows significant potential in the validation of MwA diagnosis and subtype classification, which can tackle and challenge the current treatments of MwA.
Affiliation(s)
- Katarina Mitrović
  - Department of Information Technologies, Faculty of Technical Sciences in Čačak, University of Kragujevac, Čačak, Serbia
- Igor Petrušić
  - Laboratory for Advanced Analysis of Neuroimages, Faculty of Physical Chemistry, University of Belgrade, Belgrade, Serbia
- Aleksandra Radojičić
  - Headache Center, Neurology Clinic, Clinical Center of Serbia, Belgrade, Serbia
  - Faculty of Medicine, University of Belgrade, Belgrade, Serbia
- Marko Daković
  - Laboratory for Advanced Analysis of Neuroimages, Faculty of Physical Chemistry, University of Belgrade, Belgrade, Serbia
- Andrej Savić
  - Science and Research Centre, School of Electrical Engineering, University of Belgrade, Belgrade, Serbia
|
3
|
Characterizing Spatiotemporal Transcriptome of the Human Brain Via Low-Rank Tensor Decomposition. STATISTICS IN BIOSCIENCES 2022. [DOI: 10.1007/s12561-021-09331-5] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 11/25/2022]
|
4
|
Cortez AJ, Kujawa KA, Wilk AM, Sojka DR, Syrkis JP, Olbryt M, Lisowska KM. Evaluation of the Role of ITGBL1 in Ovarian Cancer. Cancers (Basel) 2020; 12:E2676. [PMID: 32961775 PMCID: PMC7563769 DOI: 10.3390/cancers12092676] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 08/21/2020] [Revised: 09/15/2020] [Accepted: 09/16/2020] [Indexed: 12/27/2022]
Abstract
In our previous microarray study we identified two subgroups of high-grade serous ovarian cancers with distinct gene expression and survival. Among differentially expressed genes was an Integrin beta-like 1 (ITGBL1), coding for a poorly characterized protein comprised of ten EGF-like repeats. Here, we have analyzed the influence of ITGBL1 on the phenotype of ovarian cancer (OC) cells. We analyzed expression of four putative ITGBL1 mRNA isoforms in five OC cell lines. OAW42 and SKOV3, having the lowest level of any ITGBL1 mRNA, were chosen to produce ITGBL1-overexpressing variants. In these cells, abundant ITGBL1 mRNA expression could be detected by RT-PCR. Immunodetection was successful only in the culture media, suggesting that ITGBL1 is efficiently secreted. We found that ITGBL1 overexpression affected cellular adhesion, migration and invasiveness, while it had no effect on proliferation rate and the cell cycle. ITGBL1-overexpressing cells were significantly more resistant to cisplatin and paclitaxel, major drugs used in OC treatment. Global gene expression analysis revealed that signaling pathways affected by ITGBL1 overexpression were mostly those related to extracellular matrix organization and function, integrin signaling, focal adhesion, cellular communication and motility; these results were consistent with the findings of our functional studies. Overall, our results indicate that higher expression of ITGBL1 in OC is associated with features that may worsen clinical course of the disease.
Affiliation(s)
- Alexander Jorge Cortez
  - Department of Biostatistics and Bioinformatics, Maria Skłodowska-Curie National Research Institute of Oncology, Gliwice Branch, 44-102 Gliwice, Poland
  - Center for Translational Research and Molecular Biology of Cancer, Maria Skłodowska-Curie National Research Institute of Oncology, Gliwice Branch, 44-102 Gliwice, Poland
- Katarzyna Aleksandra Kujawa
  - Center for Translational Research and Molecular Biology of Cancer, Maria Skłodowska-Curie National Research Institute of Oncology, Gliwice Branch, 44-102 Gliwice, Poland
- Agata Małgorzata Wilk
  - Department of Biostatistics and Bioinformatics, Maria Skłodowska-Curie National Research Institute of Oncology, Gliwice Branch, 44-102 Gliwice, Poland
- Damian Robert Sojka
  - Center for Translational Research and Molecular Biology of Cancer, Maria Skłodowska-Curie National Research Institute of Oncology, Gliwice Branch, 44-102 Gliwice, Poland
- Joanna Patrycja Syrkis
  - Center for Translational Research and Molecular Biology of Cancer, Maria Skłodowska-Curie National Research Institute of Oncology, Gliwice Branch, 44-102 Gliwice, Poland
- Magdalena Olbryt
  - Center for Translational Research and Molecular Biology of Cancer, Maria Skłodowska-Curie National Research Institute of Oncology, Gliwice Branch, 44-102 Gliwice, Poland
- Katarzyna Marta Lisowska
  - Center for Translational Research and Molecular Biology of Cancer, Maria Skłodowska-Curie National Research Institute of Oncology, Gliwice Branch, 44-102 Gliwice, Poland
|
5
|
Wang YY, Cui C, Qi L, Yan H, Zhao XM. DrPOCS: Drug Repositioning Based on Projection Onto Convex Sets. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2019; 16:154-162. [PMID: 29993698 DOI: 10.1109/tcbb.2018.2830384] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Indexed: 06/08/2023]
Abstract
Drug repositioning, i.e., identifying new indications for known drugs, has attracted a lot of attention recently and is becoming an effective strategy in drug development. In the literature, several computational approaches have been proposed to identify potential indications of old drugs based on various types of data sources. In this paper, by formulating the drug-disease associations as a low-rank matrix, we propose a novel method, namely DrPOCS, to identify candidate indications of old drugs based on projection onto convex sets (POCS). With the integration of drug structure and disease phenotype information, DrPOCS predicts potential associations between drugs and diseases with matrix completion. Benchmarking results demonstrate that our proposed approach outperforms popular existing approaches with high accuracy. In addition, a number of novel predicted indications are validated with various types of evidence, indicating the predictive power of our proposed approach.
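The general idea of completing a low-rank matrix by alternating projections can be sketched as follows. This is a generic illustration, not the DrPOCS algorithm: it alternates between enforcing the observed entries and truncating to a fixed rank, and it ignores the drug structure and disease phenotype information the abstract mentions. The set of rank-limited matrices is not convex, so this is a heuristic in the spirit of POCS rather than a true projection-onto-convex-sets scheme. The function name, matrix sizes, and observation rate are all hypothetical.

```python
import numpy as np

def complete_low_rank(M, observed, rank, n_iter=100):
    """Fill unobserved entries of M by alternating two projections."""
    # Start from the observed entries, with zeros where nothing is known.
    X = np.where(observed, M, 0.0)
    for _ in range(n_iter):
        # Projection 1: agree with the data on the observed entries.
        X[observed] = M[observed]
        # Projection 2: nearest matrix of the given rank (truncated SVD).
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        X = (U[:, :rank] * s[:rank]) @ Vt[:rank, :]
    return X

# Hypothetical example: a random rank-2 matrix with about 70% of entries observed.
rng = np.random.default_rng(1)
truth = rng.normal(size=(30, 2)) @ rng.normal(size=(2, 20))
observed = rng.random(truth.shape) < 0.7
estimate = complete_low_rank(truth, observed, rank=2)
```

In a repositioning setting, rows and columns would correspond to drugs and diseases, and large values among the unobserved entries of `estimate` would be read as candidate associations.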
|
6
|
Girdhar K, Gruebele M, Chemla YR. The Behavioral Space of Zebrafish Locomotion and Its Neural Network Analog. PLoS One 2015; 10:e0128668. [PMID: 26132396 PMCID: PMC4489106 DOI: 10.1371/journal.pone.0128668] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.9] [Received: 03/19/2015] [Accepted: 04/30/2015] [Indexed: 11/18/2022]
Abstract
How simple is the underlying control mechanism for the complex locomotion of vertebrates? We explore this question for the swimming behavior of zebrafish larvae. A parameter-independent method, similar to that used in studies of worms and flies, is applied to analyze swimming movies of fish. The motion itself yields a natural set of fish "eigenshapes" as coordinates, rather than the experimenter imposing a choice of coordinates. Three eigenshape coordinates are sufficient to construct a quantitative "postural space" that captures >96% of the observed zebrafish locomotion. Viewed in postural space, swim bouts are manifested as trajectories consisting of cycles of shapes repeated in succession. To classify behavioral patterns quantitatively and to understand behavioral variations among an ensemble of fish, we construct a "behavioral space" using multi-dimensional scaling (MDS). This method turns each cycle of a trajectory into a single point in behavioral space, and clusters points based on behavioral similarity. Clustering analysis reveals three known behavioral patterns—scoots, turns, rests—but shows that these do not represent discrete states, but rather extremes of a continuum. The behavioral space not only classifies fish by their behavior but also distinguishes fish by age. With the insight into fish behavior from postural space and behavioral space, we construct a two-channel neural network model for fish locomotion, which produces strikingly similar postural space and behavioral space dynamics compared to real zebrafish.
Affiliation(s)
- Kiran Girdhar
  - Center for Biophysics and Computational Biology, University of Illinois, Urbana, IL, 61801, United States of America
- Martin Gruebele
  - Center for Biophysics and Computational Biology, University of Illinois, Urbana, IL, 61801, United States of America
  - Department of Physics, Center for the Physics of Living Cells, University of Illinois, Urbana, IL, 61801, United States of America
  - Department of Chemistry, University of Illinois, Urbana, 61801, United States of America
  - * E-mail: (YRC); (MG)
- Yann R. Chemla
  - Center for Biophysics and Computational Biology, University of Illinois, Urbana, IL, 61801, United States of America
  - Department of Physics, Center for the Physics of Living Cells, University of Illinois, Urbana, IL, 61801, United States of America
  - * E-mail: (YRC); (MG)
|
7
|
|
8
|
Baralis E, Cerquitelli T, Chiusano S, D'elia V, Molinari R, Susta D. Early prediction of the highest workload in incremental cardiopulmonary tests. ACM T INTEL SYST TEC 2013. [DOI: 10.1145/2508037.2508051] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 10/26/2022]
Abstract
Incremental tests are widely used in cardiopulmonary exercise testing, both in the clinical domain and in sport sciences. The highest workload (denoted W_peak) reached in the test is key information for assessing the individual body response to the test, for analyzing possible cardiac failures, and for planning rehabilitation and training sessions. Being physically very demanding, incremental tests can significantly increase the body stress on monitored individuals and may cause cardiopulmonary overload. This article presents a new approach to cardiopulmonary testing that addresses these drawbacks. During the test, our approach analyzes the individual body response to the exercise and predicts the W_peak value that will be reached in the test, together with an evaluation of the prediction's accuracy. When the accuracy of the prediction becomes satisfactory, the test can be prematurely stopped, thus avoiding its entire execution. To predict W_peak, we introduce a new index, the CardioPulmonary Efficiency Index (CPE), summarizing the cardiopulmonary response of the individual to the test. Our approach analyzes the CPE trend during the test, together with the characteristics of the individual, and predicts W_peak. A K-nearest-neighbor-based classifier and an ANN-based classifier are exploited for the prediction. The experimental evaluation showed that the W_peak value can be predicted with a limited error from the first steps of the test.
|
9
|
Shabalin AA, Nobel AB. Reconstruction of a low-rank matrix in the presence of Gaussian noise. J MULTIVARIATE ANAL 2013. [DOI: 10.1016/j.jmva.2013.03.005] [Citation(s) in RCA: 79] [Impact Index Per Article: 7.2] [Indexed: 11/25/2022]
|
10
|
Ramsköld D, Luo S, Wang YC, Li R, Deng Q, Faridani OR, Daniels GA, Khrebtukova I, Loring JF, Laurent LC, Schroth GP, Sandberg R. Full-length mRNA-Seq from single-cell levels of RNA and individual circulating tumor cells. Nat Biotechnol 2013; 30:777-82. [PMID: 22820318 PMCID: PMC3467340 DOI: 10.1038/nbt.2282] [Citation(s) in RCA: 1075] [Impact Index Per Article: 97.7] [Received: 11/14/2011] [Accepted: 05/22/2012] [Indexed: 12/17/2022]
Abstract
In the last decade, genome-wide transcriptome analyses have been routinely used to monitor tissue-, disease- and cell type-specific gene expression, but it has been technically challenging to generate expression profiles from single cells. Here we describe a novel and robust mRNA-Seq protocol (Smart-Seq) that is applicable down to single cell levels. Compared with existing methods, Smart-Seq has improved read coverage across transcripts, which significantly enhances detailed analyses of alternative transcript isoforms and identification of SNPs. We have determined the sensitivity and quantitative accuracy of Smart-Seq for single-cell transcriptomics by evaluating it on total RNA dilution series. Applying Smart-Seq to circulating tumor cells from melanomas, we identified distinct gene expression patterns, including new candidate biomarkers for melanoma circulating tumor cells. Importantly, our protocol can easily be utilized for addressing fundamental biological problems requiring genome-wide transcriptome profiling in rare cells.
|
11
|
miRNA-mRNA correlation-network modules in human prostate cancer and the differences between primary and metastatic tumor subtypes. PLoS One 2012; 7:e40130. [PMID: 22768240 PMCID: PMC3387006 DOI: 10.1371/journal.pone.0040130] [Citation(s) in RCA: 37] [Impact Index Per Article: 3.1] [Received: 03/19/2012] [Accepted: 06/01/2012] [Indexed: 11/19/2022]
Abstract
Recent studies have shown the contribution of miRNAs to cancer pathogenesis. Prostate cancer is the most commonly diagnosed cancer in men. Unlike other major types of cancer, no single gene has been identified as being mutated in the majority of prostate tumors. This implies that the expression profiling of genes, including the non-coding miRNAs, may substantially vary across individual cases of this cancer. The within-class variability makes it possible to reconstruct or infer disease-specific miRNA-mRNA correlation and regulatory modular networks using high-dimensional microarray data of prostate tumor samples. Furthermore, since miRNAs and tumor suppressor genes are usually tissue specific, miRNA-mRNA modules could potentially differ between primary prostate cancer (PPC) and metastatic prostate cancer (MPC). We herein performed an in silico analysis to explore the miRNA-mRNA correlation network modules in the two tumor subtypes. Our analysis identified 5 miRNA-mRNA module pairs (MPs) for PPC and MPC, respectively. Each MP includes one positive-connection (correlation) module and one negative-connection (correlation) module. The number of miRNAs or mRNAs (genes) in each module varies from 2 to 8 or from 6 to 622. The modules discovered for PPC are more informative than those for MPC in terms of the implicated biological insights. In particular, one negative-connection module in PPC fits well with the popularly recognized miRNA-mediated post-transcriptional regulation theory. That is, the 3′UTR sequences of the involved mRNAs (∼620) are enriched with the target site motifs of the 7 modular miRNAs, hsa-miR-106b, -191, -19b, -92a, -92b, -93, and -141. About 330 GO terms and KEGG pathways, including the TGF-beta signaling pathway that maintains tissue homeostasis and plays a crucial role in the suppression of the proliferation of cancer cells, are over-represented (adj.p<0.05) in the modular gene list.
These computationally identified modules provide remarkable biological evidence for the interference of miRNAs in the development of prostate cancers and warrant additional follow-up in independent laboratory studies.
|
12
|
Zhuang J, Widschwendter M, Teschendorff AE. A comparison of feature selection and classification methods in DNA methylation studies using the Illumina Infinium platform. BMC Bioinformatics 2012; 13:59. [PMID: 22524302 PMCID: PMC3364843 DOI: 10.1186/1471-2105-13-59] [Citation(s) in RCA: 87] [Impact Index Per Article: 7.3] [Received: 12/20/2011] [Accepted: 04/24/2012] [Indexed: 02/07/2023]
Abstract
Background The 27k Illumina Infinium Methylation Beadchip is a popular high-throughput technology that allows the methylation state of over 27,000 CpGs to be assayed. While feature selection and classification methods have been comprehensively explored in the context of gene expression data, relatively little is known as to how best to perform feature selection or classification in the context of Illumina Infinium methylation data. Given the rising importance of epigenomics in cancer and other complex genetic diseases, and in view of the upcoming epigenome wide association studies, it is critical to identify the statistical methods that offer improved inference in this novel context. Results Using a total of 7 large Illumina Infinium 27k Methylation data sets, encompassing over 1,000 samples from a wide range of tissues, we here provide an evaluation of popular feature selection, dimensional reduction and classification methods on DNA methylation data. Specifically, we evaluate the effects of variance filtering, supervised principal components (SPCA) and the choice of DNA methylation quantification measure on downstream statistical inference. We show that for relatively large sample sizes feature selection using test statistics is similar for M and β-values, but that in the limit of small sample sizes, M-values allow more reliable identification of true positives. We also show that the effect of variance filtering on feature selection is study-specific and dependent on the phenotype of interest and tissue type profiled. Specifically, we find that variance filtering improves the detection of true positives in studies with large effect sizes, but that it may lead to worse performance in studies with smaller yet significant effect sizes. In contrast, supervised principal components improves the statistical power, especially in studies with small effect sizes. 
We also demonstrate that classification using the Elastic Net and Support Vector Machine (SVM) clearly outperforms competing methods like LASSO and SPCA. Finally, in unsupervised modelling of cancer diagnosis, we find that non-negative matrix factorisation (NMF) clearly outperforms principal components analysis. Conclusions Our results highlight the importance of tailoring the feature selection and classification methodology to the sample size and biological context of the DNA methylation study. The Elastic Net emerges as a powerful classification algorithm for large-scale DNA methylation studies, while NMF does well in the unsupervised context. The insights presented here will be useful to any study embarking on large-scale DNA methylation profiling using Illumina Infinium beadarrays.
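The two methylation measures compared in this abstract are related by a logit transform: the M-value is the base-2 log-odds of the β-value, ignoring the small offset sometimes added to the β-value denominator. A minimal sketch of the conversion follows; the function names and the clipping constant `eps` are assumptions made for illustration, not part of the cited study.

```python
import numpy as np

def beta_to_m(beta, eps=1e-6):
    """Convert methylation beta-values in [0, 1] to M-values (log2 odds)."""
    # Clip away exact 0 and 1 so the logarithm stays finite (eps is an assumed choice).
    beta = np.clip(np.asarray(beta, dtype=float), eps, 1.0 - eps)
    return np.log2(beta / (1.0 - beta))

def m_to_beta(m):
    """Inverse transform: M-values back to beta-values."""
    m = np.asarray(m, dtype=float)
    return 2.0 ** m / (2.0 ** m + 1.0)

betas = np.array([0.1, 0.5, 0.8])
m_values = beta_to_m(betas)
```

A β-value of 0.5 maps to an M-value of 0, and the transform stretches values near 0 and 1, which is one reason test statistics can behave differently on the two scales.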
Affiliation(s)
- Joanna Zhuang
  - Statistical Genomics Group, Paul O'Gorman Building, UCL Cancer Institute, University College London, 72 Huntley Street, London WC1E 6BT, UK
|
13
|
Shukla S, Kavak E, Gregory M, Imashimizu M, Shutinoski B, Kashlev M, Oberdoerffer P, Sandberg R, Oberdoerffer S. CTCF-promoted RNA polymerase II pausing links DNA methylation to splicing. Nature 2012; 479:74-9. [PMID: 21964334 DOI: 10.1038/nature10442] [Citation(s) in RCA: 718] [Impact Index Per Article: 59.8] [Received: 03/24/2011] [Revised: 11/03/2011] [Accepted: 08/12/2011] [Indexed: 12/17/2022]
Abstract
Alternative splicing of pre-messenger RNA is a key feature of transcriptome expansion in eukaryotic cells, yet its regulation is poorly understood. Spliceosome assembly occurs co-transcriptionally, raising the possibility that DNA structure may directly influence alternative splicing. Supporting such an association, recent reports have identified distinct histone methylation patterns, elevated nucleosome occupancy and enriched DNA methylation at exons relative to introns. Moreover, the rate of transcription elongation has been linked to alternative splicing. Here we provide the first evidence that a DNA-binding protein, CCCTC-binding factor (CTCF), can promote inclusion of weak upstream exons by mediating local RNA polymerase II pausing both in a mammalian model system for alternative splicing, CD45, and genome-wide. We further show that CTCF binding to CD45 exon 5 is inhibited by DNA methylation, leading to reciprocal effects on exon 5 inclusion. These findings provide a mechanistic basis for developmental regulation of splicing outcome through heritable epigenetic marks.
Affiliation(s)
- Sanjeev Shukla
  - Center for Cancer Research, Mouse Cancer Genetics Program, National Cancer Institute at Frederick, Frederick, Maryland 21702, USA
|
14
|
Construction of protein interaction networks based on the label-free quantitative proteomics. Methods Mol Biol 2011; 781:71-85. [PMID: 21877278 DOI: 10.1007/978-1-61779-276-2_5] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 12/28/2022]
Abstract
Multiprotein complexes are essential building blocks for many cellular processes in an organism. Taking the process of transcription as an example, the interplay of several chromatin-remodeling complexes is responsible for the tight regulation of gene expression. Knowing how those proteins associate into protein complexes not only helps to improve our understanding of these cellular processes, but can also lead to the discovery of the function of novel interacting proteins. Given the large number of proteins with little to no functional annotation throughout many organisms, including human, the identification and characterization of protein complexes has grown into a major focus of network biology. Toward this goal, we have developed several computational approaches based upon label-free quantitative proteomics approaches for the analysis of protein complexes and protein interaction networks. Here, we describe the computational approaches used to build probabilistic protein interaction networks, which are detailed in this chapter using the example of complexes involved in chromatin remodeling and transcription.
|
15
|
Abstract
Environmental stressors such as chemicals and physical agents induce various oxidative stresses and affect human health. To elucidate their underlying mechanisms, etiology and risk, analyses of gene expression signatures in environmental stress-induced human diseases, including neuronal disorders, cancer and diabetes, are crucially important. Recent studies have clarified oxidative stress-induced signaling pathways in human and experimental animals. These pathways are classifiable into several categories: reactive oxygen species (ROS) metabolism and antioxidant defenses, p53 pathway signaling, nitric oxide (NO) signaling pathway, hypoxia signaling, transforming growth factor (TGF)-beta bone morphogenetic protein (BMP) signaling, tumor necrosis factor (TNF) ligand-receptor signaling, and mitochondrial function. This review describes the gene expression signatures through which environmental stressors induce oxidative stress and regulate signal transduction pathways in rodent and human tissues.
Affiliation(s)
- H Sone
  - National Institute for Environmental Studies, 16-2 Onogawa, Tsukuba, Ibaraki, Japan.
|
16
|
svdPPCS: an effective singular value decomposition-based method for conserved and divergent co-expression gene module identification. BMC Bioinformatics 2010; 11:338. [PMID: 20565989 PMCID: PMC2905369 DOI: 10.1186/1471-2105-11-338] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.9] [Received: 03/01/2010] [Accepted: 06/22/2010] [Indexed: 12/25/2022]
Abstract
Background Comparative analysis of gene expression profiling of multiple biological categories, such as different species of organisms or different kinds of tissue, promises to enhance the fundamental understanding of the universality as well as the specialization of mechanisms and related biological themes. Grouping genes with a similar expression pattern or exhibiting co-expression together is a starting point in understanding and analyzing gene expression data. In recent literature, gene module level analysis is advocated in order to understand biological network design and system behaviors in disease and life processes; however, practical difficulties often lie in the implementation of existing methods. Results Using the singular value decomposition (SVD) technique, we developed a new computational tool, named svdPPCS (SVD-based Pattern Pairing and Chart Splitting), to identify conserved and divergent co-expression modules of two sets of microarray experiments. In the proposed methods, gene modules are identified by splitting the two-way chart coordinated with a pair of left singular vectors factorized from the gene expression matrices of the two biological categories. Importantly, the cutoffs are determined by a data-driven algorithm using the well-defined statistic, SVD-p. The implementation was illustrated on two time series microarray data sets generated from the samples of accessory gland (ACG) and malpighian tubule (MT) tissues of the line W118 of M. drosophila. Two conserved modules and six divergent modules, each of which has a unique characteristic profile across tissue kinds and aging processes, were identified. The number of genes contained in these models ranged from five to a few hundred. Three to over a hundred GO terms were over-represented in individual modules with FDR < 0.1. One divergent module suggested the tissue-specific relationship between the expressions of mitochondrion-related genes and the aging process. 
This finding, together with others, may be of biological significance. The validity of the proposed SVD-based method was further verified by a simulation study, as well as the comparisons with regression analysis and cubic spline regression analysis plus PAM based clustering. Conclusions svdPPCS is a novel computational tool for the comparative analysis of transcriptional profiling. It especially fits the comparison of time series data of related organisms or different tissues of the same organism under equivalent or similar experimental conditions. The general scheme can be directly extended to the comparisons of multiple data sets. It also can be applied to the integration of data sets from different platforms and of different sources.
|
17
|
Zhu D. Semi-supervised gene shaving method for predicting low variation biological pathways from genome-wide data. BMC Bioinformatics 2009; 10 Suppl 1:S54. [PMID: 19208157 PMCID: PMC2648790 DOI: 10.1186/1471-2105-10-s1-s54] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 11/30/2022] Open
Abstract
BACKGROUND The gene shaving algorithm and many other clustering algorithms identify gene clusters showing high variation across samples. However, gene expression in many signaling pathways shows only modest and concordant changes that fail to be identified by these methods. The increasingly available prior knowledge of signaling pathways provides a new opportunity to solve this problem. RESULTS We propose an innovative semi-supervised gene clustering algorithm, in which the original gene shaving algorithm is extended and generalized so that prior knowledge of signaling pathways can be incorporated. Unlike other methods, our method identifies gene clusters showing concerted and modest expression variation as well as strong expression correlation. Using available pathway gene sets as prior knowledge, whether complete or incomplete, our algorithm is capable of forming tightly regulated gene clusters showing modest variation across samples. We demonstrate the advantages of our algorithm over the original gene shaving algorithm using two microarray data sets. The stability of the gene clusters was assessed using a jackknife approach. CONCLUSION Our algorithm represents one of the first clustering algorithms specifically designed to identify signaling pathways of low and concordant gene expression variation. The discriminating power is achieved by manufacturing a principal component enriched by signaling pathways.
Affiliation(s)
- Dongxiao Zhu
- Department of Computer Science, University of New Orleans, New Orleans, LA 70148, USA.
|
18
|
Liu Q, Zhang Y, Xu Y, Ye X. Fuzzy Kernel Clustering of RNA Secondary Structure Ensemble Using a Novel Similarity Metric. J Biomol Struct Dyn 2008; 25:685-96. [DOI: 10.1080/07391102.2008.10507214] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Indexed: 10/28/2022]
|
19
|
Probabilistic assembly of human protein interaction networks from label-free quantitative proteomics. Proc Natl Acad Sci U S A 2008; 105:1454-9. [PMID: 18218781 DOI: 10.1073/pnas.0706983105] [Citation(s) in RCA: 196] [Impact Index Per Article: 12.3] [Indexed: 02/07/2023] Open
Abstract
Large-scale affinity purification and mass spectrometry studies have played important roles in the assembly and analysis of comprehensive protein interaction networks for lower eukaryotes. However, the development of such networks for human proteins has been slowed by the high cost and significant technical challenges associated with systematic studies of protein interactions. To address this challenge, we have developed a method for building local and focused networks. This approach couples vector algebra and statistical methods with normalized spectral counting (NSAF) derived from the analysis of affinity purifications via chromatography-based proteomics. After mathematical removal of contaminant proteins, the core components of multiprotein complexes are determined by singular value decomposition analysis and clustering. The probability of interactions within and between complexes is computed solely based upon NSAFs using Bayes' approach. To demonstrate the application of this method to small-scale datasets, we analyzed an expanded human TIP49a and TIP49b dataset. This dataset contained proteins affinity-purified with 27 different epitope-tagged components of the chromatin remodeling SRCAP, hINO80, and TRRAP/TIP60 complexes, and the nutrient sensing complex Uri/Prefoldin. Within a core network of 65 unique proteins, we captured all known components of these complexes and novel protein associations, especially in the Uri/Prefoldin complex. Finally, we constructed a probabilistic human interaction network composed of 557 protein pairs.
|
20
|
Wu H, Yuan M, Kaech SM, Halloran ME. A statistical analysis of memory CD8 T cell differentiation: An application of a hierarchical state space model to a short time course microarray experiment. Ann Appl Stat 2007. [DOI: 10.1214/07-aoas118] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.5] [Indexed: 11/19/2022]
|
21
|
Fujibuchi W, Kato T. Classification of heterogeneous microarray data by maximum entropy kernel. BMC Bioinformatics 2007; 8:267. [PMID: 17651507 PMCID: PMC1994960 DOI: 10.1186/1471-2105-8-267] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.6] [Received: 01/16/2007] [Accepted: 07/26/2007] [Indexed: 11/10/2022] Open
Abstract
Background There is a large amount of microarray data accumulating in public databases, providing various data waiting to be analyzed jointly. Powerful kernel-based methods are commonly used in microarray analyses with support vector machines (SVMs) to approach a wide range of classification problems. However, the standard vectorial data kernel family (linear, RBF, etc.) that takes vectorial data as input, often fails in prediction if the data come from different platforms or laboratories, due to the low gene overlaps or consistencies between the different datasets. Results We introduce a new type of kernel called maximum entropy (ME) kernel, which has no pre-defined function but is generated by kernel entropy maximization with sample distance matrices as constraints, into the field of SVM classification of microarray data. We assessed the performance of the ME kernel with three different data: heterogeneous kidney carcinoma, noise-introduced leukemia, and heterogeneous oral cavity carcinoma metastasis data. The results clearly show that the ME kernel is very robust for heterogeneous data containing missing values and high-noise, and gives higher prediction accuracies than the standard kernels, namely, linear, polynomial and RBF. Conclusion The results demonstrate its utility in effectively analyzing promiscuous microarray data of rare specimens, e.g., minor diseases or species, that present difficulty in compiling homogeneous data in a single laboratory.
Affiliation(s)
- Wataru Fujibuchi
- National Institute of Advanced Industrial Science and Technology (AIST), Computational Biology Research Center, 2-42 Aomi, Koto-ku, Tokyo 135-0064, Japan
- Tsuyoshi Kato
- National Institute of Advanced Industrial Science and Technology (AIST), Computational Biology Research Center, 2-42 Aomi, Koto-ku, Tokyo 135-0064, Japan
- Graduate School of Frontier Sciences, University of Tokyo, 5-1-5 Kashiwanoha, Kashiwa, Chiba 277-8562, Japan
|
22
|
Mamtani MR, Thakre TP, Kalkonde MY, Amin MA, Kalkonde YV, Amin AP, Kulkarni H. A simple method to combine multiple molecular biomarkers for dichotomous diagnostic classification. BMC Bioinformatics 2006; 7:442. [PMID: 17032455 PMCID: PMC1618410 DOI: 10.1186/1471-2105-7-442] [Citation(s) in RCA: 15] [Impact Index Per Article: 0.8] [Received: 05/13/2006] [Accepted: 10/10/2006] [Indexed: 11/29/2022] Open
Abstract
Background In spite of the recognized diagnostic potential of biomarkers, the quest for squelching noise and wringing information from a given set of biomarkers continues. Here, we suggest a statistical algorithm that – assuming each molecular biomarker to be a diagnostic test – enriches the diagnostic performance of an optimized set of independent biomarkers employing established statistical techniques. We validated the proposed algorithm using several simulation datasets in addition to four publicly available real datasets that compared i) subjects having cancer with those without; ii) subjects with two different cancers; iii) subjects with two different types of one cancer; and iv) subjects with the same cancer resulting in differential time to metastasis. Results Our algorithm comprises three steps: estimating the area under the receiver operating characteristic curve for each biomarker, identifying a subset of biomarkers using linear regression, and combining the chosen biomarkers using linear discriminant function analysis. Combining these established statistical methods, which are available in most statistical packages, we observed that the diagnostic accuracy of our approach was 100%, 99.94%, 96.67% and 93.92% for the real datasets used in the study. These estimates were comparable to or better than the ones previously reported using alternative methods. In a synthetic dataset, we also observed that all the biomarkers chosen by our algorithm were indeed truly differentially expressed. Conclusion The proposed algorithm can be used for accurate diagnosis in the setting of dichotomous classification of disease states.
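The first of the three steps named in this abstract, estimating the area under the ROC curve for each biomarker, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the marker names, and the toy values are hypothetical, and the AUC is computed with the standard Mann-Whitney pair-counting identity. The regression-based subset selection and the linear discriminant step are not shown.

```python
def auc(scores_pos, scores_neg):
    """Area under the ROC curve via the Mann-Whitney pair-counting identity.

    Returns the fraction of (positive, negative) pairs in which the positive
    sample has the higher score; ties count as half a win.
    """
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


def rank_biomarkers(cases, controls):
    """Return (name, AUC) pairs, most discriminating first.

    A marker is ranked by how far its AUC lies from 0.5 (chance level),
    so strongly down-regulated markers also rank highly.
    """
    results = [(name, auc(cases[name], controls[name])) for name in cases]
    return sorted(results, key=lambda item: abs(item[1] - 0.5), reverse=True)


# Hypothetical toy data: expression of two markers in cases and controls.
cases = {"marker_a": [5.1, 6.0, 5.8], "marker_b": [2.0, 3.1, 2.4]}
controls = {"marker_a": [3.9, 4.2, 5.1], "marker_b": [2.2, 2.9, 2.5]}
ranking = rank_biomarkers(cases, controls)
```

In this toy example `marker_a` separates the two groups far better than `marker_b`, so it would be the more natural candidate to carry forward into a later selection step.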
Affiliation(s)
- Tushar P Thakre
- Lata Medical Research Foundation, Nagpur, India
- University of North Texas Health Science Center, Fort Worth, Texas, USA
- Amit P Amin
- Lata Medical Research Foundation, Nagpur, India
|
23
|
Inoue LYT, Neira M, Nelson C, Gleave M, Etzioni R. Cluster-based network model for time-course gene expression data. Biostatistics 2006; 8:507-25. [PMID: 16980695 DOI: 10.1093/biostatistics/kxl026] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.3] [Indexed: 11/14/2022] Open
Abstract
We propose a model-based approach to unify clustering and network modeling using time-course gene expression data. Specifically, our approach uses a mixture model to cluster genes. Genes within the same cluster share a similar expression profile. The network is built over cluster-specific expression profiles using state-space models. We discuss the application of our model to simulated data as well as to time-course gene expression data arising from animal models of prostate cancer progression. The latter application shows that, with a combined statistical and bioinformatics analysis, we are able to extract gene-to-gene relationships supported by the literature as well as new plausible relationships.
Affiliation(s)
- Lurdes Y T Inoue
- Department of Biostatistics, University of Washington, F-600 Health Sciences Building, Campus Mail Stop 357232, Seattle, WA 98195, USA.
|
24
|
Pascual-Montano A, Carmona-Saez P, Chagoyen M, Tirado F, Carazo JM, Pascual-Marqui RD. bioNMF: a versatile tool for non-negative matrix factorization in biology. BMC Bioinformatics 2006; 7:366. [PMID: 16875499 PMCID: PMC1550731 DOI: 10.1186/1471-2105-7-366] [Citation(s) in RCA: 66] [Impact Index Per Article: 3.7] [Received: 05/18/2006] [Accepted: 07/28/2006] [Indexed: 12/02/2022] Open
Abstract
Background In the bioinformatics field, the non-negative matrix factorization (NMF) technique has attracted a great deal of interest because of its capability of providing new insights and relevant information about the complex latent relationships in experimental data sets. This method, and some of its variants, have been successfully applied to gene expression, sequence analysis, functional characterization of genes and text mining. Although interest in this technique within the bioinformatics community has increased during the last few years, few simple standalone tools are available to perform these types of data analysis in an integrated environment. Results In this work we propose a versatile and user-friendly tool that implements the NMF methodology in different analysis contexts to support some of the most important reported applications of this new methodology. These include clustering and biclustering of gene expression data, protein sequence analysis, text mining of biomedical literature and sample classification using gene expression. The tool, which is named bioNMF, also contains a user-friendly graphical interface to explore results interactively and thereby facilitate the exploratory data analysis process. Conclusion bioNMF is a standalone versatile application which does not require any special installation or libraries. It can be used for most of the multiple applications proposed in the bioinformatics field or to support new research using this method. This tool is publicly available at .
Affiliation(s)
- Alberto Pascual-Montano
- Computer Architecture Department, Facultad de Ciencias Físicas, Universidad Complutense de Madrid, 28040, Spain
- Pedro Carmona-Saez
- BioComputing Unit, National Center of Biotechnology, Campus Universidad Autónoma de Madrid, 28049, Spain
- Monica Chagoyen
- Computer Architecture Department, Facultad de Ciencias Físicas, Universidad Complutense de Madrid, 28040, Spain
- BioComputing Unit, National Center of Biotechnology, Campus Universidad Autónoma de Madrid, 28049, Spain
- Francisco Tirado
- Computer Architecture Department, Facultad de Ciencias Físicas, Universidad Complutense de Madrid, 28040, Spain
- Jose M Carazo
- BioComputing Unit, National Center of Biotechnology, Campus Universidad Autónoma de Madrid, 28049, Spain
- Roberto D Pascual-Marqui
- The KEY Institute for Brain-Mind Research, University Hospital of Psychiatry. Lenggstr. 31, CH-8029 Zurich, Switzerland
|
25
|
Sen TZ, Kloczkowski A, Jernigan RL. Functional clustering of yeast proteins from the protein-protein interaction network. BMC Bioinformatics 2006; 7:355. [PMID: 16863590 PMCID: PMC1557866 DOI: 10.1186/1471-2105-7-355] [Citation(s) in RCA: 38] [Impact Index Per Article: 2.1] [Received: 02/23/2006] [Accepted: 07/24/2006] [Indexed: 12/27/2022] Open
Abstract
BACKGROUND The abundant data available for protein interaction networks have not yet been fully understood. New types of analyses are needed to reveal organizational principles of these networks to investigate the details of functional and regulatory clusters of proteins. RESULTS In the present work, individual clusters identified by an eigenmode analysis of the connectivity matrix of the protein-protein interaction network in yeast are investigated for possible functional relationships among the members of the cluster. With our functional clustering we have successfully predicted several new protein-protein interactions that indeed have been reported recently. CONCLUSION Eigenmode analysis of the entire connectivity matrix yields both a global and a detailed view of the network. We have shown that the eigenmode clustering not only is guided by the number of proteins with which each protein interacts, but also leads to functional clustering that can be applied to predict new protein interactions.
Affiliation(s)
- Taner Z Sen
- L.H. Baker Center for Bioinformatics and Biological Statistics, Iowa State University Ames, IA 50011, USA
- Department of Biochemistry, Biophysics, and Molecular Biology, Iowa State University, Ames, IA 50011, USA
- Andrzej Kloczkowski
- Department of Biochemistry, Biophysics, and Molecular Biology, Iowa State University, Ames, IA 50011, USA
- Robert L Jernigan
- L.H. Baker Center for Bioinformatics and Biological Statistics, Iowa State University Ames, IA 50011, USA
- Department of Biochemistry, Biophysics, and Molecular Biology, Iowa State University, Ames, IA 50011, USA
|
26
|
Dabrowski M, Adach A, Aerts S, Moreau Y, Kaminska B. Identification of conserved modes of expression profiles during hippocampal development and neuronal differentiation in vitro. J Neurochem 2006; 97 Suppl 1:87-91. [PMID: 16635255 DOI: 10.1111/j.1471-4159.2005.03537.x] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 12/01/2022]
Abstract
Gene expression profiles can be regarded as sums of simpler modes, analogous to the modes of a vibrating violin string. Decomposition of temporal gene expression profiles into modes by singular value decomposition (SVD) has been reported before, but the question of the degree to which the SVD modes can be interpreted in terms of biology remains open. We report and compare the results of SVD of published datasets from hippocampal development, neuronal differentiation in vitro, and a control time-series hippocampal dataset. We demonstrate that the first SVD mode reflects the magnitude of expression, interpretable on the Affymetrix platform. In the datasets from gene profiling of hippocampal development and neuronal differentiation, the second mode reflects a monotonic change in expression, either up- or down-regulation, over the time course of the experiment. We demonstrate that the top two SVD modes are conserved between datasets and therefore likely reflect properties of the underlying system (gene expression in hippocampus) rather than of a particular experiment or dataset. Our results also indicate that the magnitude of expression and the direction of change in expression during hippocampal development are uncorrelated, suggesting that they are regulated by largely independent mechanisms.
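To make the notion of a "first SVD mode" concrete, the following Python sketch estimates the leading singular value and right singular vector of a small genes-by-time-points matrix by power iteration. It is an illustration under stated assumptions, not the authors' pipeline: the function name and toy matrix are hypothetical, and a production analysis would normally call a library SVD routine instead.

```python
import math


def leading_mode(matrix, iterations=200):
    """Estimate the first singular value and right singular vector.

    `matrix` is a list of rows (genes) by columns (time points). Power
    iteration is applied to A^T A. Assumption: the uniform starting vector
    is not orthogonal to the true leading mode.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    v = [1.0 / math.sqrt(n_cols)] * n_cols  # uniform unit start vector
    for _ in range(iterations):
        # u = A v: projection of every gene profile onto the current mode
        u = [sum(a * b for a, b in zip(row, v)) for row in matrix]
        # w = A^T u: updated (unnormalised) temporal mode
        w = [sum(matrix[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    u = [sum(a * b for a, b in zip(row, v)) for row in matrix]
    sigma = math.sqrt(sum(x * x for x in u))  # sigma = ||A v||
    return sigma, v


# Hypothetical toy matrix: three genes with flat profiles at different levels.
expression = [[2.0, 2.0, 2.0], [4.0, 4.0, 4.0], [1.0, 1.0, 1.0]]
sigma, mode = leading_mode(expression)
```

Because every toy profile is flat, the leading mode is constant over time and each gene's projection onto it is proportional to its expression level, which mirrors the abstract's reading of the first mode as the magnitude of expression.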
Affiliation(s)
- Michal Dabrowski
- Laboratory of Transcription Regulation, Department of Cell Biology, The Nencki Institute of Experimental Biology, Warsaw, Poland.
|
27
|
Roden JC, King BW, Trout D, Mortazavi A, Wold BJ, Hart CE. Mining gene expression data by interpreting principal components. BMC Bioinformatics 2006; 7:194. [PMID: 16600052 PMCID: PMC1501050 DOI: 10.1186/1471-2105-7-194] [Citation(s) in RCA: 38] [Impact Index Per Article: 2.1] [Received: 07/03/2005] [Accepted: 04/07/2006] [Indexed: 12/04/2022] Open
Abstract
Background There are many methods for analyzing microarray data that group together genes having similar patterns of expression over all conditions tested. However, in many instances the biologically important goal is to identify relatively small sets of genes that share coherent expression across only some conditions, rather than all or most conditions as required in traditional clustering; e.g. genes that are highly up-regulated and/or down-regulated similarly across only a subset of conditions. Equally important is the need to learn which conditions are the decisive ones in forming such gene sets of interest, and how they relate to diverse conditional covariates, such as disease diagnosis or prognosis. Results We present a method for automatically identifying such candidate sets of biologically relevant genes using a combination of principal components analysis and information theoretic metrics. To enable easy use of our methods, we have developed a data analysis package that facilitates visualization and subsequent data mining of the independent sources of significant variation present in gene microarray expression datasets (or in any other similarly structured high-dimensional dataset). We applied these tools to two public datasets, and highlight sets of genes most affected by specific subsets of conditions (e.g. tissues, treatments, samples, etc.). Statistically significant associations for highlighted gene sets were shown via global analysis for Gene Ontology term enrichment. Together with covariate associations, the tool provides a basis for building testable hypotheses about the biological or experimental causes of observed variation. Conclusion We provide an unsupervised data mining technique for diverse microarray expression datasets that is distinct from major methods now in routine use. In test uses, this method, based on publicly available gene annotations, appears to identify numerous sets of biologically relevant genes. 
It has proven especially valuable in instances where there are many diverse conditions (tens to hundreds of different tissues or cell types), a situation in which many clustering and ordering algorithms become problematic. This approach also shows promise in other topic domains such as multi-spectral imaging datasets.
Affiliation(s)
- Joseph C Roden
- Jet Propulsion Laboratory, California Institute of Technology, Pasadena, USA
- Brandon W King
- Division of Biology, California Institute of Technology, Pasadena, USA
- Diane Trout
- Division of Biology, California Institute of Technology, Pasadena, USA
- Ali Mortazavi
- Division of Biology, California Institute of Technology, Pasadena, USA
- Barbara J Wold
- Division of Biology, California Institute of Technology, Pasadena, USA
|
28
|
Carter GW, Rupp S, Fink GR, Galitski T. Disentangling information flow in the Ras-cAMP signaling network. Genome Res 2006; 16:520-6. [PMID: 16533914 PMCID: PMC1457029 DOI: 10.1101/gr.4473506] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.7] [Indexed: 11/25/2022]
Abstract
The perturbation of signal-transduction molecules elicits genomic-expression effects that are typically neither restricted to a small set of genes nor uniform. Instead there are broad, varied, and complex changes in expression across the genome. These observations suggest that signal transduction is not mediated by isolated pathways of information flow to distinct groups of genes in the genome. Rather, multiple entangled paths of information flow influence overlapping sets of genes. Using the Ras-cAMP pathway in Saccharomyces cerevisiae as a model system, we perturbed key pathway elements and collected genomic-expression data. Singular value decomposition was applied to separate the genome-wide transcriptional response into weighted expression components exhibited by overlapping groups of genes. Molecular interaction data were integrated to connect gene groups to perturbed signaling elements. The resulting series of linked subnetworks maps multiple putative pathways of information flow through a dense signaling network, and provides a set of testable hypotheses for complex gene-expression effects across the genome.
|
29
|
Hu J, Wright FA, Zou F. Estimation of Expression Indexes for Oligonucleotide Arrays Using the Singular Value Decomposition. J Am Stat Assoc 2006. [DOI: 10.1198/016214505000000989] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.5] [Indexed: 11/21/2022]
|
30
|
Cios KJ, Mamitsuka H, Nagashima T, Tadeusiewicz R. Computational intelligence in solving bioinformatics problems. Artif Intell Med 2005; 35:1-8. [PMID: 16095889 DOI: 10.1016/j.artmed.2005.07.001] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.0] [Indexed: 11/25/2022]
|
31
|
Liang Y, Kelemen A. Associating phenotypes with molecular events: recent statistical advances and challenges underpinning microarray experiments. Funct Integr Genomics 2005; 6:1-13. [PMID: 16292543 DOI: 10.1007/s10142-005-0006-z] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.1] [Received: 04/18/2005] [Revised: 06/22/2005] [Accepted: 08/16/2005] [Indexed: 10/25/2022]
Abstract
Progress in mapping the genome and developments in array technologies have provided large amounts of information for delineating the roles of genes involved in complex diseases and quantitative traits. Since complex phenotypes are determined by a network of interrelated biological traits typically involving multiple inter-correlated genetic and environmental factors that interact in a hierarchical fashion, microarrays hold tremendous latent information. The analysis of microarray data is, however, still a bottleneck. In this paper, we review the recent advances in statistical analyses for associating phenotypes with molecular events underpinning microarray experiments. Classical statistical procedures to analyze phenotypes in genetics are reviewed first, followed by descriptions of the statistical procedures for linking molecular events to measured gene expression phenotypes (microarray-based gene expression) and observed phenotypes such as disease status. These statistical procedures include (1) prior analysis, such as data quality controls and normalization analyses, for minimizing the effects of experimental artifacts and random noise; (2) gene selection and differentiation procedures based on inferential statistics for class comparisons; (3) dynamic temporal pattern analysis through exploratory statistics such as unsupervised clustering and supervised classification and prediction; (4) assessing the reliability of microarray studies using real-time PCR and the reproducibility issues arising from many studies and multiple platforms. In addition, post-analysis procedures that associate the discovered patterns of gene expression with pathway and functional analysis for selected genes are also considered in order to increase our understanding of interconnected gene processes.
Affiliation(s)
- Yulan Liang
- Department of Biostatistics, The State University of New York at Buffalo, Buffalo, NY 14214, USA.
|
32
|
Hand DJ, Heard NA. Finding groups in gene expression data. J Biomed Biotechnol 2005; 2005:215-25. [PMID: 16046827 PMCID: PMC1184051 DOI: 10.1155/jbb.2005.215] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.0] [Received: 06/11/2004] [Revised: 08/24/2004] [Accepted: 08/24/2004] [Indexed: 11/18/2022] Open
Abstract
The vast potential of the genomic insight offered by microarray technologies has led to their widespread use since they were introduced a decade ago. Application areas include gene function discovery, disease diagnosis, and inferring regulatory networks. Microarray experiments enable large-scale, high-throughput investigations of gene activity and have thus provided the data analyst with a distinctive, high-dimensional field of study. Many questions in this field relate to finding subgroups of data profiles which are very similar. A popular type of exploratory tool for finding subgroups is cluster analysis, and many different flavors of algorithms have been used and indeed tailored for microarray data. Cluster analysis, however, implies a partitioning of the entire data set, and this does not always match the objective. Sometimes pattern discovery or bump hunting tools are more appropriate. This paper reviews these various tools for finding interesting subgroups.
Affiliation(s)
- David J Hand
- Department of Mathematics, Faculty of Physical Sciences, Imperial College, London SW7 2AZ, UK.
|
33
|
Teramoto KI, Tada M, Tamoto E, Abe M, Kawakami A, Komuro K, Matsunaga A, Shindoh G, Takada M, Murakawa K, Kanai M, Kobayashi N, Fujiwara Y, Nishimura N, Shirata K, Takahishi T, Ishizu A, Ikeda H, Hamada JI, Kondo S, Katoh H, Moriuchi T, Yoshiki T. Prediction of lymphatic invasion/lymph node metastasis, recurrence, and survival in patients with gastric cancer by cDNA array-based expression profiling. J Surg Res 2005; 124:225-36. [PMID: 15820252 DOI: 10.1016/j.jss.2004.10.003] [Citation(s) in RCA: 17] [Impact Index Per Article: 0.9] [Received: 06/08/2004] [Indexed: 01/21/2023]
Abstract
BACKGROUND We assessed the predictability of various classes of gastric carcinoma defined by clinicopathological parameters, such as invasiveness and clinical outcomes, using cDNA array data obtained from 54 cases. MATERIALS AND METHODS We searched for an optimal combination of genes to discriminate the classes defined by the clinicopathological parameters using a feature subset selection algorithm, which was applied to a set of genes preselected on the basis of a statistical difference in expression (two-sided t test, P ≤ 0.05). With the selected features (gene set), we evaluated the predictability of each parameter in a leave-one-out cross-validation test. RESULTS We successfully selected sets of genes for which the classifier predicted better versus worse overall survival (tumor-specific death) and tumor-free survival (recurrence), with respective classification rates of 94 and 92%. A contingency table analysis (χ² test) and Cox proportional hazards model analysis revealed that lymph node metastasis is the most important factor (confounding factor) in patients' prognoses and risks of recurrence. The feature subset selection procedure successfully extracted expression patterns characteristic of lymph node metastasis and lymphatic vessel invasion, yielding 92 and 98% prediction accuracies for these respective factors. CONCLUSION We conclude that expression profiling using feature subset selection provides a powerful means of stratifying gastric cancer patients with regard to prognostic factors. Further studies are warranted to apply this method to the personalization of treatment options.
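The leave-one-out cross-validation mentioned in this abstract can be sketched in a few lines of Python. The abstract does not state which classifier was used, so a nearest-centroid rule is swapped in here purely as a stand-in; the function names, labels, and toy values are hypothetical. In a real analysis, any gene preselection would need to be repeated inside each fold to avoid an optimistic bias; that step is omitted for brevity.

```python
def nearest_centroid_predict(train_x, train_y, sample):
    """Predict the label whose class centroid is closest to `sample`."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(train_y)):
        members = [x for x, y in zip(train_x, train_y) if y == label]
        # Per-feature mean of the training samples in this class
        centroid = [sum(col) / len(members) for col in zip(*members)]
        dist = sum((a - b) ** 2 for a, b in zip(sample, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def leave_one_out_accuracy(x, y):
    """Hold out each sample once, train on the rest, score the prediction."""
    correct = 0
    for i in range(len(x)):
        train_x = x[:i] + x[i + 1:]
        train_y = y[:i] + y[i + 1:]
        if nearest_centroid_predict(train_x, train_y, x[i]) == y[i]:
            correct += 1
    return correct / len(x)


# Hypothetical toy data: two selected genes measured in six patients.
profiles = [[1.0, 0.2], [1.2, 0.1], [0.9, 0.3],
            [3.0, 1.1], [3.2, 0.9], [2.8, 1.0]]
outcomes = ["better", "better", "better", "worse", "worse", "worse"]
accuracy = leave_one_out_accuracy(profiles, outcomes)
```

The returned fraction plays the role of the classification rates quoted in the abstract, although the toy classes here are deliberately well separated.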
MESH Headings
- Adenocarcinoma, Mucinous/genetics
- Adenocarcinoma, Mucinous/mortality
- Adenocarcinoma, Mucinous/secondary
- Adult
- Aged
- Aged, 80 and over
- Anticipation, Genetic
- Carcinoma, Papillary/genetics
- Carcinoma, Papillary/mortality
- Carcinoma, Papillary/secondary
- Carcinoma, Signet Ring Cell/genetics
- Carcinoma, Signet Ring Cell/mortality
- Carcinoma, Signet Ring Cell/secondary
- Female
- Gene Expression Profiling/methods
- Gene Expression Regulation, Neoplastic
- Humans
- Lymphatic Metastasis/genetics
- Lymphatic Metastasis/pathology
- Male
- Middle Aged
- Neoplasm Recurrence, Local/genetics
- Neoplasm Recurrence, Local/mortality
- Oligonucleotide Array Sequence Analysis/methods
- Predictive Value of Tests
- Prognosis
- Reverse Transcriptase Polymerase Chain Reaction
- Stomach Neoplasms/genetics
- Stomach Neoplasms/mortality
- Stomach Neoplasms/secondary
- Survival Analysis
Affiliation(s)
- Ken-ichi Teramoto
- Department of Pathology/Pathophysiology, Division of Pathophysiological Science, Hokkaido University Graduate School of Medicine, Sapporo, Japan
|
34
|
Liang Y, Tayo B, Cai X, Kelemen A. Differential and trajectory methods for time course gene expression data. Bioinformatics 2005; 21:3009-16. [PMID: 15886280 PMCID: PMC2574001 DOI: 10.1093/bioinformatics/bti465] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.0] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION The issue of high dimensionality in microarray data has been, and remains, a hot topic in statistical and computational analysis. Efficient gene filtering and differentiation approaches can reduce the dimensions of data, help to remove redundant genes and noise, and highlight the most relevant genes that are major players in the development of certain diseases or the effect of drug treatment. The purpose of this study is to investigate the efficiency of parametric (including Bayesian and non-Bayesian, linear and non-linear), non-parametric and semi-parametric gene filtering methods through the application of time course microarray data from multiple sclerosis patients being treated with interferon-beta-1a. The analysis of variance with bootstrapping (parametric), class dispersion (semi-parametric) and Pareto (non-parametric) with permutation methods are presented and compared for filtering and finding differentially expressed genes. The Bayesian linear correlated model, the Bayesian non-linear model and the non-Bayesian mixed effects model with bootstrap were also developed to characterize the differential expression patterns. Furthermore, trajectory-clustering approaches were developed in order to investigate the dynamic patterns and inter-dependency of drug treatment effects on gene expression. RESULTS Results show that the presented methods performed significantly differently, but all were adequate in capturing a small number of the genes potentially relevant to the disease. The parametric methods, such as the mixed model and the two Bayesian approaches, proved to be more conservative. This may be because these methods are based on overall variation in expression across all time points.
The semi-parametric (class dispersion) and non-parametric (Pareto) methods were appropriate in capturing variation in expression from time point to time point, thereby making them more suitable for investigating significant monotonic changes and trajectories of changes in gene expression in time course microarray data. Also, the non-linear Bayesian model proved to be less conservative than the linear Bayesian correlated growth model in filtering out the redundant genes, although the linear model showed a better fit than the non-linear model (smaller DIC). We also report the trajectories of significant genes, since we have been able to isolate trajectories of genes whose regulations appear to be inter-dependent.
Affiliation(s)
- Yulan Liang
- Department of Biostatistics, University at Buffalo Buffalo, NY 14226, USA.
|
35
|
Cavalieri D, De Filippo C. Bioinformatic methods for integrating whole-genome expression results into cellular networks. Drug Discov Today 2005; 10:727-34. [PMID: 15896686 DOI: 10.1016/s1359-6446(05)03433-1] [Citation(s) in RCA: 36] [Impact Index Per Article: 1.9] [Indexed: 11/23/2022]
Abstract
Extracting a comprehensive overview from the huge amount of information arising from whole-genome analyses is a significant challenge. This review critically surveys the state of the art methods that are used to connect information from functional genomic studies to biological function. Cluster analysis methods for inferring the correlation between genes are discussed, as are the methods for integrating gene expression information with existing information on biological pathways and the methods that combine cluster analysis with biological information to reconstruct novel biological networks.
Affiliation(s)
- Duccio Cavalieri
- Department of Pharmacology, University of Florence, Viale Pieraccini 6, 50139 Florence, Italy.
|
36
|
Chiappetta P, Roubaud MC, Torrésani B. Blind Source Separation and the Analysis of Microarray Data. J Comput Biol 2004; 11:1090-109. [PMID: 15662200 DOI: 10.1089/cmb.2004.11.1090] [Citation(s) in RCA: 69] [Impact Index Per Article: 3.5] [Indexed: 11/12/2022] Open
Abstract
We develop an approach for the exploratory analysis of gene expression data, based upon blind source separation techniques. This approach exploits higher-order statistics to identify a linear model for (logarithms of) expression profiles, described as linear combinations of "independent sources." As a result, it yields "elementary expression patterns" (the "sources"), which may be interpreted as potential regulation pathways. Further analysis of the so-obtained sources shows that they are generally characterized by a small number of specific coexpressed or antiexpressed genes. In addition, the projections of the expression profiles onto the estimated sources often provide significant clustering of conditions. The algorithm relies on a large number of runs of "independent component analysis" with random initializations, followed by a search for "consensus sources." It then provides estimates for independent sources, together with an assessment of their robustness. The results obtained on two datasets (namely, breast cancer data and Bacillus subtilis sulfur metabolism data) show that some of the obtained gene families correspond to well known families of coregulated genes, which validates the proposed approach.
Affiliation(s)
- P Chiappetta
- Laboratoire d'Analyse, Topologie et Probabilités, Centre de Mathématiques et Informatique, Université de Provence, France
|
37
|
Liu B, Cui Q, Jiang T, Ma S. A combinational feature selection and ensemble neural network method for classification of gene expression data. BMC Bioinformatics 2004; 5:136. [PMID: 15450124 PMCID: PMC522806 DOI: 10.1186/1471-2105-5-136] [Citation(s) in RCA: 83] [Impact Index Per Article: 4.2] [Received: 04/05/2004] [Accepted: 09/27/2004] [Indexed: 02/08/2023] Open
Abstract
Background Microarray experiments are becoming a powerful tool for clinical diagnosis, as they have the potential to discover gene expression patterns that are characteristic of a particular disease. To date, this problem has received most attention in the context of cancer research, especially in tumor classification. Various feature selection methods and classifier design strategies have also been generally used and compared. However, most published articles on tumor classification have applied a certain technique to a certain dataset, and recently several researchers compared these techniques based on several public datasets. But it has been verified that differently selected features reflect different aspects of the dataset, and some selected features can obtain better solutions on certain problems. At the same time, faced with a large amount of microarray data with little knowledge, it is difficult to find the intrinsic characteristics using traditional methods. In this paper, we attempt to introduce a combinational feature selection method in conjunction with ensemble neural networks to generally improve the accuracy and robustness of sample classification. Results We validate our new method on several recent publicly available datasets both with predictive accuracy of testing samples and through cross validation. Compared with the best performance of other current methods, remarkably improved results can be obtained using our new strategy on a wide range of different datasets. Conclusions Thus, we conclude that our methods can obtain more information in microarray data to get more accurate classification and also can help to extract the latent marker genes of the diseases for better diagnosis and treatment.
MESH Headings
- Acute Disease
- Artificial Intelligence
- Colonic Neoplasms/classification
- Colonic Neoplasms/genetics
- Female
- Gene Expression Profiling/classification
- Gene Expression Profiling/methods
- Gene Expression Regulation, Neoplastic/genetics
- Humans
- Leukemia, Myeloid/classification
- Leukemia, Myeloid/genetics
- Lung Neoplasms/classification
- Lung Neoplasms/genetics
- Lymphoma, B-Cell/classification
- Lymphoma, B-Cell/genetics
- Lymphoma, Large B-Cell, Diffuse/classification
- Lymphoma, Large B-Cell, Diffuse/genetics
- Male
- Neural Networks, Computer
- Oligonucleotide Array Sequence Analysis/classification
- Oligonucleotide Array Sequence Analysis/methods
- Ovarian Neoplasms/classification
- Ovarian Neoplasms/genetics
- Precursor Cell Lymphoblastic Leukemia-Lymphoma/classification
- Precursor Cell Lymphoblastic Leukemia-Lymphoma/genetics
- Predictive Value of Tests
- Prostatic Neoplasms/classification
- Prostatic Neoplasms/genetics
Affiliation(s)
- Bing Liu
- National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing 100080, P. R. China
- Qinghua Cui
- National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing 100080, P. R. China
- Tianzi Jiang
- National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing 100080, P. R. China
- Songde Ma
- National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing 100080, P. R. China
|
38
|
Challacombe JF, Rechtsteiner A, Gottardo R, Rocha LM, Browne EP, Shenk T, Altherr MR, Brettin TS. Evaluation of the host transcriptional response to human cytomegalovirus infection. Physiol Genomics 2004; 18:51-62. [PMID: 15069167 DOI: 10.1152/physiolgenomics.00155.2003] [Citation(s) in RCA: 30] [Impact Index Per Article: 1.5] [Indexed: 12/24/2022] Open
Abstract
Gene expression data from human cytomegalovirus (HCMV)-infected cells were analyzed using DNA-Chip Analyzer (dChip) followed by singular value decomposition (SVD) and compared with a previous analysis of the same data that employed GeneChip software and a fold change filtering approach. dChip and SVD analysis revealed two clusters of coexpressed human genes responding differently to HCMV infection: one containing some genes identified previously, and another that was largely unique to this analysis. Annotating these genes, we identified several functional categories important to host cell responses to HCMV infection. These categories included genes involved in transcriptional regulation, oncogenesis, and cell cycle regulation, which were more prevalent in cluster 1, and genes involved in immune system regulation, signal transduction, and cell adhesion, which were more prevalent in cluster 2. Within these categories, we found genes involved in the host response to HCMV infection (mainly in cluster 1), as well as genes targeted by HCMV’s immune evasion strategies (mainly in cluster 2). As the second group of genes identified by the dChip and SVD approach was statistically and biologically significant, our results point out the advantages of using different methods to analyze gene expression data.
Affiliation(s)
- Jean F Challacombe
- Bioscience Division, Los Alamos National Laboratory, Los Alamos, New Mexico 87545, USA
|
39
|
Simek K, Kimmel M. A note on estimation of dynamics of multiple gene expression based on singular value decomposition. Math Biosci 2003; 182:183-99. [PMID: 12591624 DOI: 10.1016/s0025-5564(02)00185-2] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.4] [Indexed: 11/20/2022]
Abstract
Recently, data on multiple gene expression at sequential time points were analyzed, using singular value decomposition (SVD) as a means to capture dominant trends, called characteristic modes, followed by fitting of a linear discrete-time dynamical system in which the expression values at a given time point are linear combinations of the values at a previous time point. We attempt to address several aspects of the method. To obtain the model we formulate a non-linear optimization problem and present how to solve it numerically using standard MATLAB procedures. We use publicly available data to test the approach. For the reader's convenience, we provide a straightforward, ready-to-use procedure in MATLAB, which employs its standard features to analyze data of this kind. Then, we investigate the sensitivity of the method to missing measurements and its ability to reconstruct missing data. Also, we discuss the possible consequences of data regularization, sometimes called 'polishing', on the outcome of analysis, especially when the model is to be used for prediction purposes. In summary, we point out that approximation of multiple gene expression data preceded by SVD provides some insight into the dynamics but may also lead to unexpected difficulties, such as overfitting problems.
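The two-step idea in this abstract (SVD to extract characteristic modes, then a linear discrete-time system on those modes) can be illustrated with a minimal Python sketch. This is not the authors' MATLAB procedure: the abstract describes a non-linear optimization, whereas the sketch below swaps in an ordinary least-squares fit via the pseudoinverse. The function names `characteristic_modes` and `fit_transition`, and the synthetic data, are hypothetical.

```python
import numpy as np


def characteristic_modes(expr, k):
    """Return the top-k right singular vectors (k x T) of a genes x time matrix."""
    # Economy-size SVD: expr = U @ diag(s) @ Vt; rows of Vt are temporal modes.
    _, s, vt = np.linalg.svd(expr, full_matrices=False)
    return vt[:k], s[:k]


def fit_transition(modes):
    """Least-squares M such that modes[:, t + 1] ~= M @ modes[:, t].

    Assumption: plain least squares stands in for the non-linear
    optimization mentioned in the abstract.
    """
    past, future = modes[:, :-1], modes[:, 1:]
    return future @ np.linalg.pinv(past)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    expr = rng.normal(size=(100, 12))  # synthetic: 100 genes, 12 time points
    modes, _ = characteristic_modes(expr, k=3)
    print(fit_transition(modes).shape)  # (3, 3)
```

With few time points and several modes, such a fit has nearly as many parameters as observations, which is one way the overfitting concern raised in the abstract can arise.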
Affiliation(s)
- Krzysztof Simek
- Department of Statistics, Rice University, Mail Stop 138, 6100 Main Street, P.O. 1892, Houston, TX 77005, USA
|
40
|
Hörnquist M, Hertz J, Wahde M. Effective dimensionality of large-scale expression data using principal component analysis. Biosystems 2002; 65:147-56. [PMID: 12069725 DOI: 10.1016/s0303-2647(02)00011-4] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.6] [Indexed: 10/27/2022]
Abstract
Large-scale expression data are today measured for thousands of genes simultaneously. This development is followed by an exploration of theoretical tools to get as much information out of these data as possible. One line is to try to extract the underlying regulatory network. The models used thus far, however, contain many parameters, and a careful investigation is necessary in order not to over-fit the models. We employ principal component analysis to show how, in the context of linear additive models, one can get a rough estimate of the effective dimensionality (the number of information-carrying dimensions) of large-scale gene expression datasets. We treat both the lack of independence of different measurements in a time series and the fact that measurements are subject to some level of noise, both of which reduce the effective dimensionality and thereby constrain the complexity of models which can be built from the data.
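As a rough illustration of what a PCA-based dimensionality estimate can look like, the sketch below counts the principal components needed to reach a chosen fraction of the total variance. The abstract does not specify its estimator, so the cumulative-variance rule, the 0.95 default, and the name `effective_dimensionality` are all assumptions made here for illustration.

```python
import numpy as np


def effective_dimensionality(data, threshold=0.95):
    """Number of principal components needed to explain `threshold` of the variance.

    data: samples x features array. The cumulative-variance rule is an
    illustrative assumption, not the estimator used in the cited paper.
    """
    centered = data - data.mean(axis=0, keepdims=True)
    # Squared singular values of the centered data are proportional to PC variances.
    variances = np.linalg.svd(centered, compute_uv=False) ** 2
    total = variances.sum()
    if total == 0:
        return 0  # constant data carries no variance
    cumulative = np.cumsum(variances) / total
    count = int(np.searchsorted(cumulative, threshold)) + 1
    return min(count, len(variances))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic data: 50 samples that vary along only 2 latent directions.
    latent = rng.normal(size=(50, 2)) @ rng.normal(size=(2, 20))
    print(effective_dimensionality(latent, threshold=0.999))
```

Measurement noise spreads variance over many small components, so in practice the threshold (or a noise-floor cutoff) largely determines the answer.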
|
41
|
Fogolari F, Tessari S, Molinari H. Singular value decomposition analysis of protein sequence alignment score data. Proteins 2002; 46:161-70. [PMID: 11807944 DOI: 10.1002/prot.10032] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.2] [Indexed: 11/08/2022]
Abstract
One of the standard tools for the analysis of data arranged in matrix form is singular value decomposition (SVD). Few applications to genomic data have been reported to date, mainly for the analysis of gene expression microarray data. We review SVD properties, examine mathematical terms and assumptions implicit in the SVD formalism, and show that SVD can be applied to the analysis of matrices representing pairwise alignment scores between large sets of protein sequences. In particular, we illustrate SVD capabilities for data dimension reduction and for clustering protein sequences. A comparison is performed between SVD-generated clusters of proteins and annotation reported in the SWISS-PROT Database for a set of protein sequences forming the calycin superfamily, entailing all entries corresponding to the lipocalin, cytosolic fatty acid-binding protein, and avidin-streptavidin Prosite patterns.
Affiliation(s)
- F Fogolari
- Dipartimento Scientifico Tecnologico, Facoltà di Scienze, Università di Verona, Verona, Italy.
|
42
|
Landgrebe J, Wurst W, Welzl G. Permutation-validated principal components analysis of microarray data. Genome Biol 2002; 3:RESEARCH0019. [PMID: 11983060 PMCID: PMC115254 DOI: 10.1186/gb-2002-3-4-research0019] [Citation(s) in RCA: 49] [Impact Index Per Article: 2.2] [Received: 12/12/2001] [Revised: 01/31/2002] [Accepted: 02/15/2002] [Indexed: 11/11/2022] Open
Abstract
BACKGROUND In microarray data analysis, the comparison of gene-expression profiles with respect to different conditions and the selection of biologically interesting genes are crucial tasks. Multivariate statistical methods have been applied to analyze these large datasets. Less work has been published concerning the assessment of the reliability of gene-selection procedures. Here we describe a method to assess reliability in multivariate microarray data analysis using permutation-validated principal components analysis (PCA). The approach is designed for microarray data with a group structure. RESULTS We used PCA to detect the major sources of variance underlying the hybridization conditions followed by gene selection based on PCA-derived and permutation-based test statistics. We validated our method by applying it to well characterized yeast cell-cycle data and to two datasets from our laboratory. We could describe the major sources of variance, select informative genes and visualize the relationship of genes and arrays. We observed differences in the level of the explained variance and the interpretability of the selected genes. CONCLUSIONS Combining data visualization and permutation-based gene selection, permutation-validated PCA enables one to illustrate gene-expression variance between several conditions and to select genes by taking into account the relationship of between-group to within-group variance of genes. The method can be used to extract the leading sources of variance from microarray data, to visualize relationships between genes and hybridizations and to select informative genes in a statistically reliable manner. This selection accounts for the level of reproducibility of replicates or group structure as well as gene-specific scatter. Visualization of the data can support a straightforward biological interpretation.
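The permutation-based selection step can be sketched in general terms: score a gene by its between-group to within-group variance ratio, then compare that score with the scores obtained after shuffling the group labels. The sketch below is a generic single-gene permutation test, not the PCA-derived statistic of the cited paper; the names `variance_ratio` and `permutation_p_value` and the example values are hypothetical.

```python
import numpy as np


def variance_ratio(values, groups):
    """Between-group over within-group sum of squares for one gene."""
    labels = np.unique(groups)
    grand = values.mean()
    between = sum(
        np.sum(groups == g) * (values[groups == g].mean() - grand) ** 2
        for g in labels
    )
    within = sum(
        np.sum((values[groups == g] - values[groups == g].mean()) ** 2)
        for g in labels
    )
    if within == 0:
        return np.inf if between > 0 else 0.0
    return between / within


def permutation_p_value(values, groups, n_perm=999, seed=0):
    """Share of label permutations whose ratio is at least the observed one."""
    rng = np.random.default_rng(seed)
    observed = variance_ratio(values, groups)
    hits = sum(
        variance_ratio(values, rng.permutation(groups)) >= observed - 1e-12
        for _ in range(n_perm)
    )
    # The +1 terms count the observed labeling itself, so p is never zero.
    return (hits + 1) / (n_perm + 1)


if __name__ == "__main__":
    expression = np.array([0.0, 0.1, 0.2, 0.1, 0.0, 5.0, 5.1, 5.2, 5.1, 5.0])
    condition = np.array([0] * 5 + [1] * 5)
    print(permutation_p_value(expression, condition))
```

Applied gene by gene, such a test needs a multiple-testing correction before the resulting list is treated as a set of informative genes.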
Affiliation(s)
- Jobst Landgrebe
- Institute of Biomathematics and Biometry, GSF-National Research Center for Environment and Health, Ingolstädter Landstrasse 1, D-85764 Neuherberg, Germany.
|
43
|
Current Awareness on Comparative and Functional Genomics. Comp Funct Genomics 2001. [PMCID: PMC2447222 DOI: 10.1002/cfg.60] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/02/2022] Open
|