1
Neijzen D, Lunter G. Unsupervised learning for medical data: A review of probabilistic factorization methods. Stat Med 2023; 42:5541-5554. [PMID: 37850249 DOI: 10.1002/sim.9924] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/19/2023] [Accepted: 09/13/2023] [Indexed: 10/19/2023]
Abstract
We review popular unsupervised learning methods for the analysis of high-dimensional data encountered in, for example, genomics, medical imaging, cohort studies, and biobanks. We show that four commonly used methods, principal component analysis, K-means clustering, nonnegative matrix factorization, and latent Dirichlet allocation, can be written as probabilistic models underpinned by a low-rank matrix factorization. In addition to highlighting their similarities, this formulation clarifies the various assumptions and restrictions of each approach, which eases identifying the appropriate method for specific applications for applied medical researchers. We also touch upon the most important aspects of inference and model selection for the application of these methods to health data.
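As an illustration of the low-rank matrix factorization that the abstract above identifies as common to all four methods, the following sketch factorizes a nonnegative data matrix as X ≈ WH. It is not code from the review: the multiplicative update rule (the standard Lee and Seung rule for squared error) is swapped in for illustration, the probabilistic formulation is not reproduced, and the function name `nmf`, its parameters, and the synthetic data are all assumptions.

```python
import numpy as np


def nmf(X, rank, n_iter=2000, eps=1e-9, seed=0):
    """Approximate a nonnegative matrix X (n x p) as W @ H with W, H >= 0."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    # Random positive starting point; multiplicative updates keep signs fixed.
    W = rng.random((n, rank)) + eps
    H = rng.random((rank, p)) + eps
    for _ in range(n_iter):
        # Multiplicative updates for the squared (Frobenius) reconstruction error.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H


# Synthetic data generated from two nonnegative latent factors (illustrative only).
rng = np.random.default_rng(1)
W_true = rng.random((30, 2))
H_true = rng.random((2, 8))
X = W_true @ H_true

W, H = nmf(X, rank=2)
rel_err = np.linalg.norm(X - W @ H) / np.linalg.norm(X)
print(f"relative reconstruction error: {rel_err:.4f}")
```

Here the rows of `H` play the role of latent factors and the rows of `W` give each sample's loadings on them; the rank is chosen by the user, which is where the model selection questions mentioned in the abstract come in.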
Affiliation(s)
- Dorien Neijzen
  - Department of Epidemiology, University of Groningen, University Medical Center Groningen, Groningen, the Netherlands
- Gerton Lunter
  - Department of Epidemiology, University of Groningen, University Medical Center Groningen, Groningen, the Netherlands
  - Weatherall Institute of Molecular Medicine, Oxford University, Oxford, UK
2
Johnson JAI, Tsang AP, Mitchell JT, Zhou DL, Bowden J, Davis-Marcisak E, Sherman T, Liefeld T, Loth M, Goff LA, Zimmerman JW, Kinny-Köster B, Jaffee EM, Tamayo P, Mesirov JP, Reich M, Fertig EJ, Stein-O'Brien GL. Inferring cellular and molecular processes in single-cell data with non-negative matrix factorization using Python, R and GenePattern Notebook implementations of CoGAPS. Nat Protoc 2023; 18:3690-3731. [PMID: 37989764 PMCID: PMC10961825 DOI: 10.1038/s41596-023-00892-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/08/2022] [Accepted: 07/21/2023] [Indexed: 11/23/2023]
Abstract
Non-negative matrix factorization (NMF) is an unsupervised learning method well suited to high-throughput biology. However, inferring biological processes from an NMF result still requires additional post hoc statistics and annotation for interpretation of learned features. Here, we introduce a suite of computational tools that implement NMF and provide methods for accurate and clear biological interpretation and analysis. A generalized discussion of NMF covering its benefits, limitations and open questions is followed by four procedures for the Bayesian NMF algorithm Coordinated Gene Activity across Pattern Subsets (CoGAPS). Each procedure will demonstrate NMF analysis to quantify cell state transitions in a public domain single-cell RNA-sequencing dataset. The first demonstrates PyCoGAPS, our new Python implementation that enhances runtime for large datasets, and the second allows its deployment in Docker. The third procedure steps through the same single-cell NMF analysis using our R CoGAPS interface. The fourth introduces a beginner-friendly CoGAPS platform using GenePattern Notebook, aimed at users with a working conceptual knowledge of data analysis but without a basic proficiency in the R or Python programming language. We also constructed a user-facing website to serve as a central repository for information and instructional materials about CoGAPS and its application programming interfaces. The expected timing to set up the packages and conduct a test run is around 15 min, and an additional 30 min to conduct analyses on a precomputed result. The expected runtime on the user's desired dataset can vary from hours to days depending on factors such as dataset size or input parameters.
Affiliation(s)
- Jeanette A I Johnson
  - Department of Oncology, Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University, Baltimore, MD, USA
  - Convergence Institute, Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University, Baltimore, MD, USA
- Ashley P Tsang
  - Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
- Jacob T Mitchell
  - Convergence Institute, Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University, Baltimore, MD, USA
  - Department of Genetic Medicine, Johns Hopkins University, Baltimore, MD, USA
- David L Zhou
  - Department of Neuroscience, Johns Hopkins University, Baltimore, MD, USA
- Julia Bowden
  - Department of Oncology, Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University, Baltimore, MD, USA
  - Convergence Institute, Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University, Baltimore, MD, USA
- Emily Davis-Marcisak
  - Convergence Institute, Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University, Baltimore, MD, USA
  - Department of Genetic Medicine, Johns Hopkins University, Baltimore, MD, USA
- Thomas Sherman
  - Department of Oncology, Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University, Baltimore, MD, USA
- Ted Liefeld
  - Department of Medicine, Moores Cancer Center, University of California San Diego, San Diego, CA, USA
- Melanie Loth
  - Department of Oncology, Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University, Baltimore, MD, USA
  - Convergence Institute, Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University, Baltimore, MD, USA
- Loyal A Goff
  - Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
  - Department of Neuroscience, Johns Hopkins University, Baltimore, MD, USA
  - Kavli Neurodiscovery Institute, Johns Hopkins University, Baltimore, MD, USA
  - Single Cell Training and Analysis Center, Johns Hopkins University, Baltimore, MD, USA
- Jacquelyn W Zimmerman
  - Department of Oncology, Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University, Baltimore, MD, USA
  - Convergence Institute, Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University, Baltimore, MD, USA
- Ben Kinny-Köster
  - Department of Surgery, Johns Hopkins University School of Medicine, Baltimore, MD, USA
- Elizabeth M Jaffee
  - Department of Oncology, Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University, Baltimore, MD, USA
  - Convergence Institute, Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University, Baltimore, MD, USA
- Pablo Tamayo
  - Department of Medicine, Moores Cancer Center, University of California San Diego, San Diego, CA, USA
- Jill P Mesirov
  - Department of Medicine, Moores Cancer Center, University of California San Diego, San Diego, CA, USA
- Michael Reich
  - Department of Medicine, Moores Cancer Center, University of California San Diego, San Diego, CA, USA
- Elana J Fertig
  - Department of Oncology, Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University, Baltimore, MD, USA
  - Convergence Institute, Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University, Baltimore, MD, USA
  - Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
  - Single Cell Training and Analysis Center, Johns Hopkins University, Baltimore, MD, USA
  - Department of Applied Mathematics and Statistics, Johns Hopkins University, Baltimore, MD, USA
- Genevieve L Stein-O'Brien
  - Department of Oncology, Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University, Baltimore, MD, USA
  - Convergence Institute, Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University, Baltimore, MD, USA
  - Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
  - Department of Neuroscience, Johns Hopkins University, Baltimore, MD, USA
  - Kavli Neurodiscovery Institute, Johns Hopkins University, Baltimore, MD, USA
  - Single Cell Training and Analysis Center, Johns Hopkins University, Baltimore, MD, USA
3
Cantini L, Kairov U, de Reyniès A, Barillot E, Radvanyi F, Zinovyev A. Assessing reproducibility of matrix factorization methods in independent transcriptomes. Bioinformatics 2020; 35:4307-4313. [PMID: 30938767 PMCID: PMC6821374 DOI: 10.1093/bioinformatics/btz225] [Citation(s) in RCA: 18] [Impact Index Per Article: 4.5] [Received: 06/14/2018] [Revised: 03/20/2019] [Accepted: 04/01/2019] [Indexed: 12/26/2022]
Abstract
Motivation: Matrix factorization (MF) methods are widely used in order to reduce dimensionality of transcriptomic datasets to the action of few hidden factors (metagenes). MF algorithms have never been compared based on the between-datasets reproducibility of their outputs in similar independent datasets. Lack of this knowledge might have a crucial impact when generalizing the predictions made in a study to others. Results: We systematically test widely used MF methods on several transcriptomic datasets collected from the same cancer type (14 colorectal, 8 breast and 4 ovarian cancer transcriptomic datasets). Inspired by concepts of evolutionary bioinformatics, we design a novel framework based on Reciprocally Best Hit (RBH) graphs in order to benchmark the MF methods for their ability to produce generalizable components. We show that a particular protocol of application of independent component analysis (ICA), accompanied by a stabilization procedure, leads to a significant increase in the between-datasets reproducibility. Moreover, we show that the signals detected through this method are systematically more interpretable than those of other standard methods. We developed a user-friendly tool for performing the Stabilized ICA-based RBH meta-analysis. We apply this methodology to the study of colorectal cancer (CRC) for which 14 independent transcriptomic datasets can be collected. The resulting RBH graph maps the landscape of interconnected factors associated to biological processes or to technological artifacts. These factors can be used as clinical biomarkers or robust and tumor-type specific transcriptomic signatures of tumoral cells or tumoral microenvironment. Their intensities in different samples shed light on the mechanistic basis of CRC molecular subtyping. Availability and implementation: The RBH construction tool is available from http://goo.gl/DzpwYp. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Laura Cantini
  - Institut Curie, PSL Research University, F-75005 Paris, France
  - INSERM U900, F-75005 Paris, France
  - CBIO-Centre for Computational Biology, Mines ParisTech, PSL Research University, F-75006 Paris, France
  - Computational Systems Biology Team, Institut de Biologie de l'École Normale Supérieure, CNRS UMR8197, INSERM U1024, École Normale Supérieure, PSL Research University, Paris, France
- Ulykbek Kairov
  - Laboratory of Bioinformatics and Systems Biology, Center for Life Sciences, National Laboratory Astana, Nazarbayev University, Astana, Kazakhstan
- Aurélien de Reyniès
  - Programme Cartes d'Identité des Tumeurs (CIT), Ligue Nationale Contre le Cancer, Paris, France
- Emmanuel Barillot
  - Institut Curie, PSL Research University, F-75005 Paris, France
  - INSERM U900, F-75005 Paris, France
  - CBIO-Centre for Computational Biology, Mines ParisTech, PSL Research University, F-75006 Paris, France
- François Radvanyi
  - Institut Curie, PSL Research University, CNRS, UMR144, Equipe Labellisée Ligue Contre le Cancer, Paris, France
  - Sorbonne Universités, UPMC Université Paris 06, CNRS, UMR144, Paris
- Andrei Zinovyev
  - Institut Curie, PSL Research University, F-75005 Paris, France
  - INSERM U900, F-75005 Paris, France
  - CBIO-Centre for Computational Biology, Mines ParisTech, PSL Research University, F-75006 Paris, France
  - Lobachevsky University, Nizhny Novgorod, Russia
4
Yi C, Chen C, Si Y, Li F, Zhang T, Liao Y, Jiang Y, Yao D, Xu P. Constructing large-scale cortical brain networks from scalp EEG with Bayesian nonnegative matrix factorization. Neural Netw 2020; 125:338-348. [DOI: 10.1016/j.neunet.2020.02.021] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 07/01/2019] [Revised: 11/20/2019] [Accepted: 02/28/2020] [Indexed: 11/30/2022]
5
Stein-O'Brien GL, Arora R, Culhane AC, Favorov AV, Garmire LX, Greene CS, Goff LA, Li Y, Ngom A, Ochs MF, Xu Y, Fertig EJ. Enter the Matrix: Factorization Uncovers Knowledge from Omics. Trends Genet 2018; 34:790-805. [PMID: 30143323 PMCID: PMC6309559 DOI: 10.1016/j.tig.2018.07.003] [Citation(s) in RCA: 100] [Impact Index Per Article: 16.7] [Received: 03/23/2018] [Revised: 06/01/2018] [Accepted: 07/16/2018] [Indexed: 12/20/2022]
Abstract
Omics data contain signals from the molecular, physical, and kinetic inter- and intracellular interactions that control biological systems. Matrix factorization (MF) techniques can reveal low-dimensional structure from high-dimensional data that reflect these interactions. These techniques can uncover new biological knowledge from diverse high-throughput omics data in applications ranging from pathway discovery to timecourse analysis. We review exemplary applications of MF for systems-level analyses. We discuss appropriate applications of these methods, their limitations, and focus on the analysis of results to facilitate optimal biological interpretation. The inference of biologically relevant features with MF enables discovery from high-throughput data beyond the limits of current biological knowledge - answering questions from high-dimensional data that we have not yet thought to ask.
Affiliation(s)
- Genevieve L Stein-O'Brien
  - Department of Oncology, Division of Biostatistics and Bioinformatics, Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins School of Medicine, Baltimore, MD, USA
  - Department of Neuroscience, Johns Hopkins School of Medicine, Baltimore, MD, USA
  - McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins School of Medicine, Baltimore, MD, USA
- Raman Arora
  - Department of Computer Science, Institute for Data Intensive Engineering and Science, Johns Hopkins University, Baltimore, MD, USA
- Aedin C Culhane
  - Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute, Boston, MA, USA
  - Department of Biostatistics, Harvard TH Chan School of Public Health, Boston, MA, USA
- Alexander V Favorov
  - Department of Oncology, Division of Biostatistics and Bioinformatics, Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins School of Medicine, Baltimore, MD, USA
  - Vavilov Institute of General Genetics, Moscow, Russia
- Casey S Greene
  - Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania, PA, USA
  - Childhood Cancer Data Lab, Alex's Lemonade Stand Foundation, PA, USA
- Loyal A Goff
  - Department of Neuroscience, Johns Hopkins School of Medicine, Baltimore, MD, USA
  - McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins School of Medicine, Baltimore, MD, USA
- Yifeng Li
  - Digital Technologies Research Centre, National Research Council of Canada, Ottawa, ON, Canada
- Aloune Ngom
  - School of Computer Science, University of Windsor, Windsor, ON, Canada
- Michael F Ochs
  - Department of Mathematics and Statistics, The College of New Jersey, Ewing, NJ, USA
- Yanxun Xu
  - Department of Applied Mathematics and Statistics, Whiting School of Engineering, Johns Hopkins University, Baltimore, MD, USA
- Elana J Fertig
  - Department of Oncology, Division of Biostatistics and Bioinformatics, Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins School of Medicine, Baltimore, MD, USA
6
Miller JJ, Cochlin L, Clarke K, Tyler DJ. Weighted averaging in spectroscopic studies improves statistical power. Magn Reson Med 2017; 78:2082-2094. [PMID: 28127795 PMCID: PMC5697704 DOI: 10.1002/mrm.26615] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.9] [Received: 09/21/2016] [Revised: 12/08/2016] [Accepted: 12/28/2016] [Indexed: 12/19/2022]
Abstract
Purpose: In vivo MRS is often characterized by a spectral signal-to-noise ratio (SNR) that varies highly between experiments. A common design for spectroscopic studies is to compare the ratio of two spectral peak amplitudes between groups, e.g. individual PCr/γ-ATP ratios in 31P-MRS. The uncertainty on this ratio is often neglected. We wished to explore this assumption. Theory: The canonical theory for the propagation of uncertainty on the ratio of two spectral peaks and its incorporation in the Frequentist hypothesis testing framework by weighted averaging is presented. Methods: Two retrospective re-analyses of studies comparing spectral peak ratios and one prospective simulation were performed using both the weighted and unweighted methods. Results: It was found that propagating uncertainty correctly improved statistical power in all cases considered, which could be used to reduce the number of subjects required to perform an MR study. Conclusion: The variability of in vivo spectroscopy data is often accounted for by requiring it to meet an SNR threshold. A theoretically sound propagation of the variable uncertainty caused by quantifying spectra of differing SNR is therefore likely to improve the power of in vivo spectroscopy studies.
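The weighted averaging described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes independent uncertainties on the two peak amplitudes, uses the standard first-order propagation formula for a ratio, and combines ratios with an inverse-variance weighted mean. The function names and the peak amplitudes are hypothetical.

```python
import math


def ratio_with_uncertainty(a, sigma_a, b, sigma_b):
    """Ratio a/b with first-order propagated uncertainty (independent errors assumed)."""
    r = a / b
    sigma_r = abs(r) * math.sqrt((sigma_a / a) ** 2 + (sigma_b / b) ** 2)
    return r, sigma_r


def weighted_mean(values, sigmas):
    """Inverse-variance weighted mean and its standard error."""
    weights = [1.0 / s ** 2 for s in sigmas]
    total = sum(weights)
    mean = sum(w * v for w, v in zip(weights, values)) / total
    return mean, math.sqrt(1.0 / total)


# Hypothetical peak amplitudes per subject: (a, sigma_a, b, sigma_b).
# The second subject has a much noisier spectrum than the other two.
measurements = [
    (2.0, 0.1, 1.0, 0.05),
    (1.8, 0.4, 1.0, 0.20),
    (2.2, 0.2, 1.1, 0.10),
]

ratios, sigmas = zip(*(ratio_with_uncertainty(*m) for m in measurements))
w_mean, w_se = weighted_mean(ratios, sigmas)
u_mean = sum(ratios) / len(ratios)

print(f"unweighted mean ratio: {u_mean:.3f}")
print(f"weighted mean ratio:   {w_mean:.3f} +/- {w_se:.3f}")
```

The noisy second measurement pulls the unweighted mean toward 1.8, while the weighted mean stays close to the two well-determined ratios; this down-weighting of low-SNR spectra is the effect the abstract credits with improving statistical power.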
Affiliation(s)
- Jack J Miller
  - Department of Physiology, Anatomy & Genetics, University of Oxford, Oxford, UK
  - Department of Physics, Clarendon Laboratory, University of Oxford, Oxford, UK
- Lowri Cochlin
  - Department of Physiology, Anatomy & Genetics, University of Oxford, Oxford, UK
- Kieran Clarke
  - Department of Physiology, Anatomy & Genetics, University of Oxford, Oxford, UK
- Damian J Tyler
  - Department of Physiology, Anatomy & Genetics, University of Oxford, Oxford, UK
7
Gómez J, Vinaixa M, Rodríguez MA, Salek RM, Correig X, Cañellas N. Dolphin 1D: Improving Automation of Targeted Metabolomics in Multi-matrix Datasets of ¹H-NMR Spectra. ADVANCES IN INTELLIGENT SYSTEMS AND COMPUTING 2015. [DOI: 10.1007/978-3-319-19776-0_7] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 01/21/2023]
8
Gómez J, Brezmes J, Mallol R, Rodríguez MA, Vinaixa M, Salek RM, Correig X, Cañellas N. Dolphin: a tool for automatic targeted metabolite profiling using 1D and 2D 1H-NMR data. Anal Bioanal Chem 2014; 406:7967-76. [DOI: 10.1007/s00216-014-8225-6] [Citation(s) in RCA: 35] [Impact Index Per Article: 3.5] [Received: 07/30/2014] [Revised: 09/25/2014] [Accepted: 09/30/2014] [Indexed: 01/22/2023]
9
Data analysis and tissue type assignment for glioblastoma multiforme. BIOMED RESEARCH INTERNATIONAL 2014; 2014:762126. [PMID: 24724098 PMCID: PMC3958772 DOI: 10.1155/2014/762126] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/18/2013] [Revised: 01/13/2014] [Accepted: 01/23/2014] [Indexed: 11/18/2022]
Abstract
Glioblastoma multiforme (GBM) is characterized by high infiltration. The interpretation of MRSI data, especially for GBMs, is still challenging. Unsupervised methods based on NMF by Li et al. (2013, NMR in Biomedicine) and Li et al. (2013, IEEE Transactions on Biomedical Engineering) have been proposed for glioma recognition, but the tissue types are still not well interpreted. As an extension of the previous work, a tissue type assignment method is proposed for GBMs based on the analysis of MRSI data and tissue distribution information. The tissue type assignment method uses the values from the distribution maps of all three tissue types to interpret all the information in one new map and color encodes each voxel to indicate the tissue type. Experiments carried out on in vivo MRSI data show the feasibility of the proposed method. This method provides an efficient way for GBM tissue type assignment and helps to display information of MRSI data in a way that is easy to interpret.
10
Fertig EJ, Stein-O'Brien G, Jaffe A, Colantuoni C. Pattern identification in time-course gene expression data with the CoGAPS matrix factorization. Methods Mol Biol 2014; 1101:87-112. [PMID: 24233779 DOI: 10.1007/978-1-62703-721-1_6] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.8] [Indexed: 01/06/2023]
Abstract
Patterns in time-course gene expression data can represent the biological processes that are active over the measured time period. However, the orthogonality constraint in standard pattern-finding algorithms, including notably principal components analysis (PCA), confounds expression changes resulting from simultaneous, non-orthogonal biological processes. Previously, we have shown that Markov chain Monte Carlo nonnegative matrix factorization algorithms are particularly adept at distinguishing such concurrent patterns. One such matrix factorization is implemented in the software package CoGAPS. We describe the application of this software and several technical considerations for identification of age-related patterns in a public, prefrontal cortex gene expression dataset.
Affiliation(s)
- Elana J Fertig
  - Oncology Biostatistics and Bioinformatics, Johns Hopkins University, Baltimore, MD, USA
11
Riedel N, Berg J. Statistical mechanics approach to the sample deconvolution problem. PHYSICAL REVIEW. E, STATISTICAL, NONLINEAR, AND SOFT MATTER PHYSICS 2013; 87:042715. [PMID: 23679457 DOI: 10.1103/physreve.87.042715] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 01/22/2013] [Indexed: 06/02/2023]
Abstract
In a multicellular organism different cell types express a gene in different amounts. Samples from which gene expression levels can be measured typically contain a mixture of different cell types; the resulting measurements thus give only averages over the different cell types present. Based on fluctuations in the mixture proportions from sample to sample it is in principle possible to reconstruct the underlying expression levels of each cell type: to deconvolute the sample. We use a statistical mechanics approach to the problem of deconvoluting such partial concentrations from mixed samples, explore this approach using Markov chain Monte Carlo simulations, and give analytical results for when and how well samples can be unmixed.
Affiliation(s)
- N Riedel
  - Institut für Theoretische Physik, University of Cologne, Zülpicher Strasse 77, 50937 Köln, Germany
  - Sybacol, University of Cologne, Germany
12
Li Y, Sima DM, Cauter SV, Croitor Sava AR, Himmelreich U, Pi Y, Van Huffel S. Hierarchical non-negative matrix factorization (hNMF): a tissue pattern differentiation method for glioblastoma multiforme diagnosis using MRSI. NMR IN BIOMEDICINE 2013; 26:307-319. [PMID: 22972709 DOI: 10.1002/nbm.2850] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.6] [Received: 01/04/2012] [Revised: 08/04/2012] [Accepted: 08/06/2012] [Indexed: 06/01/2023]
Abstract
MRSI has shown potential in the diagnosis and prognosis of glioblastoma multiforme (GBM) brain tumors, but its use is limited by difficult data interpretation. When the analyzed MRSI data present more than two tissue patterns, conventional non-negative matrix factorization (NMF) implementation may lead to a non-robust estimation. The aim of this article is to introduce an effective approach for the differentiation of GBM tissue patterns using MRSI data. A hierarchical non-negative matrix factorization (hNMF) method that can blindly separate the most important spectral sources in short-TE ¹H MRSI data is proposed. This algorithm consists of several levels of NMF, where only two tissue patterns are computed at each level. The method is demonstrated on both simulated and in vivo short-TE ¹H MRSI data in patients with GBM. For the in vivo study, the accuracy of the recovered spectral sources was validated using expert knowledge. Results show that hNMF is able to accurately estimate the three tissue patterns present in the tumoral and peritumoral area of a GBM, i.e. normal, tumor and necrosis, thus providing additional useful information that can help in the diagnosis of GBM. Moreover, the hNMF results can be displayed as easily interpretable maps showing the contribution of each tissue pattern to each voxel.
Affiliation(s)
- Yuqian Li
  - School of Electronic Engineering, University of Electronic Science and Technology of China, Chengdu, China
13
Du S, Sajda P, Brown T, Stoyanova R. Recovery of Metabolomic Spectral Sources using Non-negative Matrix Factorization. CONFERENCE PROCEEDINGS : ... ANNUAL INTERNATIONAL CONFERENCE OF THE IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. ANNUAL CONFERENCE 2012; 2005:1095-8. [PMID: 17282379 DOI: 10.1109/iembs.2005.1616610] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/10/2022]
Abstract
¹H magnetic resonance spectra (MRS) of biofluids contain rich biochemical information about the metabolic status of an organism. Through the application of pattern recognition and classification algorithms, such data have been shown to provide information for disease diagnosis as well as the effects of potential therapeutics. In this paper we describe a novel approach, using non-negative matrix factorization (NMF), for rapidly identifying metabolically meaningful spectral patterns in ¹H MRS. We show that the intensities of these identified spectral patterns can be related to the onset of, and recovery from, toxicity in both a time-related and dose-related fashion. These patterns can be seen as a new type of biomarker for the biological effect under study. We demonstrate, using k-means clustering, that the recovered patterns can be used to characterize the metabolic status of the animal during the experiment.
Affiliation(s)
- Shuyan Du
  - Department of Biomedical Engineering, Columbia University, New York, NY, USA 10027
14
Ochs MF, Fertig EJ. Matrix Factorization for Transcriptional Regulatory Network Inference. IEEE SYMPOSIUM ON COMPUTATIONAL INTELLIGENCE IN BIOINFORMATICS AND COMPUTATIONAL BIOLOGY PROCEEDINGS. IEEE SYMPOSIUM ON COMPUTATIONAL INTELLIGENCE IN BIOINFORMATICS AND COMPUTATIONAL BIOLOGY 2012; 2012:387-396. [PMID: 25364782 DOI: 10.1109/cibcb.2012.6217256] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.7] [Indexed: 01/30/2023]
Abstract
Inference of Transcriptional Regulatory Networks (TRNs) provides insight into the mechanisms driving biological systems, especially mammalian development and disease. Many techniques have been developed for TRN estimation from indirect biochemical measurements. Although successful when initially tested in model organisms, these regulatory models often fail when applied to data from multicellular organisms where multiple regulation and gene reuse increase dramatically. Non-negative matrix factorization techniques were initially introduced to find non-orthogonal patterns in data, making them ideal techniques for inference in cases of multiple regulation. We review these techniques and their application to TRN analysis.
Affiliation(s)
- Michael F Ochs
  - School of Medicine, Johns Hopkins University, Baltimore, MD 21205
- Elana J Fertig
  - School of Medicine, Johns Hopkins University, Baltimore, MD 21205
15
Fertig EJ, Slebos R, Chung CH. Application of genomic and proteomic technologies in biomarker discovery. Am Soc Clin Oncol Educ Book 2012:377-382. [PMID: 24451767 DOI: 10.14694/edbook_am.2012.32.156] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Indexed: 06/03/2023]
Abstract
Sequencing of the human genome was completed in 2001. Building on the technology and experience of whole-exome sequencing, numerous cancer genomes have been sequenced, including head and neck squamous cell carcinoma (HNSCC) in 2011. Although DNA sequencing data reveals a complex genome with numerous mutations, the biologic interaction and clinical significance of the overall genetic aberrations are largely unknown. Comprehensive analyses of the tumors using genomics and proteomics beyond sequencing data can potentially accelerate the rate and number of biomarker discoveries to improve biology-driven classification of tumors for prognosis and patient selection for a specific therapy. In this review, we will summarize the current genomic and proteomic technologies, general biomarker-discovery paradigms using the technology and published data in HNSCC, including potential clinical applications and limitations.
Affiliation(s)
- Elana J Fertig
  - Department of Oncology, Johns Hopkins University School of Medicine, Baltimore, MD
  - Department of Cancer Biology, Vanderbilt University School of Medicine, Nashville, TN
- Robbert Slebos
  - Department of Oncology, Johns Hopkins University School of Medicine, Baltimore, MD
  - Department of Cancer Biology, Vanderbilt University School of Medicine, Nashville, TN
- Christine H Chung
  - Department of Oncology, Johns Hopkins University School of Medicine, Baltimore, MD
  - Department of Cancer Biology, Vanderbilt University School of Medicine, Nashville, TN
16
Tulpan D, Léger S, Belliveau L, Culf A, Cuperlović-Culf M. MetaboHunter: an automatic approach for identification of metabolites from 1H-NMR spectra of complex mixtures. BMC Bioinformatics 2011; 12:400. [PMID: 21999117 PMCID: PMC3213069 DOI: 10.1186/1471-2105-12-400] [Citation(s) in RCA: 68] [Impact Index Per Article: 5.2] [Received: 06/22/2011] [Accepted: 10/14/2011] [Indexed: 12/22/2022]
Abstract
Background: One-dimensional 1H-NMR spectroscopy is widely used for high-throughput characterization of metabolites in complex biological mixtures. However, the accurate identification of individual compounds is still a challenging task, particularly in spectral regions with higher peak densities. The need for automatic tools to facilitate and further improve the accuracy of such tasks, while using increasingly larger reference spectral libraries, becomes a priority of current metabolomics research. Results: We introduce a web server application, called MetaboHunter, which can be used for automatic assignment of 1H-NMR spectra of metabolites. MetaboHunter provides methods for automatic metabolite identification based on spectra or peak lists with three different search methods and with possibility for peak drift in a user defined spectral range. The assignment is performed using as reference libraries manually curated data from two major publicly available databases of NMR metabolite standard measurements (HMDB and MMCD). Tests using a variety of synthetic and experimental spectra of single and multi metabolite mixtures show that MetaboHunter is able to identify, on average, more than 80% of detectable metabolites from spectra of synthetic mixtures and more than 50% from spectra corresponding to experimental mixtures. This work also suggests that better scoring functions improve by more than 30% the performance of MetaboHunter's metabolite identification methods. Conclusions: MetaboHunter is a freely accessible, easy to use and user friendly 1H-NMR-based web server application that provides efficient data input and pre-processing, flexible parameter settings, fast and automatic metabolite fingerprinting and results visualization via intuitive plotting and compound peak hit maps. Compared to other published and freely accessible metabolomics tools, MetaboHunter implements three efficient methods to search for metabolites in manually curated data from two reference libraries.
Affiliation(s)
- Dan Tulpan
- Institute for Information Technology, National Research Council of Canada, Moncton, New Brunswick, E1A 7R1, Canada.
|
17
|
Orekhov VY, Jaravine VA. Analysis of non-uniformly sampled spectra with multi-dimensional decomposition. Progress in Nuclear Magnetic Resonance Spectroscopy 2011; 59:271-92. [PMID: 21920222 DOI: 10.1016/j.pnmrs.2011.02.002] [Citation(s) in RCA: 246] [Impact Index Per Article: 18.9] [Received: 01/04/2011] [Accepted: 02/21/2011] [Indexed: 05/04/2023]
Affiliation(s)
- Vladislav Yu Orekhov
- Swedish NMR Centre, University of Gothenburg, Box 465, 40530 Gothenburg, Sweden.
|
18
|
Zheng C, Zhang S, Ragg S, Raftery D, Vitek O. Identification and quantification of metabolites in (1)H NMR spectra by Bayesian model selection. Bioinformatics 2011; 27:1637-44. [PMID: 21398670 DOI: 10.1093/bioinformatics/btr118] [Citation(s) in RCA: 63] [Impact Index Per Article: 4.8] [Indexed: 11/13/2022]
Abstract
MOTIVATION Nuclear magnetic resonance (NMR) spectroscopy is widely used for high-throughput characterization of metabolites in complex biological mixtures. However, accurate interpretation of the spectra in terms of identities and abundances of metabolites can be challenging, in particular in crowded regions with heavy peak overlap. Although a number of computational approaches for this task have recently been proposed, they are not entirely satisfactory in either accuracy or extent of automation. RESULTS We introduce a probabilistic approach, Bayesian Quantification (BQuant), for fully automated database-based identification and quantification of metabolites in local regions of (1)H NMR spectra. The approach represents the spectra as mixtures of reference profiles from a database, and infers the identities and the abundances of metabolites by Bayesian model selection. We show using a simulated dataset, a spike-in experiment and a metabolomic investigation of plasma samples that BQuant outperforms the available automated alternatives in accuracy for both identification and quantification. AVAILABILITY The R package BQuant is available at: http://www.stat.purdue.edu/~ovitek/BQuant-Web/.
Affiliation(s)
- Cheng Zheng
- Department of Statistics, Purdue University, West Lafayette, IN 47907, USA.
|
19
|
Zhong M, Girolami M, Faulds K, Graham D. Bayesian methods to detect dye-labelled DNA oligonucleotides in multiplexed Raman spectra. J R Stat Soc Ser C Appl Stat 2011. [DOI: 10.1111/j.1467-9876.2010.00744.x] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Indexed: 10/18/2022]
|
20
|
Fertig EJ, Ding J, Favorov AV, Parmigiani G, Ochs MF. CoGAPS: an R/C++ package to identify patterns and biological process activity in transcriptomic data. Bioinformatics 2010; 26:2792-3. [PMID: 20810601 DOI: 10.1093/bioinformatics/btq503] [Citation(s) in RCA: 60] [Impact Index Per Article: 4.3] [Indexed: 11/14/2022]
Abstract
SUMMARY Coordinated Gene Activity in Pattern Sets (CoGAPS) provides an integrated package for isolating gene expression driven by a biological process, enhancing inference of biological processes from transcriptomic data. CoGAPS improves on other enrichment measurement methods by combining a Markov chain Monte Carlo (MCMC) matrix factorization algorithm (GAPS) with a threshold-independent statistic inferring activity on gene sets. The software is provided as open source C++ code built on top of JAGS software with an R interface. AVAILABILITY The R package CoGAPS and the C++ package GAPS-JAGS are provided open source under the GNU Lesser Public License (GLPL) with a users manual containing installation and operating instructions. CoGAPS is available through Bioconductor and depends on the rjags package available through CRAN to interface CoGAPS with GAPS-JAGS. URL: http://www.cancerbiostats.onc.jhmi.edu/cogaps.cfm .
Affiliation(s)
- Elana J Fertig
- Department of Oncology and Division of Oncology, Biostatistics and Bioinformatics, Sidney Kimmel Comprehensive Cancer Center, School of Medicine, Johns Hopkins University, Baltimore, MD 21205, USA.
|
21
|
Nonnegative matrix factorization with Gaussian process priors. Computational Intelligence and Neuroscience 2010:361705. [PMID: 18464923 PMCID: PMC2367383 DOI: 10.1155/2008/361705] [Citation(s) in RCA: 48] [Impact Index Per Article: 3.4] [Received: 10/31/2007] [Revised: 01/16/2008] [Accepted: 02/10/2008] [Indexed: 11/17/2022]
Abstract
We present a general method for including prior knowledge in a nonnegative matrix factorization (NMF), based on Gaussian process priors. We assume that the nonnegative factors in the NMF are linked by a strictly increasing function to an underlying Gaussian process specified by its covariance function. This allows us to find NMF decompositions that agree with our prior knowledge of the distribution of the factors, such as sparseness, smoothness, and symmetries. The method is demonstrated with an example from chemical shift brain imaging.
|
22
|
Abstract
Numerous methods have been applied to microarray data to group genes into clusters that show similar expression patterns. These methods assign each gene to a single group, which does not reflect the widely held view among biologists that most, if not all, genes in eukaryotes are involved in multiple biological processes and therefore will be multiply regulated. Here, we review several methods of matrix factorisation that identify patterns of behaviour in transcriptional response and assign genes to multiple patterns. We focus on these methods rather than traditional clustering methods applied to microarray data, which assign one gene to one cluster.
Affiliation(s)
| | - Michael F. Ochs
- Department of Oncology, Johns Hopkins University, 550 North Broadway, Suite 1103, Baltimore, MD 21205 USA
|
23
|
Ochs MF. Knowledge-based data analysis comes of age. Brief Bioinform 2010; 11:30-9. [PMID: 19854753 PMCID: PMC3700349 DOI: 10.1093/bib/bbp044] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.0] [Received: 07/08/2009] [Revised: 09/03/2009] [Indexed: 12/16/2022] Open
Abstract
The emergence of high-throughput technologies for measuring biological systems has introduced problems for data interpretation that must be addressed for proper inference. First, analysis techniques need to be matched to the biological system, reflecting in their mathematical structure the underlying behavior being studied. When this is not done, mathematical techniques will generate answers, but the values and reliability estimates may not accurately reflect the biology. Second, analysis approaches must address the vast excess in variables measured (e.g. transcript levels of genes) over the number of samples (e.g. tumors, time points), known as the 'large-p, small-n' problem. In large-p, small-n paradigms, standard statistical techniques generally fail, and computational learning algorithms are prone to overfit the data. Here we review the emergence of techniques that match mathematical structure to the biology, the use of integrated data and prior knowledge to guide statistical analysis, and the recent emergence of analysis approaches utilizing simple biological models. We show that novel biological insights have been gained using these techniques.
Affiliation(s)
- Michael F Ochs
- Division of Oncology Biostatistics and Bioinformatics, 550 North Broadway, Suite 1103, Johns Hopkins University, Baltimore, MD 21205, USA.
|
24
|
Ochs MF, Rink L, Tarn C, Mburu S, Taguchi T, Eisenberg B, Godwin AK. Detection of treatment-induced changes in signaling pathways in gastrointestinal stromal tumors using transcriptomic data. Cancer Res 2009; 69:9125-32. [PMID: 19903850 DOI: 10.1158/0008-5472.can-09-1709] [Citation(s) in RCA: 46] [Impact Index Per Article: 3.1] [Indexed: 12/21/2022]
Abstract
Cell signaling plays a central role in the etiology of cancer. Numerous therapeutics in use or under development target signaling proteins; however, off-target effects often limit assignment of positive clinical response to the intended target. As direct measurements of signaling protein activity are not generally feasible during treatment, there is a need for more powerful methods to determine if therapeutics inhibit their targets and when off-target effects occur. We have used the Bayesian Decomposition algorithm and data on transcriptional regulation to create a novel methodology, Differential Expression for Signaling Determination (DESIDE), for inferring signaling activity from microarray measurements. We applied DESIDE to deduce signaling activity in gastrointestinal stromal tumor cell lines treated with the targeted therapeutic imatinib mesylate (Gleevec). We detected the expected reduced activity in the KIT pathway, as well as unexpected changes in the p53 pathway. Pursuing these findings, we have determined that imatinib-induced DNA damage is responsible for the increased activity of p53, identifying a novel off-target activity for this drug. We then used DESIDE on data from resected, post-imatinib treatment tumor samples and identified a pattern in these tumors similar to that at late time points in the cell lines, and this pattern correlated with initial clinical response. The pattern showed increased activity of ETS domain-containing protein Elk-1 and signal transducers and activators of transcription 3 transcription factors, which are associated with the growth of side population cells. DESIDE infers the global reprogramming of signaling networks during treatment, permitting treatment modification that leverages ongoing drug development efforts, which is crucial for personalized medicine.
Affiliation(s)
- Michael F Ochs
- Division of Oncology Biostatistics and Bioinformatics, Johns Hopkins University, Baltimore, Maryland 21205, USA.
|
25
|
Kossenkov AV, Ochs MF. Matrix factorization for recovery of biological processes from microarray data. Methods Enzymol 2009; 467:59-77. [PMID: 19897089 PMCID: PMC2997652 DOI: 10.1016/s0076-6879(09)67003-8] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.2] [Indexed: 01/26/2023]
Abstract
We explore a number of matrix factorization methods in terms of their ability to identify signatures of biological processes in a large gene expression study. We focus on the ability of these methods to find signatures in terms of gene ontology enhancement and on the interpretation of these signatures in the samples. Two Bayesian approaches, Bayesian Decomposition (BD) and Bayesian Factor Regression Modeling (BFRM), perform best. Differences in the strength of the signatures between the samples suggest that BD will be most useful for systems modeling and BFRM for biomarker discovery.
Affiliation(s)
- Andrew V. Kossenkov
- The Wistar Institute, 3601 Spruce Street, R214, Philadelphia, PA, 19104, Phone: 215-495-6898, Fax: 215-898-4521,
|
26
|
Schmidt MN, Winther O, Hansen LK. Bayesian Non-negative Matrix Factorization. Independent Component Analysis and Signal Separation 2009. [DOI: 10.1007/978-3-642-00599-2_68] [Citation(s) in RCA: 94] [Impact Index Per Article: 6.3] [Indexed: 12/03/2022]
|
27
|
Moussaoui S, Hauksdóttir H, Schmidt F, Jutten C, Chanussot J, Brie D, Douté S, Benediktsson J. On the decomposition of Mars hyperspectral data by ICA and Bayesian positive source separation. Neurocomputing 2008. [DOI: 10.1016/j.neucom.2007.07.034] [Citation(s) in RCA: 105] [Impact Index Per Article: 6.6] [Indexed: 11/25/2022]
|
28
|
Du S, Mao X, Sajda P, Shungu DC. Automated tissue segmentation and blind recovery of (1)H MRS imaging spectral patterns of normal and diseased human brain. NMR in Biomedicine 2008; 21:33-41. [PMID: 17347991 DOI: 10.1002/nbm.1151] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.5] [Indexed: 05/14/2023]
Abstract
Constrained non-negative matrix factorization (cNMF) with iterative data selection is described and demonstrated as a data analysis method for fast and automatic recovery of biochemically meaningful and diagnostically specific spectral patterns of the human brain from (1)H MRS imaging ((1)H MRSI) data. To achieve this goal, cNMF decomposes in vivo multidimensional (1)H MRSI data into two non-negative matrices representing (a) the underlying tissue-specific spectral patterns and (b) the spatial distribution of the corresponding metabolite concentrations. Central to the proposed approach is automatic iterative data selection which uses prior knowledge about the spatial distribution of the spectra to remove voxels that are due to artifacts and undesired metabolites/tissues such as the strong lipid and water components. The automatic recovery of diagnostic spectral patterns is demonstrated for long-TE (1)H MRSI data on normal human brain, multiple sclerosis, and serial brain tumor. The results show the ability of cNMF with iterative data selection to automatically and simultaneously recover tissue-specific spectral patterns and achieve segmentation of normal and diseased human brain tissue, concomitant with simplification of information content. These features of cNMF, which permit rapid recovery, reduction and interpretation of the complex diagnostic information content of large multi-dimensional spectroscopic imaging data sets, have the potential to enhance the clinical utility of in vivo (1)H MRSI.
Affiliation(s)
- Shuyan Du
- Department of Biomedical Engineering, Columbia University, New York, NY, USA
|
29
|
Vehtari A, Mäkinen VP, Soininen P, Ingman P, Mäkelä SM, Savolainen MJ, Hannuksela ML, Kaski K, Ala-Korpela M. A novel Bayesian approach to quantify clinical variables and to determine their spectroscopic counterparts in 1H NMR metabonomic data. BMC Bioinformatics 2007; 8 Suppl 2:S8. [PMID: 17493257 PMCID: PMC1892077 DOI: 10.1186/1471-2105-8-s2-s8] [Citation(s) in RCA: 49] [Impact Index Per Article: 2.9] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND A key challenge in metabonomics is to uncover quantitative associations between multidimensional spectroscopic data and biochemical measures used for disease risk assessment and diagnostics. Here we focus on clinically relevant estimation of lipoprotein lipids by 1H NMR spectroscopy of serum. RESULTS A Bayesian methodology, with a biochemical motivation, is presented for a real 1H NMR metabonomics data set of 75 serum samples. Lipoprotein lipid concentrations were independently obtained for these samples via ultracentrifugation and specific biochemical assays. The Bayesian models were constructed by Markov chain Monte Carlo (MCMC) and they showed remarkably good quantitative performance, the predictive R-values being 0.985 for the very low density lipoprotein triglycerides (VLDL-TG), 0.787 for the intermediate, 0.943 for the low, and 0.933 for the high density lipoprotein cholesterol (IDL-C, LDL-C and HDL-C, respectively). The modelling produced a kernel-based reformulation of the data, the parameters of which coincided with the well-known biochemical characteristics of the 1H NMR spectra; particularly for VLDL-TG and HDL-C the Bayesian methodology was able to clearly identify the most characteristic resonances within the heavily overlapping information in the spectra. For IDL-C and LDL-C the resulting model kernels were more complex than those for VLDL-TG and HDL-C, probably reflecting the severe overlap of the IDL and LDL resonances in the 1H NMR spectra. CONCLUSION The systematic use of Bayesian MCMC analysis is computationally demanding. Nevertheless, the combination of high-quality quantification and the biochemical rationale of the resulting models is expected to be useful in the field of metabonomics.
Affiliation(s)
- Aki Vehtari
- Laboratory of Computational Engineering, Systems Biology and Bioinformation Technology, Helsinki University of Technology, P.O. Box 9203, FI-02015 HUT, Finland
| | - Ville-Petteri Mäkinen
- Laboratory of Computational Engineering, Systems Biology and Bioinformation Technology, Helsinki University of Technology, P.O. Box 9203, FI-02015 HUT, Finland
| | - Pasi Soininen
- Department of Chemistry, University of Kuopio, P.O. Box 1627, FI-70211 Kuopio, Finland
| | - Petri Ingman
- Department of Chemistry, Instrument Centre, Vatselankatu 2, FI-20014 University of Turku, Turku, Finland
| | - Sanna M Mäkelä
- Department of Internal Medicine, Clinical Research Center, University of Oulu, P.O. Box 5000, FI-90014 Oulu, Finland
| | - Markku J Savolainen
- Department of Internal Medicine, Clinical Research Center, University of Oulu, P.O. Box 5000, FI-90014 Oulu, Finland
| | - Minna L Hannuksela
- Department of Internal Medicine, Clinical Research Center, University of Oulu, P.O. Box 5000, FI-90014 Oulu, Finland
| | - Kimmo Kaski
- Laboratory of Computational Engineering, Systems Biology and Bioinformation Technology, Helsinki University of Technology, P.O. Box 9203, FI-02015 HUT, Finland
| | - Mika Ala-Korpela
- Laboratory of Computational Engineering, Systems Biology and Bioinformation Technology, Helsinki University of Technology, P.O. Box 9203, FI-02015 HUT, Finland
|
30
|
Abstract
Expression data from knockout mutants is a powerful tool for gene function inference, permitting observation of the phenotype of a deleted gene on the organismal scale. A computational method is demonstrated herein to assess gene function from gene expression measured in deletion mutants using Bayesian decomposition, a matrix factorization technique that permits the extraction of patterns and functional units from the data, i.e., sets of genes belonging to the same pathways shared by sets of knockout mutants. ClutrFree, a cluster visualization program, is used to aid in the interpretation of functional units and the assessment of gene functions for a subset of unknown genes.
Affiliation(s)
- Ghislain Bidaut
- University of Pennsylvania School of Medicine, Philadelphia, USA
|
31
|
Abstract
Machine learning offers a principled approach for developing sophisticated, automatic, and objective algorithms for analysis of high-dimensional and multimodal biomedical data. This review focuses on several advances in the state of the art that have shown promise in improving detection, diagnosis, and therapeutic monitoring of disease. Key in the advancement has been the development of a more in-depth understanding and theoretical analysis of critical issues related to algorithmic construction and learning theory. These include trade-offs for maximizing generalization performance, use of physically realistic constraints, and incorporation of prior knowledge and uncertainty. The review describes recent developments in machine learning, focusing on supervised and unsupervised linear methods and Bayesian inference, which have made significant impacts in the detection and diagnosis of disease in biomedicine. We describe the different methodologies and, for each, provide examples of their application to specific domains in biomedical diagnostics.
Affiliation(s)
- Paul Sajda
- Department of Biomedical Engineering, Columbia University, New York, NY 10027, USA.
|
32
|
Bidaut G, Suhre K, Claverie JM, Ochs MF. Determination of strongly overlapping signaling activity from microarray data. BMC Bioinformatics 2006; 7:99. [PMID: 16507110 PMCID: PMC1413561 DOI: 10.1186/1471-2105-7-99] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.3] [Received: 09/22/2005] [Accepted: 02/28/2006] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND As numerous diseases involve errors in signal transduction, modern therapeutics often target proteins involved in cellular signaling. Interpretation of the activity of signaling pathways during disease development or therapeutic intervention would assist in drug development, design of therapy, and target identification. Microarrays provide a global measure of cellular response; however, linking these responses to signaling pathways requires an analytic approach tuned to the underlying biology. An ongoing issue in pattern recognition in microarrays has been how to determine the number of patterns (or clusters) to use for data interpretation, and this is a critical issue as measures of statistical significance in gene ontology or pathways rely on proper separation of genes into groups. RESULTS Here we introduce a method relying on gene annotation coupled to decompositional analysis of global gene expression data that allows us to estimate specific activity on strongly coupled signaling pathways and, in some cases, activity of specific signaling proteins. We demonstrate the technique using the Rosetta yeast deletion mutant data set, decompositional analysis by Bayesian Decomposition, and annotation analysis using ClutrFree. We determined from measurements of gene persistence in patterns across multiple potential dimensionalities that 15 basis vectors provide the correct dimensionality for interpreting the data. Using gene ontology and data on gene regulation in the Saccharomyces Genome Database, we identified the transcriptional signatures of several cellular processes in yeast, including cell wall creation, ribosomal disruption, chemical blocking of protein synthesis, and, critically, individual signatures of the strongly coupled mating and filamentation pathways. CONCLUSION This work demonstrates that microarray data can provide downstream indicators of pathway activity either through use of gene ontology or transcription factor databases. This can be used to investigate the specificity and success of targeted therapeutics as well as to elucidate signaling activity in normal and disease processes.
Affiliation(s)
- Ghislain Bidaut
- Fox Chase Cancer Center, 333 Cottman Avenue, Philadelphia, PA, 19111, USA
- Structural and Genomic Information Laboratory, UPR2589-CNRS, 13288 Marseille, France
- Center for Bioinformatics, Department of Genetics, University of Pennsylvania School of Medicine, 1423 Blockley Hall, 423 Guardian Drive, Philadelphia, PA 19104-6021, USA
| | - Karsten Suhre
- Structural and Genomic Information Laboratory, UPR2589-CNRS, 13288 Marseille, France
| | - Jean-Michel Claverie
- Structural and Genomic Information Laboratory, UPR2589-CNRS, 13288 Marseille, France
| | - Michael F Ochs
- Fox Chase Cancer Center, 333 Cottman Avenue, Philadelphia, PA, 19111, USA
|
33
|
Bidaut G, Suhre K, Claverie JM, Ochs MF. Bayesian decomposition analysis of bacterial phylogenomic profiles. Am J Pharmacogenomics 2005; 5:63-70. [PMID: 15727490 DOI: 10.2165/00129785-200505010-00006] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 11/02/2022]
Abstract
BACKGROUND The past two decades have seen the appearance of new infectious diseases and the reemergence of old diseases previously thought to be under control. At the same time, the effectiveness of the existing antibacterials is rapidly decreasing due to the spread of multidrug-resistant pathogens. AIM The aim of this study was to identify candidate molecular targets (e.g. enzymes) within essential metabolic pathways specific to a significant subset of bacterial pathogens as the first step in the rational design of new antibacterial drugs. METHODS We constructed a dataset of phylogenomic profiles (vectors that encode the similarity, measured by BLAST scores, of a gene across many species) for a series of 31 pathogenic bacteria of interest with 1073 genes taken from the reference organisms Escherichia coli and Mycobacterium tuberculosis. We applied Bayesian Decomposition, a matrix decomposition algorithm, to identify functional metabolic units comprising overlapping sets of genes in this dataset. RESULTS Although no information on phylogeny was provided to the system, Bayesian Decomposition retrieved the known bacteria phylogenic relationships on the basis of the proteins necessary for survival. In addition, a set of genes required by all bacteria was identified, as well as components and enzymes specific to subsets of bacteria. CONCLUSION The use of phylogenomic profiles and Bayesian Decomposition provides important insights for the design of new antibacterial therapeutics.
Affiliation(s)
- Ghislain Bidaut
- Division of Population Science, Bioinformatics, Fox Chase Cancer Center, Philadelphia, PA 19111, USA
|
34
|
Sajda P, Du S, Brown TR, Stoyanova R, Shungu DC, Mao X, Parra LC. Nonnegative matrix factorization for rapid recovery of constituent spectra in magnetic resonance chemical shift imaging of the brain. IEEE Transactions on Medical Imaging 2004; 23:1453-65. [PMID: 15575404 DOI: 10.1109/tmi.2004.834626] [Citation(s) in RCA: 41] [Impact Index Per Article: 2.1] [Indexed: 05/04/2023]
Abstract
We present an algorithm for blindly recovering constituent source spectra from magnetic resonance (MR) chemical shift imaging (CSI) of the human brain. The algorithm, which we call constrained nonnegative matrix factorization (cNMF), does not enforce independence or sparsity, instead only requiring the source and mixing matrices to be nonnegative. It is based on the nonnegative matrix factorization (NMF) algorithm, extending it to include a constraint on the positivity of the amplitudes of the recovered spectra. This constraint enables recovery of physically meaningful spectra even in the presence of noise that causes a significant number of the observation amplitudes to be negative. We demonstrate and characterize the algorithm's performance using 31P volumetric brain data, comparing the results with two different blind source separation methods: Bayesian spectral decomposition (BSD) and nonnegative sparse coding (NNSC). We then incorporate the cNMF algorithm into a hierarchical decomposition framework, showing that it can be used to recover tissue-specific spectra given a processing hierarchy that proceeds coarse-to-fine. We demonstrate the hierarchical procedure on 1H brain data and conclude that the computational efficiency of the algorithm makes it well-suited for use in diagnostic work-up.
Affiliation(s)
- Paul Sajda
- Laboratory of Intelligent Imaging and Neural Computing, Department of Biomedical Engineering, Columbia University, 351 Engineering Terrace Building, Mail Code 8904, 1210 Amsterdam Ave., New York, NY 10027, USA.
|
35
|
Ochs MF, Moloshok TD, Bidaut G, Toby G. Bayesian decomposition: analyzing microarray data within a biological context. Ann N Y Acad Sci 2004; 1020:212-26. [PMID: 15208194 DOI: 10.1196/annals.1310.018] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.4] [Indexed: 11/12/2022]
Abstract
The detection and correct identification of cancer, especially at an early stage, are vitally important for patient survival and quality of life. Since signaling pathways play critical roles in cancer development and metastasis, methods that reliably assess the activity of these pathways are critical to understand cancer and the response to therapy. Bayesian Decomposition (BD) identifies signatures of expression that can be linked directly to signaling pathway activity, allowing the changes in mRNA levels to be used as downstream indicators of pathway activity. Here, we demonstrate this ability by identifying the downstream expression signal associated with the mating response in Saccharomyces cerevisiae and showing that this signal disappears in deletion mutants of genes critical to the MAPK signaling cascade used to trigger the response. We also show the use of BD in the context of supervised learning, by analyzing the Mus musculus tissue-specific data set provided by Project Normal. The algorithm correctly removes routine metabolic processes, allowing tissue-specific signatures of expression to be identified. Gene ontology is used to interpret these signatures. Since a number of modern therapeutics specifically target signaling proteins, it is important to be able to identify changes in signaling pathways in order to use microarray data to interpret cancer response. By removing routine metabolic signatures and linking specific signatures to signaling pathway activity, BD makes it possible to link changes in microarray results to signaling pathways.
Affiliation(s)
- Michael F Ochs
- Bioinformatics, Division of Population Science, Fox Chase Cancer Center, 333 Cottman Avenue, Philadelphia, PA 19111, USA.
|
36
|
Ladroue C, Howe FA, Griffiths JR, Tate AR. Independent component analysis for automated decomposition of in vivo magnetic resonance spectra. Magn Reson Med 2004; 50:697-703. [PMID: 14523954 DOI: 10.1002/mrm.10595] [Citation(s) in RCA: 50] [Impact Index Per Article: 2.5] [Indexed: 11/08/2022]
Abstract
Fully automated methods for analyzing MR spectra would be of great benefit for clinical diagnosis, in particular for the extraction of relevant information from large databases for subsequent pattern recognition analysis. Independent component analysis (ICA) provides a means of decomposing signals into their constituent components. This work investigates the use of ICA for automatically extracting features from in vivo MR spectra. After its limits are assessed on artificial data, the method is applied to a set of brain tumor spectra. ICA automatically, and in an unsupervised fashion, decomposes the signals into interpretable components. Moreover, the spectral decomposition achieved by the ICA leads to the separation of some tissue types, which confirms the biochemical relevance of the components.
Affiliation(s)
- Christophe Ladroue
- CR-UK Biomedical Magnetic Resonance Research Group, Basic Medical Sciences Department, London, UK.
|
37
|
|
38
|
Stoyanova R, Brown TR. NMR spectral quantitation by principal component analysis. NMR in Biomedicine 2001; 14:271-277. [PMID: 11410945 DOI: 10.1002/nbm.700] [Citation(s) in RCA: 46] [Impact Index Per Article: 2.0] [Indexed: 05/23/2023]
Abstract
The use of principal component analysis (PCA) for simultaneous spectral quantitation of a single resonant peak across a series of spectra has gained popularity among the NMR community. The approach is fast, requires no assumptions regarding the peak lineshape and provides quantitation even for peaks with very low signal-to-noise ratio. PCA produces estimates of all peak parameters: area, frequency, phase and linewidth. If desired, these estimates can be used to correct the original data so that the peak in all spectra has the same lineshape. This ability makes PCA useful not only for direct peak quantitation, but also for processing spectral data prior to application of pattern recognition/classification techniques. This article briefly reviews the theoretical basis of PCA for spectral quantitation, addresses issues of data processing prior to PCA, describes suitable and unsuitable datasets for PCA applications and summarizes the developments and the limitations of the method.
Affiliation(s)
- R Stoyanova
- Fox Chase Cancer Center, 7701 Burholme Avenue, Philadelphia, PA 19111, USA
|
39
|
Abstract
A multiscale approach for analyzing in vivo magnetic resonance spectroscopic imaging (SI) data is described in this paper. With this method, fitting is performed at multiple spatial scales in a coarse-to-fine order. Results obtained at one scale are used as prior knowledge in fitting spectra at the next scale. The multiscale approach was validated with simulated data and demonstrated with proton SI datasets of the human brain. The results showed that this method improved the robustness and efficiency of the fitting and facilitated the automatic analysis of in vivo SI data.
Affiliation(s)
- X Zhang
- Department of Radiology, University of Minnesota Medical School, Minneapolis 55455, USA
|
40
|
Miskin J, MacKay DJC. Ensemble Learning for Blind Image Separation and Deconvolution. Advances in Independent Component Analysis 2000. [DOI: 10.1007/978-1-4471-0443-8_7] [Citation(s) in RCA: 73] [Impact Index Per Article: 3.0] [Indexed: 12/12/2022]
|